Auto-Suggest From Popular Queries Using EdgeNGrams | SearchHub, brought to you by LucidWorks

Jay Taylor's notes


Tags: elasticsearch lucene search auto-suggest n-grams edge-n-grams searchhub.org

Clipped on: 2012-11-01

A popular feature of most modern search applications is the auto-suggest or auto-complete feature where, as a user types their query into a text box, suggestions of popular queries are presented. As each additional character is typed in by the user the list of suggestions is refined. There are several different approaches in Solr to provide this functionality, but we will be looking at an approach that involves using EdgeNGrams as part of the analysis chain. Two other approaches are to use either the TermsComponent (new in Solr 1.4) or faceting.

N-grams and Edge N-grams

An N-gram is an n-character substring of a longer sequence of characters. For example, the term “cash” is composed of the following n-grams:

unigrams: “c”, “a”, “s”, “h”
bigrams: “ca”, “as”, “sh”
trigrams: “cas”, “ash”
4-grams: “cash”

N-grams can be useful when substrings of terms need to be searched. An Edge n-gram is an n-gram built from one side or edge of a term. Edge n-grams for the term “cash” would be:

“c”, “ca”, “cas”, “cash”

It’s easy to see how edge n-grams could be used to suggest queries as a user types in a search query character by character.
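The two definitions above can be sketched in a few lines of Python. This is only an illustration of the concept, not how Lucene implements its n-gram filters; the function names are made up for this example.

```python
def ngrams(term, n):
    """Return every contiguous n-character substring of term, left to right."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def edge_ngrams(term, min_size=1, max_size=None):
    """Return the prefixes of term whose lengths fall in [min_size, max_size]."""
    if max_size is None:
        max_size = len(term)
    longest = min(max_size, len(term))
    return [term[:n] for n in range(min_size, longest + 1)]


print(ngrams("cash", 2))    # ['ca', 'as', 'sh']
print(edge_ngrams("cash"))  # ['c', 'ca', 'cas', 'cash']
```

Every prefix a user could type while entering “cash” appears in the edge n-gram list, which is exactly the property an auto-suggest index needs.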

An Overview of the Process

In order to provide query suggestions we will need to have typical queries entered by users available in a Solr index. It is a good practice to capture and analyze the queries that are being entered by the users of a search application. Ideally you might have a scheduled process to parse your Solr output logs to capture queries entered by users of the application. The queries might be stored in a relational database where they could be analyzed independently of your running production Solr instances. Another benefit of storing queries (and the number of times they have been entered) is that it is then possible to use Solr’s DataImportHandler to build an index of query information that can be used to power an auto-suggest feature. You might design an auto-suggest index as a separate core hosted in a single Solr instance along with a core for your main index. Note that it is probably not worth indexing queries that return zero results since we won’t want to suggest those to a user, so we’ll include a boolean field to let us know which queries return results. A minimal table design for our needs might look like this:

create table autosuggest (
 query varchar(250),
 hasResults boolean,
 count int);

More metadata could certainly be added as needed for reporting and analysis; these are just the columns we will need to demonstrate how to build the auto-suggest feature. As queries are parsed from the log files, the process will need to query the main Solr index to determine whether each query has one or more results in order to populate the hasResults column.
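As a rough sketch of that pre-processing step, the following Python collapses a list of raw query strings into rows shaped like the autosuggest table. The function name is hypothetical, lowercasing and trimming the queries before counting is an assumption of this example, and the has_results callable stands in for whatever code asks the main Solr index whether a query matches anything.

```python
from collections import Counter


def build_autosuggest_rows(queries, has_results):
    """Collapse raw query strings into (query, hasResults, count) rows.

    queries: iterable of query strings already extracted from the logs.
    has_results: callable(query) -> bool; a stand-in for a lookup against
    the main Solr index (assumption of this sketch).
    """
    # Normalize so that "I'm Not Down" and "i'm not down " count as one query.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    # most_common() yields the queries ordered from most to least frequent.
    return [(q, has_results(q), n) for q, n in counts.most_common()]


raw = ["I'm Not Down", "i'm not down ", "zzzz", ""]
rows = build_autosuggest_rows(raw, has_results=lambda q: q != "zzzz")
print(rows)  # [("i'm not down", True, 2), ('zzzz', False, 1)]
```

The resulting rows could then be written to the table with whatever database client you already use.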

Typically an AJAX front-end would be used to query the “auto-suggest” index. For the purposes of this article we won’t deal with how to build the AJAX component, nor will we go into details about the pre-processing involved in building the database table. Instead we will focus on how to configure Solr to index the queries and then search the index, with responses written in JSON format using Solr’s JSON response writer.

In general the steps we will take are:

1. Parse log files to get a list of queries entered by users and load those queries (and any other useful metadata) into a database table. This should be an ongoing scheduled process.
2. Configure schema.xml.
3. Configure a dih-config.xml file.
4. Do a full-import with the DataImportHandler.
5. Build queries that can be used by an AJAX client.

Configuring schema.xml

We will need to define a fieldType that doesn’t tokenize the query and does very minimal analysis. We can use the KeywordTokenizerFactory, which will leave our queries intact as a single term. We’ll include the LowerCaseFilterFactory to make matching case-insensitive, and then, most importantly, we will include the EdgeNGramFilterFactory. The EdgeNGramFilterFactory will break up each stored query into a series of edge n-grams. We need to specify a minGramSize and a maxGramSize. For this example we’ll set the minimum to “1” and the maximum to “25”. Note that storing n-grams in an index will require some extra storage, but since our “auto-suggest” index will only contain two fields (and only one of them stored), and since we can probably expect a document count that is not too high, this should not be a problem.

Here is our definition for a fieldType to accomplish what we need:

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.KeywordTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldType>
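To see why the index and query analyzers differ, here is a small Python simulation of the two chains. It is a simplified model rather than Solr’s actual code: the function names are invented for this sketch, and Python’s lower() stands in for the LowerCaseFilterFactory.

```python
def index_terms(text, min_gram=1, max_gram=25):
    """Mimic the index analyzer: one keyword token, lowercased, edge n-grammed."""
    token = text.lower()
    longest = min(max_gram, len(token))
    return {token[:n] for n in range(min_gram, longest + 1)}


def query_term(text):
    """Mimic the query analyzer: one keyword token, lowercased, no n-grams."""
    return text.lower()


def matches(stored_query, typed):
    """A document matches when the single query term is one of its indexed grams."""
    return query_term(typed) in index_terms(stored_query)


print(matches("I'm Not Down", "i'm n"))  # True: "i'm n" is an indexed prefix
print(matches("I'm Not Down", "not"))    # False: only prefixes are indexed
```

Because the typed text is not n-grammed at query time, it has to equal one of the indexed prefixes exactly. In this model, a typed string longer than maxGramSize (25 characters here) can never match, since no prefix that long was indexed; that is worth keeping in mind when choosing maxGramSize.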

We should only need two fields defined: “user_query”, which holds the query and its n-grams, and “count”, which is the number of times a query has been found in the logs. The count is what we sort on to present more popular queries higher in the suggested list:

<field name="user_query" type="edgytext"
  indexed="true" stored="true" omitNorms="true"
  omitTermFreqAndPositions="true" />
<field name="count" type="int" indexed="true"
  stored="false" omitNorms="true"
  omitTermFreqAndPositions="true" />
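The way these two fields work together at query time can be modeled in plain Python: filter on the prefix, then order by count. This is a toy stand-in for what Solr does with the index, and the counts below are made-up illustration values, not data from a real log.

```python
def suggest(rows, typed, limit=10):
    """rows: iterable of (user_query, count) pairs; returns the top matches."""
    prefix = typed.lower()
    hits = [(q, c) for q, c in rows if q.lower().startswith(prefix)]
    # Most popular queries first, mirroring a descending sort on count.
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [q for q, _ in hits[:limit]]


# Hypothetical counts, for illustration only.
sample = [
    ("i'm not down", 40),
    ("i'm glad", 55),
    ("i'm not in love", 12),
    ("london calling", 90),
]
print(suggest(sample, "i'm n"))  # ["i'm not down", "i'm not in love"]
```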

Configuring dih-config.xml

Solr’s DataImportHandler (DIH) is a fast and efficient tool for indexing data from a relational database. There are two places where configuration needs to be done to enable the DIH. First, we need to set up a request handler in solrconfig.xml:

<requestHandler name="/indexer/autosuggest"
    class="org.apache.solr.handler.dataimport.DataImportHandler">
 <lst name="defaults">
 <str name="config">dih-config.xml</str>
 </lst>
 <lst name="invariants">
 <str name="optimize">false</str>
 </lst>
</requestHandler>

Then we need to create the dih-config.xml indicated in the request handler configuration. Create a file named “dih-config.xml” in the same directory as solrconfig.xml and schema.xml with the following contents:

<?xml version="1.0"?>
<dataConfig>
 <dataSource
   type="JdbcDataSource"
   readOnly="true"
   driver="com.mysql.jdbc.Driver"
   url="jdbc:mysql://localhost:3306/myDatabase"
   user="user"
   password="password"/>

  <document name="autoSuggester">
    <entity name="main"
       query="select query, count from autosuggest where hasResults = true">
     <field column="query" name="user_query"/>
     <field column="count" name="count"/>
   </entity>
 </document>
</dataConfig>

The dataConfig is pretty straightforward. We set up a connection to our datasource and then run a simple select query to get all records that have results. It would also be possible to set up a delta-import query but depending on how many user-entered queries you are indexing it may not be necessary. The DIH is very fast, and my tests were able to index about 75,000 documents a minute (on a MacBook Pro with a 2GB heap size), so running a full index once or several times a day may be easier. If you want to set up delta queries you would have to add a date field to the table that could be included in the delta queries. (For more details on DIH configuration see the wiki: http://wiki.apache.org/solr/DataImportHandler)

Indexing The Data

Now we’re ready to index the data in the autosuggest table. This can be done in two different ways:

Using a utility like “curl” or “wget”: curl 'http://localhost:8983/solr/indexer/autosuggest?command=full-import' (note: when developing and debugging you can add the parameter “&rows=10”, or whatever number you would like, to limit the import.)
From the dataimport.jsp page: http://localhost:8983/solr/admin/dataimport.jsp. You should see a link to the handler defined in solrconfig.xml (which we named /indexer/autosuggest). Click on that link. The dih-config.xml file is displayed along with various buttons for different operations. Clicking on “Full-import” will fire off the full import process.
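If the import is going to be triggered from a scheduled script rather than by hand, the request URL can be assembled in Python. This is a minimal sketch that only builds the URL; the helper name is made up, and the host, port, and handler path are the ones assumed throughout this article.

```python
from urllib.parse import urlencode


def dih_url(base, handler, command, **params):
    """Build a DataImportHandler request URL for the given command."""
    query = urlencode({"command": command, **params})
    return f"{base.rstrip('/')}/{handler.strip('/')}?{query}"


url = dih_url("http://localhost:8983/solr", "/indexer/autosuggest",
              "full-import", rows=10)
print(url)
# http://localhost:8983/solr/indexer/autosuggest?command=full-import&rows=10
```

Passing the resulting URL to urllib.request.urlopen would send the request, in the same way as the curl command.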

Now we should have documents in our auto-suggest index. We can use the analysis page from the admin console (http://localhost:8983/solr/admin/analysis.jsp) to get a look at how the n-grams are created for a query. From the analysis page select “type” from the “Field” drop-down box and enter a value of “edgytext”, the fieldType we defined in schema.xml. Check “verbose output” in the “Field value” section and enter “i’m not down” in the text area. Click “Analyze” and observe how the EdgeNGramFilterFactory breaks up our query into n-grams:

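[Screenshot: analysis page output for the “edgytext” field type, showing the edge n-grams generated for “i’m not down”]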

Running Some Queries

Now we can do some testing to see if we get the results we expect. My test index was for a music site where users often enter song titles as keyword searches. Let’s say I’m searching for the song “I’m Not Down”. The first character typed should trigger the following query sent from the AJAX front-end: http://localhost:8983/solr/select/?q=user_query:"i"&wt=json&fl=user_query&indent=on&echoParams=none&rows=10&sort=count desc

Note that we are using the JSON response writer (&wt=json), asking only for the field we’re interested in (&fl=user_query), turning off echoParams (&echoParams=none) to keep the response as small as possible, and sorting on the count field descending (&sort=count desc). (&indent=on is set for clean display, but this should be omitted for a production site.) Since we entered the count values for each query into the database we can sort on that to get the most popular searches at the top of our results. The response for this query with only the first character entered looks like this:

{
 "responseHeader":{
 "status":0,
 "QTime":1},
 "response":{"numFound":12,"start":0,"docs":[
 {
 "user_query":"i'm only sleeping"},
 {
 "user_query":"i'm glad"},
 {
 "user_query":"i'm not down"},
 {
 "user_query":"i'm a believer"},
 {
 "user_query":"i'm not your stepping stone"},
 {
 "user_query":"i'm not in love"},
 {
 "user_query":"i'm shakin'"},
 {
 "user_query":"i've been everywhere"},
 {
 "user_query":"i'm hurtin'"},
 {
 "user_query":"i'm in the mood for love"}]
 }}

Let’s jump ahead to the fifth character entered: “i’m n” – now the results are more restricted:

{
 "responseHeader":{
 "status":0,
 "QTime":4},
 "response":{"numFound":3,"start":0,"docs":[
 {
 "user_query":"i'm not down"},
 {
 "user_query":"i'm not your stepping stone"},
 {
 "user_query":"i'm not in love"}]
 }}
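On the client side, all that is needed from a response like this is the list of user_query values, in the order Solr returned them. A front-end would do this in JavaScript; the Python below is just a sketch of the same extraction, with a made-up function name.

```python
import json


def extract_suggestions(raw_json, field="user_query"):
    """Return the suggestion strings from a Solr JSON response body, in order."""
    docs = json.loads(raw_json)["response"]["docs"]
    return [doc[field] for doc in docs if field in doc]


# The response shown above, reproduced as a string.
sample_response = '''{
 "responseHeader":{"status":0,"QTime":4},
 "response":{"numFound":3,"start":0,"docs":[
  {"user_query":"i'm not down"},
  {"user_query":"i'm not your stepping stone"},
  {"user_query":"i'm not in love"}]
 }}'''
print(extract_suggestions(sample_response))
```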

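Note that it’s necessary to wrap the query in double-quotes as a phrase. Without the quotes, the query parser splits the input on whitespace before the field’s analyzer ever sees it, so “i’m n” would be treated as two separate clauses instead of a single prefix, and unpredictable and unwanted matches can occur.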
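A minimal sketch of how a client might build the suggest URL, including the phrase quoting, is shown below. The helper name is hypothetical, and the backslash escaping of embedded quotes assumes the standard Lucene query parser syntax; urlencode takes care of percent-encoding the quotes and the space in “count desc”.

```python
from urllib.parse import urlencode


def suggest_url(typed, base="http://localhost:8983/solr/select", rows=10):
    """Build the auto-suggest request URL for the text typed so far."""
    # Escape backslashes and double quotes so user input cannot end the phrase early.
    escaped = typed.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'user_query:"{escaped}"',  # always sent as a quoted phrase
        "wt": "json",
        "fl": "user_query",
        "echoParams": "none",
        "rows": rows,
        "sort": "count desc",
    }
    return f"{base}?{urlencode(params)}"


print(suggest_url("i'm n"))
```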

In Conclusion

We’ve shown one way to implement an auto-suggest feature. But no matter which approach you take it’s important to know what kinds of queries your users are entering, and how to index that data in such a way that it improves the end-user experience for your search application. Parsing log files to capture queries and storing them in a database not only makes it easy to index your query data for auto-suggestions, but the data can also be useful for reporting and analysis, as well as building query lists for load testing.
