Search engines and query processing are not recent in Computer Science: this field known as Information Retrieval has a pretty vast set of state of the art practices. Today most search engines on the market come with a large set of features that developers can use to create their query processing pipeline, but this task is far more difficult than it seems and most people never manage to achieve a good result. In this post, we will cover the classical approaches and how we are handling it in the Algolia engine.
Introduction to query processing
To be able to find results, a search engine has to be able to understand deeply what was asked. That’s the role of the query processing: to analyse the query, and eventually transform it to make it easier to process by the search engine.
And to do so, a search engine will process the query in two big steps, both of which can be very challenging.
- 1) Detect words in the query: this process is called tokenization and identifies what is a word by answering questions like:
“is t-shirt one word or two?”
- 2) Search for alternatives on this set of words: the goal of this step is to be less rigid by adding alternative corrections or synonyms, to avoid missing results that don’t contain exactly what was searched, but something equivalent. It decreases the need to reformulate the query to find what you want.
Why is tokenization complex?
The process of detecting words involves a set of rules to determine what is a word. It is applied both on the query and when indexing the words, to ensure both have the same format of words, which makes them easier to match.
This process seems simple at first – at least for alphabet-based languages – but as often, complexity creeps in when you consider the edge cases. And you soon realize that it’s way harder than it seemed, with no easy solutions.
To give you an idea of the complexity of tokenization, here is a non-exhaustive list of tokenization issues:
- Symbols: You cannot simply ignore symbols. C, C++ and C# are different words. And there are expressions that are composed of symbols only like the “!!!” band!
- Compound words: some languages like Dutch and German can aggregate words to form one expression. For example in Dutch, the word
"disability insurance"and is composed of 3 words: arbeid (labour) + ongeschiktheid (inaptitude) + verzekering (insurance). Of course you would like to have a document containing this long expression match on the
- Agglutination: some languages like Japanese, Chinese and Korean don’t have separators between words (i.e. spaces). Those languages require a statistical approach that helps to detect words in the sequence of ideograms, the search engine also needs to be reliable to an error of word recognition. As with compound words, you want to be able to search for a document with just one of the words.
- Acronyms with Punctuation: you want to find a record containing U.S.A with the USA query and vice versa
- Common words containing periods: sometimes the period can have an importance like for domain names (.com, .net, .org), titles (mr., dr.) or abbreviations (st.)
- Hyphenated words: the hyphen is not an easy case as it can be sometimes considered as a separator like in
"San francisco-based"but sometimes it is important like in
- Apostrophes: they can be used for clitic contractions (we’re ⇒ we are), as genitive markers (Carl’s haircut) or as quotative markers
- CamelCase: pretty common practice of writing compound words or phrases such that each word or abbreviation begins with a capital letter. You probably want to find your
"#AskGaryVee"tag via the
- Others: HTML entities, dates, time, measures, phone numbers, etc.
To know what to search for, the engine needs to know which words compose the query. But all of these exceptions make the process of detecting words pretty challenging. Not an easy problem to solve.
Why is searching for alternatives also complex?
Once the query is tokenized, you have a set of words that will be searched in your documents. We will now enrich the query by searching for alternative words: by expanding the set of words we’ll be able to find matches that are close to the query but not exactly identical.
This process is only performed on the query and not when we index the record. (In a future post, we’ll explain why doing so on indexing is a bad idea.).
If you do not apply this process, you’ll have two problems:
- 1) You will miss some results if there is a small difference between the words in your records and the query (for example a singular versus plural). You can also have typos or errors in your query that cause those records to be unmatched.
- 2) The words in the record can be different than the query and lead to a different segmentation. For example, a record that contains ‘hispeed’ will produce one token whereas the query ‘hi speed’ will produce two tokens. This difference is pretty big as we cannot simply search for records containing the query terms anymore.
That’s when it becomes complex. As you can see in this last example, searching for alternatives can potentially change the number of words in the query, which is always a bit of a mess to manage!
We can distinguish four different types of alternatives that can be added to expand the query:
- 1) An alternative word on a query word: your query contains
"egg"and your record contains
- 2) A multi-word alternative on a query word: your query contains
"NY"and your record contains
"New York"or if your query contains
"iphone"and your record contains
- 3) A word expression as alternative of several query words: your query contains
"New York"and you search for a record containing
"NY"or if your query contains
"i phone"and want to search for a record containing
- 4) A multi-words expression as alternative of several query words: this is the most complex one to handle, this is the case if you add a synonym set containing
['NY', 'New York', 'New York City']. In this case, if your query is
"hotel new york city", it will be extended in:
OR(AND("hotel","New York"), AND("hotel","NY"), AND("hotel","new","york"))
The first type is properly handled in any search engine as it is just an OR between several terms but very few engines handle natively the last three cases which often require a lot of custom code.
Algolia’s way of tokenizing
Our approach of tokenization was always to keep it as simple as possible. In the end, most challenges of tokenization can be solved via an alternative added at query time! That said, there are some cases which require a special processing at indexing time to be perfect:
- 1) Some abbreviation & apostrophes: If you have
"U.S.A"in your record, it is very unlikely that you want to have it match for the query
"a". Which means you just want to have the token
"usa"to avoid noise that could break the relevance. It is the same for the
"we're"expression, in this case you just want to have one token. There are of course exceptions, for example
"l'hotel"(French expression meaning
"the hotel") needs to be found as
"hotel"and we do not want a token “lhotel”. All those cases are handled automatically by our engine without having to configure anything!
- 2) Agglutinative languages: For languages such as Chinese, Japanese and Korean we use a word dictionary with frequencies to split a sequence of ideograms in several tokens. We also detect the change of alphabet as a new token (this is useful, for example, in Japanese to produce a new token when we move from Hiragana to Kanji). All this processing is done automatically and we improve this segmentation all the time based on our user feedback.
- 3) Symbols (&, +, _ …): This part is the only one that is configurable since it depends on the use case and cannot be packaged in a generic way. We have a configuration setting to let you specify the symbols which are important for you. For example we used this approach on a website that helps students. They have to index mathematica formulas and make the difference between
"2x - 1"and
"2x + 1". In this case we have indexed all mathematical symbols (addition, subtraction, multiplication, …)
- 4) Compounded words and camel case: A part of this processing has to be done at indexing if your record contains a compounded word like
"AskGaryVee"and you want to be able to retrieve this record via the
"Gary Vee"query. That’s the only step that we do not have today in production but which is on our roadmap. (You can expect to see this in the future!)
At first sight, the process of tokenization can be perceived as simple. But as you have seen it is pretty complex in practice and requires quite a lot of work. We constantly work on improving it.
We only address a few of the tokenization challenges described at the beginning of this article. We actually solve most of them via alternatives added after tokenization. The next section describes those alternatives in details.
Algolia’s way of searching for alternatives
When we launched the engine in 2013, we had only typo tolerance as an alternative. But since then, we have improved the ability of the engine to find your records even with complex mistakes.
Today the engine expands a query via six different types of alternatives:
1) Typo tolerance on words: after tokenization, we look for potential approximations of each query word in the index dictionary (it contains all words used in this index). In practice, we compute a Damerau-Levenshtein edit-distance between the query word and each word of the dictionary and accept the alternative if the number of typos is acceptable.
This Damerau-Levenshtein edit distance represents the number of letters inserted, deleted, substituted or transposed to change the query word to the dictionary word. For example the distance between
"mikcael" is 1 as there is one transposition between
"kc" (transposition corresponds to a common typing mistake, two letter are inverted in the word).
In practice, we simplify the process. By looking at the maximum distance that is acceptable for each word, we apply a recursive scan on the dictionary (represented as a radix tree). Then we prune the branches: optimizing the scan by re-evaluating on the fly what doesn’t need to be scanned.
By default we accept a maximum distance of 2 if the query word contains at least 8 letters and a maximum distance of 1 if the query word contains at least 4 letters (although those values are configurable).
Last but not least, the typo-tolerance function is able to correct a prefix, which is important in an instant search scenario. For example
"mikc" will be considered as one typo of
"mick" which is the prefix of
"mickael" (In this case the dictionary does not contains the word
"mick" but only the word
In our engine, typos are tolerated on words and prefix of words natively, without having to configure extra processes like ngram computation. This native approach has also the big advantage of being very efficient at indexing and producing an exact computation of relevance. You can learn more about the way we store our dictionary and prefixes in part 2 of this series.
2) Concatenation: Typo tolerance on a tokenized query is already a challenge for performance but this is unfortunately not enough: the query tokenization can be different than the one performed at indexing time. This is the case if the query contains
"i phone" and the record contains
"iphone" and you would like to have an alternative with the concatenation of query terms! We automatically compute a concatenation for all pairs of terms plus a general concatenation of all the query terms. Which means, for example, that the query
"i phone case" will generate the query:
OR(AND("i","phone","case"), AND("iphone","case"), AND("i","phonecase"), "iphonecase")
In order to avoid polluting the results by doing an approximation on an approximation, we do not apply typo tolerance on those new alternatives; they need to be identical in the dictionary. In this query the typo tolerance is only applied on each words of the
AND("i", "phone", "case") part.
3) Split of query tokens: Concatenation is a very good added value of the engine but it unfortunately cannot catch all the cases. The word can be concatenated in the query and split in the record, and in this case you have no other choice than splitting the query term in several words. You have this issue if the query contains the hash tag #searchengine and your record contains
"search engine". The Algolia engine will automatically try to split all your query terms in two words using the dictionary. In this case we know the frequency of the “searchengine” term and the frequency of all possible splits. We look at all the possible ways to split the words and select the one that has the best score for
min(freq(word1),freq(word2)). In this case we take the best score between:
min(freq(se), freq(archengine)), … ,
min(freq(search), freq(engine)), … ,
min(freq(searchengin), freq(e)). It is obvious that the maximum will be obtained for the split in
"engine" and we will extend the query term
"searchengine" with a phrase query between
"engine". (Phrase query means that the terms
"engine" need to be just after
"search" in the record to match.)
4) Transliteration: All of the approaches we have described before work well for alphabet-based languages but can’t be applied on ideogram-based languages such as Chinese, Japanese and Korean. For these languages we use an extension file of the unicode standard called
"Unihan_variants" that contains links between ideograms. For example those two lines contains a mapping between a simplified Chinese and traditional Chinese character:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
This unicode file also contains a Z-variant of the same ideogram (share the same etymology but have slightly different appearances), for example:
U+3588 kZVariant U+439B
U+439B kZVariant U+3588
The unicode format goes even further by containing ideogram with overlapping meanings, for example:
U+349A kSemanticVariant U+7A69<kFenn,kMatthews
U+349A kSpecializedSemanticVariant U+6587<kFenn
U+6587 kSpecializedSemanticVariant U+349A<kFenn U+7A69<kFenn
U+7A69 kSemanticVariant U+349A<kMatthews
The engine will consider all those transliteration as one or two typo for ideogram-based languages. You will be able to search for simplified Chinese using a traditional Chinese query and vice versa!
5) Lemmatisation: Typo tolerance is pretty efficient in tolerating most differences between singular and plural forms. There are of course exceptions like
"foot/feet". But the most important thing is that you probably don’t want to consider those small differences as a typo. This is why we have packaged a dictionary that contains singular/plurals form of words in 88 languages. This dictionary is called a lemmatizer and is exposed with the ignorePlural setting. You can simply add singular/plural alternatives without having to care about building such a resource.
6) Synonyms: The last type of alternative added in a query are synonyms. We have a pretty powerful synonym feature that handles all types of synonyms:
- Add an alternative on a query term (the alternative can be a single word or a multi-word expression)
- Add an alternative on a multi-term query expression. The alternative can be a multi-word expression and can contain any number of words
- Add a placeholder token in your text that matches a list of words. You can, for example, add <streetnumber> in your record and configure it to match any numeric expression from 1 to 2000. Of course the record will not be found on the
“<streetnumber> Downing Street”will be found for the query
“10 Downing Street”.
All those synonyms can be handled with a correct highlighting (
“iPhone” will be correctly highlighted for the query
"iphine", despite the typo) with a correct handling of proximity between expressions. This part is pretty complex and not correctly handled by most engines, we will explain how we handle it in a future post!
We do not package synonyms with the engine. The reason is that synonyms are tightly linked to a domain and most of the time you just need a few of them to increase a lot the quality of your search. For example, I saw the synonym
"tee" on several e-commerce website. This synonym makes sense as we see users typing
"tee" to search for t-shirt, but it can be dangerous if you have sports articles (
"tee ball" or
The main issue behind that is the polysemy of natural languages and the various specificities of each website. Our take is to handle automatically all the potential typos that few search engines handle automatically (like t-shirt = tshirt = t shirt) and use synonyms as a domain specific tuning you can perform to refine it when needed.
Building a good query processing takes time!
We have built all those query processing steps over the last four years and of course new ones will come! Our first users only had typo tolerance at the beginning but they have seen the benefits of each improvement! Better: it did not have any negative impact on their relevance due to our way of handling ranking. We also managed to bring all those improvements without sacrificing performance! Before each added alternative, we have implemented a new optimization in the engine to counterbalance the cost of our alternatives!
We recommend to read the other posts of this series: