14-day free trial
Create a full-featured search experience in no time.
Sorry, there is no results for this query
Today most search engines on the market come with a large set of features that developers can use to create their query processing pipeline, but this task is far more difficult than it seems and most people never manage to achieve a good result. In this post, we will cover the classical approaches and how we are handling it in the Algolia engine.
To be able to find results, a search engine has to be able to understand deeply what was asked. That’s the role of the query processing: to analyse the query, and eventually transform it to make it easier to process by the search engine.
And to do so, a search engine will process the query in two big steps, both of which can be very challenging.
“is t-shirt one word or two?”
The process of detecting words involves a set of rules to determine what is a word. It is applied both on the query and when indexing the words, to ensure both have the same format of words, which makes them easier to match.
This process seems simple at first – at least for alphabet-based languages – but as often, complexity creeps in when you consider the edge cases. And you soon realize that it’s way harder than it seemed, with no easy solutions.
To give you an idea of the complexity of tokenization, here is a non-exhaustive list of tokenization issues:
"disability insurance"and is composed of 3 words: arbeid (labour) + ongeschiktheid (inaptitude) + verzekering (insurance). Of course you would like to have a document containing this long expression match on the
"San francisco-based"but sometimes it is important like in
"#AskGaryVee"tag via the
To know what to search for, the engine needs to know which words compose the query. But all of these exceptions make the process of detecting words pretty challenging. Not an easy problem to solve.
Once the query is tokenized, you have a set of words that will be searched in your documents. We will now enrich the query by searching for alternative words: by expanding the set of words we’ll be able to find matches that are close to the query but not exactly identical.
This process is only performed on the query and not when we index the record. (In a future post, we’ll explain why doing so on indexing is a bad idea.).
If you do not apply this process, you’ll have two problems:
That’s when it becomes complex. As you can see in this last example, searching for alternatives can potentially change the number of words in the query, which is always a bit of a mess to manage!
We can distinguish four different types of alternatives that can be added to expand the query:
"egg"and your record contains
"NY"and your record contains
"New York"or if your query contains
"iphone"and your record contains
"New York"and you search for a record containing
"NY"or if your query contains
"i phone"and want to search for a record containing
['NY', 'New York', 'New York City']. In this case, if your query is
"hotel new york city", it will be extended in:
OR(AND("hotel","New York"), AND("hotel","NY"), AND("hotel","new","york"))
The first type is properly handled in any search engine as it is just an OR between several terms but very few engines handle natively the last three cases which often require a lot of custom code.
Our approach of tokenization was always to keep it as simple as possible. In the end, most challenges of tokenization can be solved via an alternative added at query time! That said, there are some cases which require a special processing at indexing time to be perfect:
"U.S.A"in your record, it is very unlikely that you want to have it match for the query
"a". Which means you just want to have the token
"usa"to avoid noise that could break the relevance. It is the same for the
"we're"expression, in this case you just want to have one token. There are of course exceptions, for example
"l'hotel"(French expression meaning
"the hotel") needs to be found as
"hotel"and we do not want a token “lhotel”. All those cases are handled automatically by our engine without having to configure anything!
"2x - 1"and
"2x + 1". In this case we have indexed all mathematical symbols (addition, subtraction, multiplication, …)
"AskGaryVee"and you want to be able to retrieve this record via the
"Gary Vee"query. That’s the only step that we do not have today in production but which is on our roadmap. (You can expect to see this in the future!)
At first sight, the process of tokenization can be perceived as simple. But as you have seen it is pretty complex in practice and requires quite a lot of work. We constantly work on improving it.
We only address a few of the tokenization challenges described at the beginning of this article. We actually solve most of them via alternatives added after tokenization. The next section describes those alternatives in details.
When we launched the engine in 2013, we had only typo tolerance as an alternative. But since then, we have improved the ability of the engine to find your records even with complex mistakes.
Today the engine expands a query via six different types of alternatives:
1) Typo tolerance on words: after tokenization, we look for potential approximations of each query word in the index dictionary (it contains all words used in this index). In practice, we compute a Damerau-Levenshtein edit-distance between the query word and each word of the dictionary and accept the alternative if the number of typos is acceptable.
This Damerau-Levenshtein edit distance represents the number of letters inserted, deleted, substituted or transposed to change the query word to the dictionary word. For example the distance between
"mikcael" is 1 as there is one transposition between
"kc" (transposition corresponds to a common typing mistake, two letter are inverted in the word).
In practice, we simplify the process. By looking at the maximum distance that is acceptable for each word, we apply a recursive scan on the dictionary (represented as a radix tree). Then we prune the branches: optimizing the scan by re-evaluating on the fly what doesn’t need to be scanned.
By default we accept a maximum distance of 2 if the query word contains at least 8 letters and a maximum distance of 1 if the query word contains at least 4 letters (although those values are configurable).
Last but not least, the typo-tolerance function is able to correct a prefix, which is important in an instant search scenario. For example
"mikc" will be considered as one typo of
"mick" which is the prefix of
"mickael" (In this case the dictionary does not contains the word
"mick" but only the word
In our engine, typos are tolerated on words and prefix of words natively, without having to configure extra processes like ngram computation. This native approach has also the big advantage of being very efficient at indexing and producing an exact computation of relevance. You can learn more about the way we store our dictionary and prefixes in part 2 of this series.
2) Concatenation: Typo tolerance on a tokenized query is already a challenge for performance but this is unfortunately not enough: the query tokenization can be different than the one performed at indexing time. This is the case if the query contains
"i phone" and the record contains
"iphone" and you would like to have an alternative with the concatenation of query terms! We automatically compute a concatenation for all pairs of terms plus a general concatenation of all the query terms. Which means, for example, that the query
"i phone case" will generate the query:
OR(AND("i","phone","case"), AND("iphone","case"), AND("i","phonecase"), "iphonecase")
In order to avoid polluting the results by doing an approximation on an approximation, we do not apply typo tolerance on those new alternatives; they need to be identical in the dictionary. In this query the typo tolerance is only applied on each words of the
AND("i", "phone", "case") part.
3) Split of query tokens: Concatenation is a very good added value of the engine but it unfortunately cannot catch all the cases. The word can be concatenated in the query and split in the record, and in this case you have no other choice than splitting the query term in several words. You have this issue if the query contains the hash tag #searchengine and your record contains
"search engine". The Algolia engine will automatically try to split all your query terms in two words using the dictionary. In this case we know the frequency of the “searchengine” term and the frequency of all possible splits. We look at all the possible ways to split the words and select the one that has the best score for
min(freq(word1),freq(word2)). In this case we take the best score between:
min(freq(se), freq(archengine)), … ,
min(freq(search), freq(engine)), … ,
min(freq(searchengin), freq(e)). It is obvious that the maximum will be obtained for the split in
"engine" and we will extend the query term
"searchengine" with a phrase query between
"engine". (Phrase query means that the terms
"engine" need to be just after
"search" in the record to match.)
4) Transliteration: All of the approaches we have described before work well for alphabet-based languages but can’t be applied on ideogram-based languages such as Chinese, Japanese and Korean. For these languages we use an extension file of the unicode standard called
"Unihan_variants" that contains links between ideograms. For example those two lines contains a mapping between a simplified Chinese and traditional Chinese character:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
This unicode file also contains a Z-variant of the same ideogram (share the same etymology but have slightly different appearances), for example:
U+3588 kZVariant U+439B
U+439B kZVariant U+3588
The unicode format goes even further by containing ideogram with overlapping meanings, for example:
U+349A kSemanticVariant U+7A69<kFenn,kMatthews
U+349A kSpecializedSemanticVariant U+6587<kFenn
U+6587 kSpecializedSemanticVariant U+349A<kFenn U+7A69<kFenn
U+7A69 kSemanticVariant U+349A<kMatthews
The engine will consider all those transliteration as one or two typo for ideogram-based languages. You will be able to search for simplified Chinese using a traditional Chinese query and vice versa!
5) Lemmatisation: Typo tolerance is pretty efficient in tolerating most differences between singular and plural forms. There are of course exceptions like
"foot/feet". But the most important thing is that you probably don’t want to consider those small differences as a typo. This is why we have packaged a dictionary that contains singular/plurals form of words in 88 languages. This dictionary is called a lemmatizer and is exposed with the ignorePlural setting. You can simply add singular/plural alternatives without having to care about building such a resource.
6) Synonyms: The last type of alternative added in a query are synonyms. We have a pretty powerful synonym feature that handles all types of synonyms:
“<streetnumber> Downing Street”will be found for the query
“10 Downing Street”.
All those synonyms can be handled with a correct highlighting (
“iPhone” will be correctly highlighted for the query
"iphine", despite the typo) with a correct handling of proximity between expressions. This part is pretty complex and not correctly handled by most engines, we will explain how we handle it in a future post!
We do not package synonyms with the engine. The reason is that synonyms are tightly linked to a domain and most of the time you just need a few of them to increase a lot the quality of your search. For example, I saw the synonym
"tee" on several e-commerce website. This synonym makes sense as we see users typing
"tee" to search for t-shirt, but it can be dangerous if you have sports articles (
"tee ball" or
The main issue behind that is the polysemy of natural languages and the various specificities of each website. Our take is to handle automatically all the potential typos that few search engines handle automatically (like t-shirt = tshirt = t shirt) and use synonyms as a domain specific tuning you can perform to refine it when needed.
We have built all those query processing steps over the last four years and of course new ones will come! Our first users only had typo tolerance at the beginning but they have seen the benefits of each improvement! Better: it did not have any negative impact on their relevance due to our way of handling ranking. We also managed to bring all those improvements without sacrificing performance! Before each added alternative, we have implemented a new optimization in the engine to counterbalance the cost of our alternatives!
We recommend to read the other posts of this series: