
Today, most search engines on the market come with a large set of features that developers can use to build their query processing pipeline, but this task is far more difficult than it seems, and most people never manage to achieve a good result. In this post, we will cover the classical approaches and how we handle query processing in the Algolia engine.

Introduction to query processing

To be able to find results, a search engine has to deeply understand what was asked. That’s the role of query processing: to analyse the query and, if needed, transform it to make it easier for the search engine to process.

And to do so, a search engine will process the query in two big steps, both of which can be very challenging.

  1. Detect words in the query: this process is called tokenization and identifies what is a word by answering questions like: “is t-shirt one word or two?”
  2. Search for alternatives on this set of words: the goal of this step is to be less rigid by adding alternative corrections or synonyms, to avoid missing results that don’t contain exactly what was searched, but something equivalent. It decreases the need to reformulate the query to find what you want.
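
As a rough illustration of this two-step flow, here is a minimal, self-contained sketch (toy code, not the actual engine): a naive whitespace tokenizer followed by a toy expansion step that only adds a plural form.

import re

def tokenize(query):
    # Step 1: detect the words composing the query (naive version: split on whitespace).
    return re.findall(r"\S+", query.lower())

def expand(token):
    # Step 2: add alternatives (here only a toy plural form) so near-matches are not missed.
    return {token, token + "s"}

# "blue shoe" can now match a record containing "blue shoes" even though the forms differ.
print([expand(t) for t in tokenize("blue shoe")])
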
Why is tokenization complex?

The process of detecting words relies on a set of rules that determine what constitutes a word. These rules are applied both to the query and when indexing, so that both sides produce words in the same format, which makes them easier to match.

This process seems simple at first – at least for alphabet-based languages – but, as often happens, complexity creeps in when you consider the edge cases. You soon realize that it’s way harder than it seemed, with no easy solutions.

To give you an idea of the complexity of tokenization, here is a non-exhaustive list of tokenization issues:

  • Symbols: You cannot simply ignore symbols. C, C++ and C# are different words, and some expressions are composed of symbols only, like the band "!!!".
  • Compound words: some languages like Dutch and German aggregate words to form a single expression. For example, the Dutch word "arbeidsongeschiktheidsverzekering" means "disability insurance" and is composed of three words: arbeid (labour) + ongeschiktheid (inaptitude) + verzekering (insurance). Of course, you would like a document containing this long expression to match the "verzekering" (insurance) query.
  • Agglutination: some languages like Japanese, Chinese and Korean don’t have separators between words (i.e. spaces). Those languages require a statistical approach to detect words in the sequence of ideograms, and the search engine also needs to be resilient to word-recognition errors. As with compound words, you want to be able to search for a document with just one of the words.
  • Acronyms with punctuation: you want to find a record containing "U.S.A" with the "USA" query and vice versa.
  • Common words containing periods: sometimes the period matters, as in domain names (.com, .net, .org), titles (mr., dr.) or abbreviations (st.).
  • Hyphenated words: the hyphen is not an easy case, as it can sometimes be considered a separator, like in "forty-two" or "San Francisco-based", but sometimes it is significant, like in "pre-processing".
  • Apostrophes: they can be used for clitic contractions (we’re ⇒ we are), as genitive markers (Carl’s haircut) or as quotative markers.
  • CamelCase: a pretty common practice of writing compound words or phrases such that each word or abbreviation begins with a capital letter. You probably want to find your "#AskGaryVee" tag via the "Gary Vee" query.
  • Others: HTML entities, dates, times, measures, phone numbers, etc.

To know what to search for, the engine needs to know which words compose the query. But all of these exceptions make the process of detecting words pretty challenging. Not an easy problem to solve.
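
To make the difficulty concrete, here is a minimal sketch (not Algolia’s tokenizer) showing how the obvious approach – splitting on anything that is not a letter or digit – mangles several of the cases above:

import re

def naive_tokenize(text):
    # Split on anything that is not a letter or digit: a common first attempt.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# "C++" and "C#" both collapse to "c", so the three languages become indistinguishable.
print(naive_tokenize("C, C++ and C# are different"))
# "U.S.A" becomes ["u", "s", "a"], so the query "a" would now match this record.
print(naive_tokenize("U.S.A"))
# The hyphen is dropped whether it acts as a separator ("forty-two") or not ("pre-processing").
print(naive_tokenize("forty-two pre-processing"))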

Why is searching for alternatives also complex?

Once the query is tokenized, you have a set of words that will be searched in your documents. We will now enrich the query by searching for alternative words: by expanding the set of words we’ll be able to find matches that are close to the query but not exactly identical.

This process is only performed on the query and not when we index the record. (In a future post, we’ll explain why doing so at indexing time is a bad idea.)

If you do not apply this process, you’ll have two problems:

  1. You will miss some results if there is a small difference between the words in your records and the query (for example singular versus plural). Typos or errors in your query can also prevent those records from matching.
  2. The words in the record can be different from the query and lead to a different segmentation. For example, a record that contains ‘hispeed’ will produce one token whereas the query ‘hi speed’ will produce two tokens. This difference is pretty big, as we can no longer simply search for records containing the query terms.

That’s when it becomes complex. As you can see in this last example, searching for alternatives can potentially change the number of words in the query, which is always a bit of a mess to manage!

We can distinguish four different types of alternatives that can be added to expand the query:

  1. An alternative word for a query word: your query contains "egg" and your record contains "eggs".
  2. A multi-word alternative for a query word: your query contains "NY" and your record contains "New York", or your query contains "iphone" and your record contains "i phone".
  3. A one-word alternative for several query words: your query contains "New York" and you want to find a record containing "NY", or your query contains "i phone" and you want to find a record containing "iphone".
  4. A multi-word expression as an alternative for several query words: this is the most complex case to handle. It occurs, for example, if you add a synonym set containing ['NY', 'New York', 'New York City']. In this case, if your query is "hotel new york city", it will be expanded into:
    OR(AND("hotel","New York"),
       AND("hotel","NY"),
       AND("hotel","new","york"))
    

The first type is properly handled by any search engine, as it is just an OR between several terms, but very few engines natively handle the last three cases, which often require a lot of custom code.
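
To make the last case more tangible, here is a simplified sketch of how such an expansion could be generated for the synonym set above. It is illustrative only: the structure it builds mirrors the OR/AND notation used in this post, not the engine’s internal representation.

def expand_with_synonyms(tokens, synonym_sets):
    # Keep the original query as one AND branch, then add one branch per way of
    # replacing a matching word sequence by another member of a synonym set.
    branches = [("AND",) + tuple(tokens)]
    for syn_set in synonym_sets:
        for phrase in syn_set:
            words = phrase.lower().split()
            for i in range(len(tokens) - len(words) + 1):
                if tokens[i:i + len(words)] == words:
                    for alt in syn_set:
                        if alt.lower().split() != words:
                            branches.append(
                                ("AND",) + tuple(tokens[:i]) + (alt,) + tuple(tokens[i + len(words):]))
    return ("OR",) + tuple(branches)

synonyms = [["NY", "New York", "New York City"]]
print(expand_with_synonyms(["hotel", "new", "york", "city"], synonyms))

Note that this naive version also produces partial rewrites such as AND("hotel", "NY", "city"); a real engine additionally has to decide how to score and deduplicate such branches.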

Algolia’s way of tokenizing

Our approach to tokenization has always been to keep it as simple as possible. In the end, most challenges of tokenization can be solved via an alternative added at query time! That said, there are some cases that require special processing at indexing time to be handled perfectly:

  1. Some abbreviations & apostrophes: If your record contains "U.S.A", it is very unlikely that you want it to match the query "a", which means you just want the token "usa" to avoid noise that could break relevance. The same goes for the expression "we're": in this case you just want one token. There are of course exceptions; for example "l'hotel" (French for "the hotel") needs to be found via "hotel", and we do not want a token “lhotel”. All those cases are handled automatically by our engine without having to configure anything!
  2. Agglutinative languages: For languages such as Chinese, Japanese and Korean, we use a word dictionary with frequencies to split a sequence of ideograms into several tokens. We also treat a change of alphabet as a token boundary (this is useful, for example, in Japanese to produce a new token when we move from Hiragana to Kanji). All this processing is done automatically, and we improve this segmentation all the time based on user feedback.
  3. Symbols (&, +, _ …): This part is the only one that is configurable, since it depends on the use case and cannot be packaged in a generic way. We have a configuration setting that lets you specify the symbols that are important for you. For example, we used this approach for a website that helps students: they have to index mathematical formulas and distinguish between "2x - 1" and "2x + 1". In this case we indexed all mathematical symbols (addition, subtraction, multiplication, …). A sketch of this idea follows the list.
  4. Compound words and camel case: Part of this processing has to be done at indexing time if your record contains a compound word like "AskGaryVee" and you want to be able to retrieve this record via the "Gary Vee" query. That’s the only step that is not in production today, but it is on our roadmap. (You can expect to see this in the future!)
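
Here is a minimal sketch of the symbol case from item 3: a tokenizer that keeps a configurable set of symbols as part of the tokens. It illustrates the idea only; it is not Algolia’s implementation, and the indexed_symbols parameter name is made up.

def tokenize(text, indexed_symbols=frozenset()):
    # Keep letters and digits, plus any symbol explicitly declared as significant;
    # every other character acts as a separator.
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalnum() or ch in indexed_symbols:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# Without configuration, "2x + 1" and "2x - 1" collapse to the same tokens.
print(tokenize("2x + 1"), tokenize("2x - 1"))
# Declaring '+' and '-' as indexed symbols keeps the two formulas distinct.
print(tokenize("2x + 1", {"+", "-"}), tokenize("2x - 1", {"+", "-"}))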

At first sight, the process of tokenization can seem simple. But as you have seen, it is pretty complex in practice and requires quite a lot of work. We are constantly working to improve it.

Indexing only addresses a few of the tokenization challenges described at the beginning of this article; we actually solve most of them via alternatives added after tokenization. The next section describes those alternatives in detail.

Algolia’s way of searching for alternatives

When we launched the engine in 2013, we had only typo tolerance as an alternative. But since then, we have improved the ability of the engine to find your records even with complex mistakes.

Today the engine expands a query via six different types of alternatives:

1) Typo tolerance on words: after tokenization, we look for potential approximations of each query word in the index dictionary (which contains all the words used in the index). In practice, we compute a Damerau-Levenshtein edit distance between the query word and each word of the dictionary and accept the alternative if the number of typos is acceptable.

This Damerau-Levenshtein edit distance represents the number of letters inserted, deleted, substituted or transposed to change the query word into the dictionary word. For example, the distance between "mickael" and "mikcael" is 1, as there is one transposition between "ck" and "kc" (a transposition corresponds to a common typing mistake: two letters are inverted in the word).

In practice, we simplify the process: knowing the maximum distance that is acceptable for each word, we apply a recursive scan of the dictionary (represented as a radix tree) and prune branches as we go, re-evaluating on the fly what does not need to be scanned.

By default we accept a maximum distance of 2 if the query word contains at least 8 letters and a maximum distance of 1 if the query word contains at least 4 letters (although those values are configurable).
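
For reference, here is a compact, unoptimized sketch of the restricted Damerau-Levenshtein distance described above, together with the default typo thresholds from the previous paragraph. It compares two words directly; the engine itself scans the radix-tree dictionary instead, and the parameter names below are only illustrative.

def damerau_levenshtein(a, b):
    # Restricted (optimal string alignment) variant: insertions, deletions,
    # substitutions and transpositions of adjacent letters each cost 1.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def max_typos(word, min_word_size_for_1_typo=4, min_word_size_for_2_typos=8):
    # Default thresholds from the post: 1 typo from 4 letters, 2 typos from 8 letters.
    if len(word) >= min_word_size_for_2_typos:
        return 2
    if len(word) >= min_word_size_for_1_typo:
        return 1
    return 0

print(damerau_levenshtein("mickael", "mikcael"))  # 1: one transposition of "ck"/"kc"
print(max_typos("mick"), max_typos("mickael"))    # 1, 1 (seven letters is below the 2-typo threshold)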

Last but not least, the typo-tolerance function is able to correct a prefix, which is important in an instant search scenario. For example, "mikc" will be considered one typo away from "mick", which is the prefix of "mickael" (in this case the dictionary does not contain the word "mick", only the word "mickael").

In our engine, typos are tolerated on words and prefixes of words natively, without having to configure extra processes like ngram computation. This native approach also has the big advantage of being very efficient at indexing and of producing an exact computation of relevance. You can learn more about the way we store our dictionary and prefixes in part 2 of this series.

2) Concatenation: Typo tolerance on a tokenized query is already a challenge for performance, but this is unfortunately not enough: the query tokenization can be different from the one performed at indexing time. This is the case if the query contains "i phone" while the record contains "iphone", and you would like to have an alternative with the concatenation of the query terms! We automatically compute a concatenation for each pair of adjacent terms, plus a general concatenation of all the query terms. This means, for example, that the query "i phone case" will generate the query:

OR(AND("i","phone","case"),
   AND("iphone","case"),
   AND("i","phonecase"),
   "iphonecase")

In order to avoid polluting the results by applying an approximation on top of an approximation, we do not apply typo tolerance to those new alternatives; they need to match the dictionary exactly. In this query, typo tolerance is only applied to each word of the AND("i", "phone", "case") part.
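
A rough sketch of this pairwise-plus-full concatenation, purely for illustration (the engine builds this expansion internally rather than through code like this):

def concatenation_alternatives(tokens):
    # One alternative per adjacent pair of terms, plus the concatenation of everything.
    alternatives = [("AND",) + tuple(tokens)]
    for i in range(len(tokens) - 1):
        merged = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
        alternatives.append(("AND",) + tuple(merged))
    if len(tokens) > 2:
        alternatives.append("".join(tokens))
    return ("OR",) + tuple(alternatives)

# Mirrors the "i phone case" example above; typo tolerance is not applied to these.
print(concatenation_alternatives(["i", "phone", "case"]))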

3) Split of query tokens: Concatenation is a very good added value of the engine, but it unfortunately cannot catch all the cases. A word can be concatenated in the query and split in the record, and in this case you have no other choice than to split the query term into several words. You have this issue if the query contains the hashtag #searchengine and your record contains "search engine". The Algolia engine will automatically try to split each of your query terms into two words using the dictionary. In this case, we know the frequency of the "searchengine" term and the frequency of all possible splits. We look at all the possible ways to split the word and select the one that has the best score for min(freq(word1), freq(word2)). In this case we take the best score among: min(freq(s), freq(earchengine)), min(freq(se), freq(archengine)), … , min(freq(search), freq(engine)), … , min(freq(searchengin), freq(e)). It is obvious that the maximum will be obtained for the split "search" + "engine", and we will extend the query term "searchengine" with a phrase query between "search" and "engine". (A phrase query means that the term "engine" needs to appear just after "search" in the record to match.)
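
Here is a toy sketch of that scoring, with a made-up frequency table (the real frequencies come from the index dictionary):

def best_split(term, freq):
    # Try every way of cutting the term into two words and keep the split whose
    # rarest half is the most frequent, i.e. maximize min(freq(w1), freq(w2)).
    best, best_score = None, 0
    for i in range(1, len(term)):
        w1, w2 = term[:i], term[i:]
        score = min(freq.get(w1, 0), freq.get(w2, 0))
        if score > best_score:
            best, best_score = (w1, w2), score
    return best

# Hypothetical word frequencies taken from an index dictionary.
freq = {"search": 1200, "engine": 800, "se": 15, "arch": 40, "searchengine": 3}
print(best_split("searchengine", freq))  # ('search', 'engine')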

4) Transliteration: All of the approaches we have described so far work well for alphabet-based languages but can’t be applied to ideogram-based languages such as Chinese, Japanese and Korean. For these languages we use an extension file of the Unicode standard called "Unihan_variants" that contains links between ideograms. For example, these two lines contain a mapping between a simplified Chinese character and its traditional Chinese variant:

U+673A  kTraditionalVariant     U+6A5F

U+6A5F  kSimplifiedVariant      U+673A

This Unicode file also contains Z-variants of ideograms (characters that share the same etymology but have slightly different appearances), for example:

U+3588 kZVariant U+439B

U+439B kZVariant U+3588

The Unicode format goes even further by listing ideograms with overlapping meanings, for example:

U+349A kSemanticVariant U+7A69<kFenn,kMatthews

U+349A kSpecializedSemanticVariant U+6587<kFenn

U+6587 kSpecializedSemanticVariant U+349A<kFenn U+7A69<kFenn

U+7A69 kSemanticVariant U+349A<kMatthews

For ideogram-based languages, the engine considers all those transliterations as one or two typos. You will be able to search for simplified Chinese using a traditional Chinese query and vice versa!
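
For illustration, here is a minimal sketch of how such Unihan variant lines could be turned into a lookup table of alternative ideograms. This is not the engine’s actual data pipeline; it is just a way to make the file format concrete.

from collections import defaultdict

def parse_unihan_variants(lines):
    # Map each code point to the set of its variant code points, ignoring the
    # source annotations after '<' (e.g. "<kFenn,kMatthews").
    variants = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        source, targets = fields[0], fields[2:]
        for target in targets:
            variants[source].add(target.split("<")[0])
    return variants

sample = [
    "U+673A\tkTraditionalVariant\tU+6A5F",
    "U+6A5F\tkSimplifiedVariant\tU+673A",
    "U+349A\tkSemanticVariant\tU+7A69<kFenn,kMatthews",
]
print(parse_unihan_variants(sample)["U+349A"])  # {'U+7A69'}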

5) Lemmatisation: Typo tolerance is pretty efficient at tolerating most differences between singular and plural forms. There are of course exceptions like "city/cities" or "foot/feet", but the most important thing is that you probably don’t want to count those small differences as typos. This is why we have packaged a dictionary that contains the singular/plural forms of words in 88 languages. This dictionary is called a lemmatizer and is exposed via the ignorePlural setting. You can simply add singular/plural alternatives without having to worry about building such a resource.
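
As a toy illustration of what such a dictionary buys you (the word list below is made up; the packaged resource covers 88 languages):

# Hypothetical excerpt of a singular/plural dictionary; the real one is packaged with the engine.
PLURALS = {"city": "cities", "foot": "feet", "mouse": "mice"}
SINGULARS = {plural: singular for singular, plural in PLURALS.items()}

def plural_alternatives(word):
    # Return the other grammatical number of the word, if known.
    alternatives = set()
    if word in PLURALS:
        alternatives.add(PLURALS[word])
    if word in SINGULARS:
        alternatives.add(SINGULARS[word])
    return alternatives

# "feet" and "foot" now match each other even though typo tolerance would not link them.
print(plural_alternatives("feet"), plural_alternatives("city"))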

6) Synonyms: The last type of alternative added to a query is synonyms. We have a pretty powerful synonym feature that handles all types of synonyms:

  • Add an alternative on a query term (the alternative can be a single word or a multi-word expression).
  • Add an alternative on a multi-term query expression. The alternative can be a multi-word expression and can contain any number of words.
  • Add a placeholder token in your text that matches a list of words. You can, for example, add <streetnumber> in your record and configure it to match any numeric expression from 1 to 2000. Of course, the record will not be found for the "streetnumber" query, but “<streetnumber> Downing Street” will be found for the query “10 Downing Street” (see the sketch after this list).
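
A small sketch of the placeholder idea from the last bullet; the matching logic here is deliberately simplistic and is not how the engine implements the feature:

def expand_placeholders(record_tokens, placeholders):
    # Replace each placeholder token by the set of values it is configured to match.
    return [placeholders.get(token, {token}) for token in record_tokens]

def matches(query_tokens, expanded_record):
    # The query matches if each query token falls within the corresponding token set.
    return len(query_tokens) == len(expanded_record) and all(
        q in allowed for q, allowed in zip(query_tokens, expanded_record))

# "<streetnumber>" is configured to match any number from 1 to 2000.
placeholders = {"<streetnumber>": {str(n) for n in range(1, 2001)}}
record = expand_placeholders(["<streetnumber>", "downing", "street"], placeholders)
print(matches(["10", "downing", "street"], record))            # True
print(matches(["streetnumber", "downing", "street"], record))  # False: the literal word does not match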

All those synonyms are handled with correct highlighting (“iPhone” will be correctly highlighted for the query "iphine", despite the typo) and with correct handling of proximity between expressions. This part is pretty complex and not handled correctly by most engines; we will explain how we handle it in a future post!

We do not package synonyms with the engine. The reason is that synonyms are tightly linked to a domain, and most of the time you just need a few of them to greatly increase the quality of your search. For example, I have seen the synonym "t-shirt" = "tee" on several e-commerce websites. This synonym makes sense, as we see users typing "tee" to search for t-shirts, but it can be dangerous if you have sports articles ("tee ball" or "golf tee").

The main issue behind this is the polysemy of natural languages and the specificities of each website. Our take is to automatically handle all the potential variants and typos that few search engines handle automatically (like t-shirt = tshirt = t shirt) and to treat synonyms as domain-specific tuning you can apply to refine relevance when needed.

Building good query processing takes time!

We have built all those query processing steps over the last four years, and of course new ones will come! Our first users only had typo tolerance at the beginning, but they have seen the benefits of each improvement. Better still: thanks to our way of handling ranking, these additions did not have any negative impact on their relevance. We also managed to bring all those improvements without sacrificing performance: before each added alternative, we implemented a new optimization in the engine to counterbalance its cost!

We recommend reading the other posts in this series.

About the author
Julien Lemoine

Co-founder & former CTO at Algolia

