Search by Algolia
What is end-to-end AI search?
ai

What is end-to-end AI search?

Simplicity is critical for search engines, but building one that enables that simplicity is complex.  Over the last 20+ years ...

Abhijit Mehta

Director of Product Management

Comparing AI search solutions in a crowded market landscape
ai

Comparing AI search solutions in a crowded market landscape

Many new AI-powered search solutions have been released this year, and each promises to provide great results, but as ...

Andy Jones

Marketing Campaign Production Manager

What is B2B ecommerce? Everything you need to know
e-commerce

What is B2B ecommerce? Everything you need to know

When you think of “customer experience,” what comes to mind? People, right? Specifically, consumers. Retail customers. That’s easy; the ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

What is ecommerce merchandising? Key components and best practices
e-commerce

What is ecommerce merchandising? Key components and best practices

A potential customer is about to land on the home page of your ecommerce platform, curious to see what cool ...

Catherine Dee

Search and Discovery writer

AI-powered search: From keywords to conversations
ai

AI-powered search: From keywords to conversations

By now, everyone’s had the opportunity to experiment with AI tools like ChatGPT or Midjourney and ponder their inner ...

Chris Stevenson

Director, Product Marketing

Vector vs Keyword Search: Why You Should Care
ai

Vector vs Keyword Search: Why You Should Care

Search has been around for a while, to the point that it is now considered a standard requirement in many ...

Nicolas Fiorini

Senior Machine Learning Engineer

What is AI-powered site search?
ai

What is AI-powered site search?

With the advent of artificial intelligence (AI) technologies enabling services such as Alexa, Google search, and self-driving cars, the ...

John Stewart

VP Corporate Marketing

What is a B2B marketplace?
e-commerce

What is a B2B marketplace?

It’s no secret that B2B (business-to-business) transactions have largely migrated online. According to Gartner, by 2025, 80 ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

3 strategies for B2B ecommerce growth: key takeaways from B2B Online - Chicago
e-commerce

3 strategies for B2B ecommerce growth: key takeaways from B2B Online - Chicago

Twice a year, B2B Online brings together industry leaders to discuss the trends affecting the B2B ecommerce industry. At the ...

Elena Moravec

Director of Product Marketing & Strategy

Deconstructing smart digital merchandising
e-commerce

Deconstructing smart digital merchandising

This is Part 2 of a series that dives into the transformational journey made by digital merchandising to drive positive ...

Benoit Reulier
Reshma Iyer

Benoit Reulier &

Reshma Iyer

The death of traditional shopping: How AI-powered conversational commerce changes everything
ai

The death of traditional shopping: How AI-powered conversational commerce changes everything

Get ready for the ride: online shopping is about to be completely upended by AI. Over the past few years ...

Aayush Iyer

Director, User Experience & UI Platform

What is B2C ecommerce? Models, examples, and definitions
e-commerce

What is B2C ecommerce? Models, examples, and definitions

Remember life before online shopping? When you had to actually leave the house for a brick-and-mortar store to ...

Catherine Dee

Search and Discovery writer

What are marketplace platforms and software? Why are they important?
e-commerce

What are marketplace platforms and software? Why are they important?

If you imagine pushing a virtual shopping cart down the aisles of an online store, or browsing items in an ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

What is an online marketplace?
e-commerce

What is an online marketplace?

Remember the world before the convenience of online commerce? Before the pandemic, before the proliferation of ecommerce sites, when the ...

Catherine Dee

Search and Discovery writer

10 ways AI is transforming ecommerce
e-commerce

10 ways AI is transforming ecommerce

Artificial intelligence (AI) is no longer just the stuff of scary futuristic movies; it’s recently burst into the headlines ...

Catherine Dee

Search and Discovery writer

AI as a Service (AIaaS) in the era of "buy not build"
ai

AI as a Service (AIaaS) in the era of "buy not build"

Imagine you are the CTO of a company that has just undergone a massive decade long digital transformation. You’ve ...

Sean Mullaney

CTO @Algolia

By the numbers: the ROI of keyword and AI site search for digital commerce
product

By the numbers: the ROI of keyword and AI site search for digital commerce

Did you know that the tiny search bar at the top of many ecommerce sites can offer an outsized return ...

Jon Silvers

Director, Digital Marketing

Using pre-trained AI algorithms to solve the cold start problem
ai

Using pre-trained AI algorithms to solve the cold start problem

Artificial intelligence (AI) has quickly moved from hot topic to everyday life. Now, ecommerce businesses are beginning to clearly see ...

Etienne Martin

VP of Product

Looking for something?

facebookfacebooklinkedinlinkedintwittertwittermailmail

In any search engine, adding synonyms is one of the most important customization to introduce domain-specific knowledge. Synonyms make it possible to tune search results without having to make major changes to records.

Supporting synonyms is no easy task especially for multi-word expressions, which introduce complexity in the handling of the proximity measure between matched words. We will describe in this post how we handle this complexity via a rewriting of word positions at query time.

Different types of synonyms

The most common way to define synonyms is to use a set of expressions that are all identical. We call such a set an N-way Synonym because any expression of the synonym set is a synonym of another expression.

For example, if New York is unambiguously a City in your use case, you probably want to have a synonym set containing “NY” = “NYC” = “New York” = “New York City”. This synonym set is configured as N-way, which means that:

  • “NY” in the query also matches “NYC” or “New York” or “New York City”
  • “NYC” in the query also matches “NY” or “New York” or “New York City”
  • “New York” in the query also matches “NY” or “NYC” or “New York” or “New York City”
  • “New York City” in the query also matches “NY” or “NYC” or “New York”

Less common but still used are asymmetrical synonyms, in which you want to apply an equivalency only in one direction – we call this a 1-way Synonym. For example, “smartphone” = [“iphone”, “android”] means that you want the “smartphone” query to be expended in “iphone” and “android” because you can have a record containing iphone without having the word smartphone. That said, you do not want the query “iphone” to returns all records containing the word “smartphone” – that wouldn’t be relevant for the user.

The Algolia engine supports both 1-way and N-way synonyms, and both types can contain any type of expression (one word or multiple words like “New York City”).

The dangers of generic synonym sets

We are often asked by users when we will include a standard synonym dictionary in the engine. The question is understandable – after all, we are taught in school that most words have synonyms, so it seems easy to package all this knowledge into one generic resource (per language). Unfortunately, it’s not that simple.

There are varying degrees to which two words or phrases are synonymous. For example, “substitute” is a stronger synonym to “alternative” than “choice” even if you can find both in a standard English synonym dictionary. Adding “substitute” as a synonym of “alternative” can be a very good idea for a website that compares technologies or offers; that being said, it can easily add noise on other use cases, especially if the name of a technology contains “substitute” in the name like the Substitute or jquery-substitute open source projects.

Polysemy is present in all Natural languages, which means that a word can have different meanings depending on the context. For example, the word “crane” can be a bird or a construction equipment and the context is required to select the correct meaning. Our customers are all in a specific semantic context and the dictionary need to be specific to give the best results. If you are on a tech website you don’t want to have a generic synonym of the word “push” because there is a strong meaning in tech and using synonyms for the verb or the noun “push” would lead to weird results.

The most useful synonyms are always very specific to your domain. This is the case of the synonym set [“NY” = “NYC” = “New York” = “New York City”] we mentioned before. This synonym set is valid if you only have cities in your use data set, it could introduce a bad relevance if you also have States!

Even specific synonyms can have a different meaning in another use case. For example, I saw the synonym T-shirt = Tee on several ecommerce websites. This synonym makes sense as a lot of users are typing “tee” to search for a t-shirt. That said, this synonym is dangerous if you have a sports website as you can have a query for “tee ball” or “golf tee” and won’t like to return t-shirt!

With all those constraints in mind, it is close to impossible to package a set of synonyms that would make sense for all use cases as it is too much use case dependent. In practice it would lead to a bad relevance and results would be hard to understand for your users.

This is why we do not plan to package a generic synonym dictionary in the engine. That said, there are other resources that we already package in a generic way like the singular/plural forms of a language that we expose in the ignorePlurals feature. We plan to continue introducing such resource in the future that provides a lot of value without having the drawback described before.

Computing synonym at indexing or query

One of the main question with synonyms is to make the choice between pre-computation at indexing time or expansion at search time. Both approaches have advantage and drawbacks:

With pre-computation at Indexing time, if you have a synonym “NY” = “NYC”, you will index the document with both words when one of the two words is found in a record. This method is the best for performance as there is no OR between the two synonym words at search time. The biggest issue with this approach is to introduce the support of multi-words synonyms, which require introducing a new knowledge in the inverted list.

For example, if you have the text “New York Subway”, the word “New” will be indexed as position 0, the word “York” will be indexed at position 1 and the word “Subway” will be indexed at position 2. If you have a synonym “New York” = “NY”, you will have to introduce in your index the token “NY” at position 0 with a padding of 2 positions so that the query “NY subway” will have correct positions. This additional information has a big impact on the index size and performance that usually nullify a lot the gain of indexing in most use cases.

Conversely, with computation at search time, you will search for all synonyms at query time with an OR operand between them. The big advantage of this approach is that you can compute the exact position of each matching synonym with a post-processing task (see next section for details). The disadvantage is that it requires more CPU as we have to compute the OR between several inverted lists.

Lastly, A hybrid approach is a mix of both approaches where all synonyms are searched at query time but some expensive computations are precomputed in one inverted list. Most of the time the expensive computation that is indexed is the search of consecutive words (phrase query). For example, the “New York” synonym would be indexed as one inverted list. The complexity of this implementation is to make it work with prefix search as we will describe below.

The implementation of synonyms in the Algolia engine is fully done at search time as we want to have an exact computation of the proximity (see next chapter for more details). We plan to implement a hybrid approach in the future where phrase queries will be pre-indexed in order to improve performance and have the best of both worlds.

Handling multi-words synonyms is hard

Even if our implementation is currently fully done at search time, we are still forced to rewrite the matched word position in order to have an exact computation of the relevance. We have explained in details this process in part 4 of this series in section 2.3.2.

Without this processing, having a synonym “NY” = “New York” would lead to different proximity for the query “New York Subway” on those two records:

{"Title":“Why New York Subway Lines Are Missing Countdown Clocks”}
{"Title":“NYC subway math”}

 

The complexity is that “New York” is identified as position 1 and 2 in the first record, followed by “Subway” at position 3, whereas the second record contains “NY” at position 0, followed by “subway” at position 1. Rewriting this will ensure the proximity measure between the matched word will be identical for both records. This process essentially considers that “NYC” was composed of the two original words “new” and “york” and will increase all following positions.

The following diagram shows the transformation done on the position of words to have an exactly identical proximity computation for records with a different number of words in the synonym (Note: words are indexed starting at position 0. For example, the first record is indexed with “why” at position 0, “new” at position 1, etc.)

Handling synonyms the right way

Reducing noise in results

Synonyms are just one of the various alternative expressions that are added on the queries. We also add typo tolerance, concatenation & split of query terms, transliteration and lemmatization. All those alternative expressions are described in the part 3 of this series.

In order to avoid adding noise, we do not apply the other alternative expressions on synonyms. This is the big reason why a search for “New York” won’t give the same number of hits than “NY” when you have a synonym “NY” = “New York”, records containing “NY” and “New York” will be retrieved in both cases but the record containing “NewYork” for example will be only returned for the “New York” query via the concatenation expression.

The following diagram explains the way we add alternative expressions, all steps are computed at the same time and are not re-entrant. In other words, we do not use the output of one alternative expression to feed the input of another step.

Handling synonyms the right way

Prefix match on synonyms

Handling synonyms correctly in an instant search use case is pretty complex. Most engines fall short in this respect and have a bad user experience because of this processing. For example, let’s say you have 5 records containing “NY” and 5 records containing “New York” and have defined the synonym “NY” = “New York”:

  • the query “n” will match all words starting with “n”, so records containing “NY” and “New York” will be matched: the 10 records are found
  • the query “ne”, “new”, “new y”, “new yo”, “new yor” will only match the 5 records containing “New York” as the synonym is not triggered
  • The synonym will only be triggered when the query contains “New York”, returning the 10 records

In order to avoid this flickering of search results, we consider that the last word of an expression is allowed to match as a prefix. So the synonym will be triggered as soon as the query contains “new y” (of course the synonym will also be triggered for the query “new yo”, “new yor” and “new york”). This approach reduces a lot the problem of flickering as the synonym is added as soon as the last word is started and deliver an intuitive behavior:

  • It would be very complex to understand that the query “ne” match “NY” because we match a prefix of the expression “New York” (would be the case if we allow to match the complete synonym as a prefix)
  • It is pretty easy to understand that the query “new y” is matching as a prefix of “New York”

This is why, in the Algolia engine, the final word of a synonym expression can be matched as a prefix. Of course, we continue to add the synonym when the expression is only a part of the query, so the query “New York subway” will still match the synonym “New York” = “NY” and records containing “NY” and “subway” will match. This recognition of expressions at any position of the search query is not performed by most engines.

Difference between original terms and synonyms

Most users want to consider synonyms exactly identical to the search terms but this is not always the case, as we described before there is several degrees of synonyms and some synonyms are closer than others.

We have several tunings available in the engine to let you configure the way to define the expressions that should be considered as identical or not:

alternativesAsExact is an index setting that specifies the type of alternatives that should be counted in the “exact” criterion of the ranking formula. As we explained in detail in part 4 of this series, it counts the number of terms that exactly match the query. By default, this settings is set to [“ignorePlurals”, “monoWordSynonym”], which means that the singular/plural alternative and the synonyms containing one word are considered as exact but a synonym containing several words won’t be considered as exact as they are often semantically farther. You can consider them as exact by adding “multiWordsSynonym” in the array.

The synonyms dictionary also contains alternative corrections that are different semantically and should be considered as one or two typos. The spelling can be very far from the original words and so won’t be cached by the typo detection, so you can define them manually. For the moment this feature is limited to single-word expressions.

Implementation details matter!

The configuration of synonyms is a key aspect of the search relevance and should be configured in most use cases. We have seen in this post that there are a lot of implementation details in the way synonyms are handled that have a direct consequence on the user experience or the ranking.

As with most of the features we introduce in the engine, we have spent a lot of time designing synonyms and how they should behave. This is why we introduced their support progressively: we started with support of single-word synonyms in May 2014, followed by multi-word synonyms in April 2015 and a dedicated API to handle synonyms in June 2016.

We took the time to think about these implementation details and to our knowledge we are the engine that goes the deepest in terms of supporting as-you-type synonyms and multi-word synonym positions to have a perfectly identical ranking. We still have improvements to the way we handle the synonyms in our roadmap – like the hybrid indexing of multi-word synonyms to improve performance.

If you have any other ideas of improvements, we would be very interested to discuss them with you!

We recommend to read the other posts of this series:

About the author
Julien Lemoine

Co-founder & former CTO at Algolia

githublinkedintwitter

14-day free trial

Create a full-featured search experience in no time.

Get started
14-day free trial

Recommended Articles

Powered byAlgolia Algolia Recommend

Inside the Algolia Engine Part 3 — Query Processing
engineering

Julien Lemoine

Co-founder & former CTO at Algolia

Inside the Algolia Engine Part 5 – Highlighting, a Cornerstone of Search UX
engineering

Julien Lemoine

Co-founder & former CTO at Algolia

Inside the Algolia Engine Part 4 — Textual Relevance
engineering

Julien Lemoine

Co-founder & former CTO at Algolia