Algolia’s top 10 tips to achieve highly relevant search results

As a hosted-search engine service, we discuss the relevance aspect of search with our customers and prospects all day long. We now have more than 1500 customers and have seen a large variety of real-life search problems. It’s interesting to note that more often than not, these problems are in some way connected to the R word. Relevance.

Relevance is a well understood concept in search engines, but is pretty complex to measure, as a high degree of subjectivity is implicit in the notion of relevance. In fact, we’ve seen time and again that too many people spend too much time trying to control their relevance.

In this post, we’ll share our top 10 tips to help people achieve good relevance in their search results.

1) Structure your data

It might seem like we’re stating the obvious, but the first step to having good relevance is structuring your data correctly. Having a long string that contains all your information concatenated won’t put you on the path to good relevance.

You should have an object structured with different attributes and different strings, so the engine can weigh matches according to the importance you give each attribute. Keeping list values as separate strings instead of concatenating them also keeps the proximity measure accurate: in a concatenated string, the last word of one value sits right next to the first word of the next value and looks like a close match. (Proximity is an important element of relevance that measures how close the query terms are to each other in the matched document.)

Here is an example using a movie title:

{
  "title": "Fast & Furious 6",
  "alternative_titles": [
    "The Fast and the Furious 6",
    "速度与激情6",
    ...
  ],
  "genre": [
    "Action",
    "Thriller",
    "Crime"
  ],
  "objectID": "440309800"
}

Which works better than:

{
  "movie": "Fast & Furious 6 | The Fast and the Furious 6 | 速度与激情6 | Action, Thriller, Crime"
}
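
To illustrate the proximity problem, here is a minimal Python sketch. It is not the Algolia engine's proximity computation; the tokenizer and the helper names (tokenize, min_distance, min_distance_per_value) are assumptions made for this example, and it uses only a subset of the values from the record above.

```python
import re

def tokenize(text):
    """Lowercase and split on non-word characters, dropping empty tokens."""
    return [w for w in re.split(r"\W+", text.lower()) if w]

def min_distance(tokens, a, b):
    """Smallest gap in word positions between a and b, or None if one is absent."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

def min_distance_per_value(values, a, b):
    """Proximity computed inside each value separately, never across values."""
    distances = [min_distance(tokenize(v), a, b) for v in values]
    distances = [d for d in distances if d is not None]
    return min(distances) if distances else None

concatenated = "Fast & Furious 6 | Action, Thriller, Crime"
structured = ["Fast & Furious 6", "Action", "Thriller", "Crime"]

# In the concatenated string, "6" (end of the title) and "action"
# (start of the genres) look like adjacent words.
print(min_distance(tokenize(concatenated), "6", "action"))   # 1
# With separate values, the two words never share a value.
print(min_distance_per_value(structured, "6", "action"))     # None
```

With the concatenated string, a query such as "6 action" appears to match two neighboring words even though they belong to unrelated fields.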

2) Handle typo-tolerance

Before doing advanced tuning, there are a lot of small checks you can do to achieve a decent level of textual relevance. For example, you should be able to find a record that contains "iphone" for the queries "i-phone" and "i phone", or find a record containing "hi-speed usb" with the query "hispeed usb". You should also check that you handle abbreviations and are able to find a record containing "U.S.A" with the query "USA" and vice versa.

Typos are very frequent, especially because of the small virtual keyboards of mobile devices (and the famous fat-finger effect). If your record contains "iphone", you should be able to find it via "ipjone" or "iphoen". Users also love as-you-type search experiences, so you should ensure that you are able to tolerate typos on the prefix. For example, the query "mikca" should be considered a prefix of "mickael" (because "mikca" is one typo away from "micka").

All these cases are automatically handled for you by the Algolia engine.
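
If you want to write these checks as tests against your own search, the sketch below may help. It is only an illustration and makes no claim about how the Algolia engine works internally: squash is a crude, hypothetical normalizer that drops every non-alphanumeric character, and the typo check uses the optimal string alignment distance (insertions, deletions, substitutions and adjacent transpositions each count as one typo).

```python
import re

def squash(text):
    """Crude normalization: lowercase and drop everything but letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def osa_distance(a, b):
    """Optimal string alignment distance between a and b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def prefix_matches(query, word, max_typos=1):
    """True if query is within max_typos of some prefix of word."""
    return any(osa_distance(query, word[:n]) <= max_typos
               for n in range(1, len(word) + 1))

print(squash("i-phone") == squash("iphone"))   # True
print(squash("U.S.A") == squash("USA"))        # True
print(osa_distance("ipjone", "iphone"))        # 1
print(prefix_matches("mikca", "mickael"))      # True
```

Comparing squashed strings is deliberately simplistic (it also removes spaces), so treat it as a test oracle for the examples above rather than as a matching strategy.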

3) Don’t fix issues locally, think global

Relevance is not a one-day job, and you will discover specific queries with relevance issues over time. Try not to optimize your relevance for one specific query without keeping the big picture in mind. You need a setup that lets you efficiently test each modification against your entire set of query logs, so you can be sure you won't degrade the relevance of other queries. A good non-regression test suite is mandatory.
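
A non-regression check can start very small. The sketch below is a hypothetical harness, not a feature of any particular engine: it replays logged queries against two search functions (the current configuration and a candidate one) and reports every query whose top results changed. The catalog and both search functions are toy stand-ins.

```python
def changed_queries(queries, search_before, search_after, k=3):
    """Return {query: (top-k before, top-k after)} for queries whose top-k changed."""
    changed = {}
    for query in queries:
        before = search_before(query)[:k]
        after = search_after(query)[:k]
        if before != after:
            changed[query] = (before, after)
    return changed

# Toy data standing in for a real index and real query logs.
CATALOG = ["iphone charger", "iphone case", "usb cable"]

def search_before(query):
    """Current configuration: substring match, catalog order."""
    return [item for item in CATALOG if query in item]

def search_after(query):
    """Candidate configuration: same matches, shortest record first."""
    return sorted(search_before(query), key=len)

query_log = ["iphone", "usb", "cable"]
report = changed_queries(query_log, search_before, search_after)
for query, (before, after) in report.items():
    print(query, before, "->", after)
```

Reviewing only the queries that changed keeps the focus on the global effect of a modification instead of the single query that motivated it.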

4) Stop thinking about “boost”

When we talk to different search teams, we see that they are all used to configuring "boost" and it comes as a reflex for them. By boost they mean integers that they configure in their ranking formula like “I configured a boost of 5 on my title attribute, 2 on my brand attribute and 1 on my description attribute” which is kind of code for “The attribute title is 5 times more important than description and brand is twice as important as description”.

Unfortunately, these "boosts" are not so great. Changing one from a value X to Y is the most common source of issues we see every day! No engineer is able to predict what will happen, because the boost is combined with a lot of other factors that make the effect of the modification unpredictable (you can see it as one integer mixed with other elements in a big mathematical formula). In other words, even if you have non-regression tests, a change of boost will totally change your result set, and it will be close to impossible to say whether the new configuration is better than the previous one.

This is actually why we designed a different way to handle search relevance with a predictable approach.
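
To make the boost problem concrete, here is a toy Python sketch. The scoring formula (a plain weighted sum of match counts per attribute) and the three records are assumptions chosen for illustration; real ranking formulas mix in many more factors, which only makes the effect harder to predict.

```python
def score(matches, boosts):
    """Weighted sum: number of matches in each attribute times that attribute's boost."""
    return sum(boosts[attr] * count for attr, count in matches.items())

def rank(records, boosts):
    """Record names ordered by descending score (name breaks ties)."""
    return sorted(records, key=lambda name: (-score(records[name], boosts), name))

# Hypothetical match counts per attribute for one query.
records = {
    "A": {"title": 1, "brand": 0, "description": 0},
    "B": {"title": 0, "brand": 1, "description": 2},
    "C": {"title": 0, "brand": 0, "description": 4},
}

print(rank(records, {"title": 5, "brand": 2, "description": 1}))  # ['A', 'B', 'C']
print(rank(records, {"title": 3, "brand": 2, "description": 1}))  # ['B', 'C', 'A']
```

Lowering a single boost from 5 to 3 moves the title match from first to last for this query, and nothing in the configuration itself tells you which other queries are affected.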

5) Always explain the search results

Sometimes your search can be good but the user can perceive the results as bad because there is no explanation of the match.
The best–and most intuitive–way to explain a search result is to visually highlight the matching query terms in the result, but there are also two cases that are often not well handled:

  1. When you are searching in multiple attributes but only display one of them statically: in this case it would be better to display the matching attributes. This is, for example, the case if you return “Elon Musk” for the query “Tesla.” This is great, but it would be even better to add a sub-line with “CEO: Elon Musk” to make sure someone who doesn’t know Elon Musk would understand the result.
  2. When you are matching via a typo: it can sometimes be non-intuitive to see where the typo is, and highlighting the matched term is very useful to help the user quickly understand the match.
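
As a rough illustration of the first case, the Python sketch below highlights query terms and reports which attributes matched, so the UI can display them. The function names, the <em> tag and the sample record are assumptions for this example, and the matching is a plain case-insensitive substring match with no typo tolerance.

```python
import re

def highlight(text, query, tag="em"):
    """Wrap every case-insensitive occurrence of a query term in <tag>...</tag>."""
    terms = sorted({re.escape(t) for t in query.split()}, key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile("|".join(terms), re.IGNORECASE)
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)

def matched_attributes(record, query):
    """Return {attribute: highlighted value} for attributes containing a query term."""
    result = {}
    for attr, value in record.items():
        marked = highlight(value, query)
        if marked != value:
            result[attr] = marked
    return result

# Hypothetical record: only "name" is displayed by default.
record = {"name": "Elon Musk", "company": "Tesla"}

print(highlight("CEO: Elon Musk", "elon"))   # CEO: <em>Elon</em> Musk
print(matched_attributes(record, "tesla"))   # {'company': '<em>Tesla</em>'}
```

When the matched attribute is not the one displayed by default, the returned dictionary tells the front end which extra line to show.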

6) Think twice before removing stop words

As obvious as it seems, before trying to use a recipe you should be sure that your use case is compatible with it. This is the case with stop word removal (removing the most commonly used words in a given language like "the", "of", "to", "be", "or", …).

But those words can be very useful, and removing them sometimes hurts relevance. For example, if you search for the “To Beta or not to Beta” article on Hacker News with stop words removed, the engine ends up with the query “Beta”.

There are even worse queries, such as searching for the artist “The The.” In this case you would just get a no-results page!

Of course, there are cases where removing stop words is useful, for example if you have a query in natural language or if you are trying to search for similar content. But those cases are more the exception than the norm. Be wary of removing stop words!
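
The two failing queries above are easy to reproduce with a naive filter. This Python sketch uses a tiny, hypothetical stop-word list: the first five words come from the list above, and "not" is added as an assumption so that the first example reduces to "Beta" as described.

```python
# Tiny illustrative list; "not" is an assumption added for this example.
STOP_WORDS = {"the", "of", "to", "be", "or", "not"}

def remove_stop_words(query):
    """Lowercase the query, split on whitespace, and drop stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("To Beta or not to Beta"))  # ['beta', 'beta']
print(remove_stop_words("The The"))                 # []
```

An empty token list means the engine has nothing left to search for, which is how a perfectly valid artist name turns into a no-results page.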

7) There are entries that are complex to find

There will always be some specific queries that are complex to handle. This is the case for the TV show called “V.” This query is particularly challenging in an instant search use case:

  • For the sake of good UX, you should launch the search from the very first character typed (we’ve tested it many times, and have found that any heuristic that launches the query after N letters leads to poorer UX and lower conversion rates).
  • You should have a preference for the exact match over the prefix match, because there are probably a lot of words that start with "v" in your data set.

Another type of corner case is the use of symbols, for example if you are looking for the band “!!!.” We encounter such problems with symbols in almost every use case.
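
The preference for exact matches over prefix matches can be sketched in a few lines of Python. This is a simplified illustration with made-up titles, not a description of any engine's ranking: a hit where the query equals a whole word ranks before a hit where the query is only the beginning of a word.

```python
def match_kind(title, query):
    """0 for an exact word match, 1 for a prefix match, None for no match."""
    words = title.lower().split()
    q = query.lower()
    if q in words:
        return 0
    if any(word.startswith(q) for word in words):
        return 1
    return None

def search(titles, query):
    """Matching titles, exact matches first; original order is kept within each group."""
    hits = [(match_kind(title, query), title) for title in titles]
    hits = [(kind, title) for kind, title in hits if kind is not None]
    return [title for kind, title in sorted(hits, key=lambda hit: hit[0])]

titles = ["Vikings", "Veep", "V", "Lost"]
print(search(titles, "v"))  # ['V', 'Vikings', 'Veep']
```

Because the search runs from the first keystroke, the show "V" surfaces immediately instead of being buried under every title that starts with the same letter.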

8) Be careful with language specific techniques

Natural languages have a lot of variety that can cause your records not to be returned, for example if you use a singular word in your query and your record contains the plural. There are some language-specific heuristics that help to address this problem. The most popular are:

  • Stemming: Reduction of a word to its simplest form, called a stem, most of the time by removing the suffix of the word, for example transforming "running" into "run". The most popular open source stemmer is Snowball, which is based on a set of rules per language.
  • Phonetization: Computation of a replacement of the word that represents its pronunciation. Most phonetic algorithms are based on or derived from the Soundex algorithm and only work with the English language.
  • Lemmatization: Reduction of all the different inflected forms of a word to a common form. This is similar to the stemming approach except that it is based on a dictionary developed by linguists, which usually contains mainly nouns and verbs.

The major drawback of these approaches is that each one addresses only one language. In practice, we see very few data sets that contain words from a single language, and these techniques can produce noise on proper names such as last names or brands. Think, for example, of a people search on a social network, where these approaches can introduce bad results.
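
The noise on proper names is easy to see with even the simplest rule. The Python sketch below is a deliberately naive plural stripper written for this illustration; it is not Snowball and not a real stemmer, and its two rules are assumptions.

```python
def naive_stem(word):
    """Very naive English plural stripping, for illustration only."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"      # batteries -> battery
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]            # phones -> phone
    return w

print(naive_stem("phones"))     # phone
print(naive_stem("batteries"))  # battery
print(naive_stem("Jobs"))       # job
```

The same rule that helps "phones" match "phone" also turns the last name "Jobs" into "job", so a people search for that name starts matching unrelated records.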

9) Use business data in your relevance

The first eight tips target the textual relevance, but you should also include business data in order to have good relevance. It can be just a basic metric like the number of page views or something more advanced like the number of times a product was put in a cart.

It can even be an advanced metric which relates to the query like “the number of times a product was bought when searched with a particular query”.

From our experience, the addition of business data makes a big difference if the textual relevance is good. That said, business relevance should not bypass textual relevance, or you risk losing all the benefits of the hard work done on relevance! Textual relevance should (almost always) go first. When the textual relevance can't decide which of two hits should go first, the engine should then use the associated business data.
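
One simple way to express "textual relevance first, business data second" is a tuple sort key, as in the Python sketch below. The field names (text_score, page_views) and the hits are hypothetical, and a single integer text score is a strong simplification of real textual relevance.

```python
def rank_hits(hits):
    """Order by textual score first; page views only break ties between equal scores."""
    return sorted(hits, key=lambda hit: (-hit["text_score"], -hit["page_views"]))

# Hypothetical hits for one query.
hits = [
    {"name": "A", "text_score": 2, "page_views": 10},
    {"name": "B", "text_score": 2, "page_views": 500},
    {"name": "C", "text_score": 1, "page_views": 9000},
]

print([hit["name"] for hit in rank_hits(hits)])  # ['B', 'A', 'C']
```

Record C is by far the most viewed, but it stays last because its textual match is weaker; page views only decide the order between A and B.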

10) Personalize the ranking for each user

Personalization of search is the final touch to get perfect relevance, and it is the part that most people don’t really see. Let’s take a simple example: if you search for “milk” on the site of your favorite grocery store, and the store has applied all the previous tips, you will find the most popular milk bottle. But if you are a regular user of this store and have already bought a particular bottle of milk several times in the past, you’re likely to expect this one first. This is the ultimate way to make the user love the search results and avoid the perception of bad relevance. In other words, it’s the icing on the cake!

Not an exhaustive list

We hope this list of advice will help you build better search functionality on your website or app. This list is unfortunately not exhaustive, as relevance is a pretty complex domain and there are a lot of specific problems that we do not cover here.
Our team is dedicated to helping you achieve better relevance. Feel free to contact us at contact(at)algolia.com to share your problems, and we will be happy to analyze them with you.

About the author

Julien Lemoine

Co-founder & former CTO at Algolia
