When we started Algolia, we wanted to build an offline search SDK for mobile applications. Unlike other search engines on the market, we couldn’t rely on pre-existing search engine software such as Lucene because they weren’t available on mobile.
We had to build everything from the ground up within the limitations imposed by smartphones such as low memory and low processing power. Ultimately, we spent a considerable amount of time reinventing the search engine from scratch.
When we eventually decided to pivot to a SaaS solution, we had built a completely new ranking algorithm. And in a typical “constraints foster creativity” situation, we had made two huge improvements: Speed and Relevance.
Algolia is known for its blazing-fast speed. It’s usually the first thing people notice when they start using it and also what makes it so exhilarating to use. But what I consider to be the most impressive and important achievement of Algolia is not actually speed but Relevance.
Algolia reinvented how search engines define relevance.
Before diving into the specifics behind Algolia’s ranking formula, let’s briefly go over how other search engines work.
In the late 90s, the Text Retrieval Conference organized a series of annual workshops financed in part by the US Department of Defense. The goal of these workshops was to determine the best ranking algorithm for searching a dataset consisting of long unstructured documents. The conference resulted in several enhancements to search engine ranking algorithms, the most defining being improvements to TF-IDF, what was then the main component of the most powerful ranking formulas.
TF-IDF is a statistical weighting scheme based on two elements: TF or Term Frequency (the number of occurrences of a word in a document, which is calculated on a per document basis) and IDF or Inverse Document Frequency (the importance of a query word within the whole corpus of documents).
Basically, if a query word is repeated multiple times within a document, that document will rank higher in the results than a document that only contains the query word once. IDF comes into play to reduce the importance of words that occur a lot in the corpus such as “the”, “of”, “a”, etc. The more times a query word recurs within the whole corpus, the lower a document containing that query word will be ranked. The length of the document is also taken into account; a short document with query word matches will rank higher than a long document with the same match.
TF-IDF (along with similar algorithms like BM25) became a central tool to score/rank documents given a user query, and it was used to search any type of data. Unfortunately, it was not actually designed for that! TD-IDF was built as an answer to TREC’s workshops and was designed specifically to work within long unstructured documents such as PDFs or enterprise documents.
The issue with this kind of statistical approach is that it surprisingly only makes sense for a very small number of use-cases, like enterprise document search, where:
Algolia wasn’t built to search long unstructured documents. It was built to power the search of websites and mobile applications. And in 99% of cases, this data is very different.
Say that you’re on IMDB searching for an actor, and you type “Brad”. You don’t care about the Term Frequency (how many times the document contains the word Brad), or the Inverse Document Frequency (how many times the word Brad is present in the list of all actors).
Fortunately, there are other signals that matter:
As you can guess, these questions give us a much better chance at finding Brad Pitt than a statistical approach like TF-IDF.
This is the first pillar of Algolia’s revolutionary improvements—the rules taken into account in the Ranking algorithm.
Algolia doesn’t rely on any variation of TF-IDF. Instead, it uses six default rules to evaluate the textual relevance of an object for a specific query:
Most search engines have somewhat similar implementations of the Words and Attribute rules (some also have an implementation of Proximity). The rest of these rules are unique to Algolia. You can of course disable or customize the rules to fit your specific use-case, and you can also use non-textual criteria such as geo-location.
These rules are a great improvement over what existed before, but the biggest and most important change Algolia brought to the table is something else — how these rules are used to rank the objects.
The biggest issue with the relevance of most search engines is not TF-IDF (which can actually be disabled in most search engines using it). It’s the way the ranking criteria are used. Other search engines compute a big floating-point score that mixes everything. They have a very complex mathematical formula with a lot of coefficients/boosts that merges every rule into one score for each document. Then, they use this unique score to rank the search results.
The first problem with this approach is that it mixes apples and oranges. For example, if your ranking formula takes into account TF-IDF and the number of likes of the documents:
This exception-driven approach is not maintainable.
Algolia’s ranking formula is designed to fix this. And it does so with a different ranking approach via the Tie-Breaking algorithm.
Instead of calculating one big score and then sorting the results once, a tie-breaking algorithm calculates N scores and applies N successive sorting. Here’s how it works:
The order of the criteria determines their importance. It’s like in a children’s game—if you sort by color and then by shape, you don’t get the same results as when you sort by shape then color.
Configuring the relevance of Algolia for a specific use-case is as simple as figuring out the set of rules that need to be applied and ordering them by importance.
Oh, and everything is customizable—from the order of the attributes to which attributes are used. So there’s virtually no ranking strategy that you can’t build with this ranking formula.
Let’s consider an example with the following dataset:
[ { "name": "Jon Black", "featured": true, "number_of_likes": 4 }, { "name": "John Jackson", "featured": false, "number_of_likes": 17 }, { "name": "John Paul", "featured": false, "number_of_likes": 3 }, { "name": "Jon White", "featured": false, "number_of_likes": 9 }, { "name": "John Thompson", "featured": true, "number_of_likes": 8 } ]
And this index configuration:
To simplify our example, let’s ignore some of the rules and focus only on Typo, Featured and Number of Likes. For the query “John,” we’ll get the following results:
In this example, we:
A few things to note:
As of writing this, Algolia is live on thousands of websites and mobile applications, and this ranking strategy has consistently proven successful. We have spent a long time fine-tuning every detail and are pleased with the level of performance. We continue to make relevance a core focus, in particular to introduce new criteria to better enable the personalization of results per user.
We’re always happy to help you figure out how you could use this algorithm to achieve the ranking strategy that best fits your specific use-case. And we’re eager to know what you think, so feel free to leave us a comment.
Nicolas Baissas
Powered by Algolia AI Recommendations
Josh Dzielak
Julien Lemoine
Co-founder & former CTO at AlgoliaCatherine Dee
Search and Discovery writer