Search engine usage has evolved a lot in the last few years. Today’s users expect instant feedback without having to click on the “enter” button and wait for the page to load to get their results. They also expect the engine to tolerate typing mistakes on the fly without having to validate proposals of query re-phrasing, like the (in)famous “did you mean?”
Of course all these changes have pretty big consequences on the engine itself. In this post, we will concentrate on the way we handle indexing at Algolia, why we do it differently in order to maximize indexing speed and search performance and how we natively address these evolutions in search.
Introduction to indexing
The goal of the indexing process is to store your data in a specific data-structure that is optimized for search. Without indexing, we would need to scan all the documents in a database to detect which ones match a query. This would take forever and be highly inefficient, ultimately resulting in a terrible user experience. This is one of the reasons we continue to focus on optimizing indexing speed as part of our overall efforts to make search more efficient.
The main part of the indexing process is to create a data-structure that contains a mapping of each word to the associated list of documents containing this word. One mapping of the word to the list of documents is called an inverted list, with the name coming from the inversion of the space: instead of having documents that contains words, we have now words with the list of documents containing each word. The concept is similar to the index you have at the end of a book, except all words are in the index.
There are a lot of algorithms to store those inverted lists in an efficient way while keeping good performance. Most of those techniques are pretty old and well described in the very good Managing Gigabytes book.
The creation of those inverted lists can be decomposed in two big parts:
For each document, we extract the list of words and build a hash-table that associates words to documents
When all documents are processed, we compute an on-disk binary data-structure containing the mapping of words to documents. This data-structure is the index we will use to process queries.
How search-as-you-type is usually implemented
As-you-type display of search results was introduced by Google in 2010 as “Google Instant” and later referred to as “Instant Search.” Users love this experience as it creates a seamless interaction with the data they are looking for! It removes the intermediate steps such as query validation, “did-you-mean” re-phrasing and decreases the time spent waiting for the results to show up! It also means indexing and search have to be redesigned because you don’t simply search for words anymore, but also for partially written words (prefixes) and it needs to be very fast! There are four main different approaches engineers use today to build this type of instant search:
1) Autocomplete (also called Suggest)
This approach is the most commonly used and is based on a static data-structure on the side of the search engine. You can see it as a big dictionary in the form of a Trie (special type of Tree that associates string to an output, all the descendants of a node have a common prefix of the string).
Very fast response time (usually few milliseconds)
Scales well, can easily handle thousands of queries per second on one machine
The relevance is not the same as the search engine itself, it is just a static dictionary. It works well if you propose “frequent queries,” but starts to have weird effects if you use it to search on the same records as the fully-featured search engine itself. The relevance will be different, of lesser quality and will lead to a lot of confusion.
2) Index prefixes
This approach is simply to build an inverted list for all prefixes of a word instead of just the word itself. For example instead of just having an inverted list for the word “engine”, you will also have one for “e”, “en”, “eng”, “engi” and “engin”. Those lists will usually contain a specific indicator that allows at query time to make the difference between a prefix and an exact word (both exact and prefix inverted lists will be queried with a preference for exact).
Fast resolution of prefixes as they are pre-computed
Ability to keep the relevance of the search-engine for all interaction with the engine, including as-you-type experience (user is never lost between two different ways to rank results)
Increase the time to publish results because the indexing process is far more expensive than without inverted list and consumes far more memory (inverted list are kept in memory)
Increase a lot the size of the index, which reduces a lot the likelihood of fitting in memory and so can hurt performance (even on a SSD)
3) ngrams computation
This approach is very similar to the indexing of all prefixes but it limits the computation of prefixes by allowing a minimum and a maximum number of letters in the prefix. The main advantage is to reduce the cost compared to the indexing of all prefixes, if you query a prefix that is bigger than the maximum number of letters, the query will be automatically be transformed into a query on several ngrams (queries with fewer letters than the minimum will not trigger the prefix search).
Advantages and drawbacks are the same as the indexing of all prefixes; it is just a small improvement to reduce a bit the cost of indexing by adding more computation in the search.
4) Search prefixes at search time
This approach keeps the indexing process pretty light by keeping only one inverted list per word but it requires to do all the processing at query-time. For example if you perform the query ‘a’, it will mean looking for all words that start by ‘a’ in a big dictionary and then perform a query with all those words in a big OR operand. This approach is very expensive and usually people that select this approach start the prefix search after a certain number of letters in the prefix to avoid having queries that are too expensive.
No indexing overhead for prefix searches
No compromises on the relevance of the search-engine
Produces very complex and slow queries
Only works if the number of documents and words is very small, it does not scale
Engineers working on search have no other choice than spending time to test those different approaches and select the one that seems the best for their use cases, unfortunately all of them have serious drawbacks.
A different approach
At the beginning of Algolia, we worked on a different product: an offline search engine SDK for mobile app developers. And we wanted this engine to provide a fast instant search experience on old devices like an iPhone 3G.
None of the standard approaches described above would have worked on such a low-end hardware, so we had to design a new way to do the indexing with very little RAM and CPU. All that while maintaining an instant search experience (one of the purpose of indexing being to speed up the search operations).
Having already worked on multiple data-structures for search engines prior to Algolia helped me a lot in designing a different way to solve the problem. The first inspiration came from a generational string B-tree I built several years ago. It was pretty similar to what the LevelDB fast key-value store developed by Google is doing (associate an arbitrary size value to an arbitrary size string).
First, I knew that for Algolia I wanted to have a generational data-structure.
Having a generational data-structure makes updating the data a lot faster. Typically, you’ll create two (or more) “generations” of your data-structure. A big one containing the “old” data, and a small one containing the “new” data.
Then, if you need to add/update/delete an object, you’ll update a smaller structure containing only the “new” information, which is much faster.
Merge of Generational data-structure
At some point (generally when your smallest generation reaches a certain size), you’ll need to merge several generations into one for a better efficiency. This operation is called a merge.
In order to have an efficient merge, each generation needs to be sorted, so that you can scan them in parallel while generating the result (merge of the parallel scan is done via a standard heap-merge algorithm).
For Algolia, I decided to represent the words that will be searchable as a radix tree (space-optimized representation of a Trie, below we look at what a radix tree is in practice).
First because a radix tree is sorted, which allowed us to have an efficient merge and to use multiple generations of the tree.
Second because it works very well to process the typo-tolerance in a very efficient way (more details on the typo-tolerance will come in another post).
It is easy to build a radix tree in memory before writing it on disk, but it consumes a lot of RAM. And I wanted to use the least amount of RAM possible, because:
For the initial offline search SDK, we simply didn’t have enough RAM on an iPhone 3G.
For our current SaaS engine, we want the RAM to be focused on making search fast (the indices are all stored on RAM for search), which means indexing should consume the minimum amount of RAM.
What’s great is that you can actually write a radix tree on the fly without having to build the tree in memory if you’re indexing a list of sorted words.
For example, let’s consider that the dataset we want to search on contains 5 objects, each having a single word. Each object will be represented by a value from 1 to 5.
Normally, multiple objects could have the same word, so the leaf would have a list of values. But for simplicity’s sake, we’ll take records containing different words.
Each node of our radix tree will contain a string.
If the node is a leaf, it has a value associated to the corresponding object having this word.
If the node is not a leaf, it can have an associated value, but it can also just be an internal node representing a shared prefix.
The radix-tree we’re going to build will look like this:
Now, let’s build our radix-tree. Seems simple now enough, right? The naive approach is to build the complete data-structure in memory and then to dump it on disk. We do not want to use this approach because we want to keep as much memory as possible on our servers to hosts the indices of our users and provide a fast search! This is why we designed all the indexing to use as little memory as possible while optimizing speed of indexing.
Here’s the trick: we can build this data-structure on-disk directly from the list of words and with only a few kilobytes of memory. To do that, we’ll flush from the RAM the nodes as soon as possible, by building the tree on the fly.
The first step is to order alphanumerically the words of the documents we want to index (which is already done in our example).
Then, we’re going to build our tree using a depth-first search: since a radix tree is sorted in the alphanumeric order, it’ll be easy to flush nodes to the disk as soon as we don’t need them.
1) Add the word Fast
2) Add the word Faster
3) Add the word Test
4) Flush the branch “fast” to disk
5) Add the word toaster
6) Flush the branch “est”
7) Add the word toasting
8) Finally, flush the branch “er”, then “ing”, “oast” and finally “t”
This process is a very efficient way to build a tree because it only keeps in memory a very small number of nodes (in the worst case scenario you will have a number of nodes in memory that is equal to the longest key in the tree, in our case we always have fewer than 100 nodes in memory).Now that this is done, let’s focus on the next part: prefix search!
Remember that we wanted an instant search experience. This means that the engine must be able to retrieve results as-you-type, at each character.
If you type “f”, you want to retrieve the results “fast” and “faster”, so the engine must know that both these words contain the prefix “f”.
And we don’t want to calculate the association prefix-object for all the prefixes, which would take a lot of time and memory, but only when it’s really necessary.
The good news is that our radix tree can tell us precisely where we need to store prefixes: all the nodes in the tree which have children.
In our example, it is only necessary for the nodes “fast”, “t” and “toast”.
You see what I just did right there? Yes! the prefixes can be computed on the fly too! We just need to remember what was in the last few branches that we flushed.
For example, if I just flushed the branch “est”, I’ll remember that the prefix “t” is associated to the object 3 (test). I’ll do the same with “toaster” (object 4) and “toasting” (object 5). And when we’re done processing all the words beginning with “t”, I’ll know that this node should contain the objects 3, 4 and 5.
A few numbers
This way to build prefixes has been built into our engine since 2012 and we index billions of jobs every day with this algorithm. We have measured the impact on a lot of different indices, the impact on search is pretty obvious as we always have the prefix inverted list when we do a query (and we’ll have a lot of RAM handy), but the impact on indexing time and on the size are less obvious. Here is a small summary of the impacts that are very small compared to the advantages it gives at query time:
Impact of prefixes on index size
Impact of prefixes on indexing speed
This approach is significantly better than any of the other approaches presented before, it is actually optimal at query time and has a very reasonable impact on indexing performance!
This optimization of the prefixed indexing is just one of the numerous optimizations we have built in the Algolia engine. The reason we built it is because of our focus on performance and the fact that our engine was developed specifically to allow for the strong constraints of mobiles devices. We also had to think a bit out of the box, mainly by considering the word dictionary and inverted lists as one unique data-structure instead of two independent one.
I look forward to reading your thoughts and comments on this post and continuing to explain how our engine is implemented internally, transparency is part of our DNA! 🙂
We recommend to read the other posts of this series: