Search has been around for a while, to the point that it is now considered a standard requirement in many applications. Decades ago, we managed to make search engines scale by leveraging inverted indexes, a data structure that allows a very quick lookup of documents containing certain words. We’ve come a long way since then and have moved to vector indexes.
To understand why vector search represents such a significant advancement in search capabilities, we need to look back and see how search functionality has developed over time.
The early stage of search
Information retrieval is a pretty old (and still active) research area. Every year, conferences like ECIR or SIGIR attract considerable interest from researchers and engineers around the world to discuss progress made in this field. The Algolia team attended the recently held ECIR conference, and you can check out our key highlights from the event.
The development of search functionality as we know it can probably be traced as far back as the 1950s, as researchers tried to solve the problem of efficient information retrieval across large databases. Progress was made steadily with the increase in computing power and the introduction of new concepts, such as the inverted index to scale better.
This data structure allows us to fetch documents — matching specific words with great speed and accuracy — while also being potentially distributed, enabling the architecture to scale infinitely. In the late 1990s, this was the kind of structure that allowed Google to scale on the size that was the Internet at the time.
For search functionality that is truly performant, fetching is just one part of the equation, the other is ranking. A major breakthrough in this area has been TF-IDF (term-frequency, inverse-document-frequency). TF-IDF is the dominant term-weighting scheme that assigns a score to each piece of content, ranking it in relevance to each user query. This score computes the number of matches of query words in a document (TF) and the frequency of said words in the whole corpus (IDF).
The combination of the two makes sure that documents ranked first are matching the query well, and ideally on words that are rare and specific. That is, matching on articles such as the word “the” is less important than matching on less frequent, but more important terms like the word “cat.”
While innovations, like inverted indexes and term weighting schemes, are central to making search actually work and scale, there were also some serious limitations that were inherent with these early search engines. Since the 1990s the industry has been working to address these challenges in a variety of ways:
Because search engines used to store keywords and corresponding documents containing those keywords in inverted indexes, there was a need to formalize what each keyword was. This can be a difficult challenge due to a variety of reasons such as: hyphenated words may be segmented differently, languages work differently, uppercase and lowercase words may be treated differently, etc.
In each instance, it may seem easy to solve these challenges but there will always be exceptions for each word that may create edge cases. For example, we may decide to lowercase all text, but then someone looking for “Bush” may find documents that are not about “George Bush.” Coming up with these rules across each language is an extremely tedious process and can result in a suboptimal search experience.
If someone looks for the word “cats,” they will not find documents that contain the word “cat.” The solution in general for this challenge is stemming, where we remove the suffix of all words both in queries and documents, so that someone looking for “cats” will retrieve documents containing “cat” and “cats.”
The limitation of this approach is, once again, the resulting exceptions and edge cases. For example, “universal” and “university” are stemmed to the same token.
Word ambiguity and synonyms
Some words are ambiguous, such as “jaguar.” If the query is just this word, it is impossible to really guess what the intent is (do they mean the American professional football team, the automobile, the animal, or something entirely different). But if the query is “jaguar zoo” or “jaguar price,” intents should be easily differentiable based on the context. However, search engines that rely simply on words alone cannot understand the context by default.
On the other hand, some words are different but refer to the same concept, for example “e-mail” and “email.” Or, it would be relevant for a query “chicken” to retrieve documents containing “hen.” The solution in general is to come up with a list of synonyms so the engine maps words together.
Because spelling needs to be accurate in keyword search engines, misspelled words will not allow the user to retrieve the right results unless an autocorrect feature is in place. Autocorrecting queries need some specific development, since the impact/types of misspellings will differ depending on whether you are an ecommerce platform, a publisher, or a legal document library.
In addition to the challenges presented so far, language support adds another layer of complexity. For each one of the limitations above, additional work will be needed to resolve each challenge across every language supported by the system. Assuming you have fine tuned your engine with proper tokenization, stemming, synonyms, and autocorrect for English, you will need to develop the same for every other supported language, adding tremendous complexity to your project.
New breakthroughs and expanded horizons
Recent years have been marked by the development of more and more powerful large language models (LLMs). Their power comes from different sources such as:
- Multitasking, e.g., ELMo was trained to perform multiple tasks with equal or better quality than independent models
- Transformer models, which scale much better
- Multilingual models, such as M-BERT
- Massive models trained on massive datasets, such as GPT4
The progress of the last 5 years at a very high pace of innovation has led us to where we are today. That is, language models that are much more scalable and powerful than they were back in 2017. This breakthrough is significant for the field of search engines, because it naturally removes a lot of the challenges and limitations mentioned earlier in this article.
LLMs can be used to understand both queries and documents by creating their vector representation. This is a semantic representation of the entire content, which removes the points of failure listed earlier that occur because of the word-by-word approach of traditional keyword search.
Stemming is also not needed anymore, since the vector representation of the word “cat” will be very close to that of “cats.’ Even better, the vector representation of “when was Barack Obama born” will be very close to the one of “how old is the 44th President of the US.” Vector search engines are naturally much better at assessing semantic similarity, while with keyword search a lot of manual work is needed to reach this level.
Cross-language engines that are resilient to user misspelling are also much more accessible using vector search. First, because these models support more and more languages out of the box. Second, because the data on which these LLMs are trained typically come from the Internet. Misspellings are available at training and models learn about their representation. Furthermore, models are designed to support word variations much better in recent years too.
Hence today, building a search engine that works in all situations and that really understands the query and documents is much more accessible and low maintenance. There are still important challenges but they have shifted from how to be relevant to how to make these models scale. Even if models are more scalable, running them in real-time and making predictions with high throughput remains a complex problem.
Luckily, Algolia has a proven track record in scalability, availability, and robustness to bring the best of both worlds to our customers. We’ve also designed a proprietary NeuralHashing solution to compress vectors to a fraction of their size while retaining up to 99% of the information. This allows us to deliver vector-based results as fast as keyword results and can even combine them into a single API call. With vector search, you will spend time on things that are specific to your use case, and less on fine tuning relevance, optimizing some queries, or wondering how to build a reliable search on your app.
Vector search is definitely solving a lot of problems that you can experience with more standard keyword search. But it is important to remember that keyword search still has many advantages outside of the edge-cases presented above. Someone looking for an iPhone, for example, will simply type “iphone,” and there is no need to rely on vector search for this straightforward query. A great advantage of keyword search is that because it’s a pretty standard technology, it’s extremely fast and very cost effective.
So, an ideal system would benefit from the best of both types of search functionality, i.e., using keyword search when the query is simple and known, and vector search for the long tail of unique and rare queries. Such a system can make sure that every query generated by users will always retrieve results that are fast, relevant, and accurate.
Where do we go from here?
At Algolia, we have managed to build out such a solution to combine these functionalities. Learn more about our newly launched Algolia NeuralSearch solution.
Contact our team to learn more about how you can start leveraging Algolia Neuralsearch today.