Search has been around for a while, to the point that it is now considered a standard requirement in many applications. Decades ago, we managed to make search engines scale by leveraging inverted indexes — data structures that allow very quick lookup of the documents containing certain words. We’ve come a long way since then and have since moved on to vector indexes.

To understand why vector search represents such a significant advancement in search capabilities, we need to look back and see how search functionality has developed over time. 

The early stage of search

Information retrieval is a pretty old (and still active) research area. Every year, conferences like ECIR or SIGIR attract considerable interest from researchers and engineers around the world to discuss progress made in this field. The Algolia team attended the recently held ECIR conference, and you can check out our key highlights from the event.

The development of search functionality as we know it can probably be traced as far back as the 1950s, as researchers tried to solve the problem of efficient information retrieval across large databases. Progress was made steadily with the increase in computing power and the introduction of new concepts, such as the inverted index to scale better.

This data structure lets us fetch documents matching specific words with great speed and accuracy, and because it can be distributed, the architecture can scale almost without limit. In the late 1990s, this was the kind of structure that allowed Google to scale to the size of the Internet at the time.
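As a toy sketch of the idea (illustrative only, not Algolia’s implementation), an inverted index is little more than a map from each term to the set of documents containing it:

```python
from collections import defaultdict

docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "dogs and cats make good pets",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    # A query is answered by intersecting the posting sets of its terms.
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

search("the cat")  # -> {1, 2}
```

Because each term’s posting set can live on a different machine, lookups parallelize naturally — which is exactly what made this structure scale so well.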

For search functionality that is truly performant, fetching is just one part of the equation; the other is ranking. A major breakthrough in this area was TF-IDF (term frequency–inverse document frequency), the long-dominant term-weighting scheme that assigns each piece of content a score ranking its relevance to a user query. The score combines how often the query words appear in a document (TF) with how rare those words are across the whole corpus (IDF).

The combination of the two ensures that the documents ranked first match the query well, ideally on words that are rare and specific. That is, matching on a common word such as “the” matters less than matching on a less frequent but more informative term such as “cat.”
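To make the weighting concrete, here is a minimal sketch of one common TF-IDF variant (production engines typically use refinements such as BM25):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat",
    "a bird and a dog",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc):
    # Raw count of the term in one tokenized document.
    return doc.count(term)

def idf(term):
    # Terms that are rarer across the corpus get a higher weight.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df) if df else 0.0

def score(query, doc):
    return sum(tf(t, doc) * idf(t) for t in query.split())

# "the" appears in two documents, so it weighs less than the rarer "cat".
scores = [score("the cat", doc) for doc in tokenized]
```

With this corpus, the first document outscores the second (which matches only “the”), and the third scores zero.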

Limitations

While innovations like inverted indexes and term-weighting schemes are central to making search actually work and scale, these early search engines also had serious inherent limitations. Since the 1990s, the industry has been working to address the following challenges in a variety of ways:

Text processing

Because search engines used to store keywords and the documents containing them in inverted indexes, each keyword had to be formalized. This can be difficult for a variety of reasons: hyphenated words may be segmented differently, tokenization rules vary across languages, uppercase and lowercase forms may be treated differently, and so on.

In each instance the fix may seem easy, but every rule has exceptions that create edge cases. For example, we may decide to lowercase all text, but then someone looking for “Bush” may find documents that are not about “George Bush.” Coming up with these rules for each language is an extremely tedious process and can result in a suboptimal search experience.
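A toy illustration of the case-folding trade-off described above:

```python
def normalize(text):
    # Naive normalization: case-fold, then split on whitespace.
    return text.lower().split()

doc = "George Bush gave a speech"
# After case folding, a query for the shrub "bush" also matches the name:
hit = "bush" in normalize(doc)  # True, though the document is not about shrubs
```

Every normalization rule trades recall for precision in cases like this, and the right trade-off differs per language and per corpus.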

Exact matches

If someone looks for the word “cats,” they will not find documents that contain only the word “cat.” The general solution to this is stemming: removing the suffixes of words in both queries and documents, so that someone looking for “cats” retrieves documents containing “cat” as well as “cats.”

The limitation of this approach is, once again, the resulting exceptions and edge cases. For example, “universal” and “university” are stemmed to the same token.
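A deliberately naive suffix-stripping stemmer (a toy, not the Porter algorithm) shows both the benefit and the collision:

```python
# Suffixes to strip, longest first; a real stemmer has far more rules.
SUFFIXES = ["ies", "ity", "es", "al", "s"]

def stem(word):
    for suf in SUFFIXES:
        # Only strip if a reasonable stem (>= 3 letters) remains.
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

stem("cats")        # -> "cat": "cats" now matches documents containing "cat"
stem("universal")   # -> "univers"
stem("university")  # -> "univers": two unrelated words collapse to one token
```

The last two lines are exactly the “universal”/“university” edge case: the stemmer cannot know the words are semantically unrelated.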

Word ambiguity and synonyms

Some words are ambiguous, such as “jaguar.” If the query is just this word, it is impossible to guess the intent (do they mean the American professional football team, the automobile, the animal, or something else entirely?). But if the query is “jaguar zoo” or “jaguar price,” the intents are easily differentiated from context. Search engines that rely on words alone, however, cannot understand this context by default.

On the other hand, some words are different but refer to the same concept, for example “e-mail” and “email.” Or, it would be relevant for the query “chicken” to retrieve documents containing “hen.” The general solution is to maintain a list of synonyms so the engine maps these words together.
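A minimal sketch of token-level synonym mapping (the table here is purely illustrative):

```python
# Map spelling variants and synonyms onto one canonical token, applied
# at both index time and query time so the two sides agree.
SYNONYMS = {
    "e-mail": "email",
    "hen": "chicken",
}

def canonicalize(tokens):
    return [SYNONYMS.get(t, t) for t in tokens]

canonicalize(["e-mail", "from", "a", "hen"])  # -> ["email", "from", "a", "chicken"]
```

The maintenance burden is the catch: every synonym pair must be curated by hand, per domain and per language.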

Misspelling

Because spelling needs to be accurate in keyword search engines, misspelled queries will not retrieve the right results unless an autocorrect feature is in place. Autocorrecting queries requires specific development, since the impact and types of misspellings differ depending on whether you are an ecommerce platform, a publisher, or a legal document library.
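Autocorrect commonly builds on edit distance; a classic Levenshtein sketch (illustrative, not any particular engine’s implementation):

```python
def levenshtein(a, b):
    # Dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def autocorrect(query, vocabulary, max_dist=2):
    # Suggest the closest in-vocabulary word within max_dist edits.
    best = min(vocabulary, key=lambda w: levenshtein(query, w))
    return best if levenshtein(query, best) <= max_dist else query

autocorrect("ipohne", ["iphone", "ipad", "macbook"])  # -> "iphone"
```

Even this simple version hints at why autocorrect is use-case specific: the vocabulary and the acceptable edit distance both depend on your corpus.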

Language

In addition to the challenges presented so far, language support adds another layer of complexity. For each of the limitations above, additional work is needed to resolve the challenge for every language the system supports. Assuming you have fine-tuned your engine with proper tokenization, stemming, synonyms, and autocorrect for English, you will need to do the same for every other supported language, adding tremendous complexity to your project.

New breakthroughs and expanded horizons 

Recent years have been marked by the development of more and more powerful large language models (LLMs). Their power comes from different sources such as:

  • Multitasking, e.g., ELMo was trained to perform multiple tasks with equal or better quality than independent models
  • Transformer models, which scale much better
  • Multilingual models, such as M-BERT
  • Massive models trained on massive datasets, such as GPT-4

The rapid pace of innovation over the last five years has led us to where we are today: language models that are much more scalable and powerful than they were back in 2017. This breakthrough is significant for search engines because it naturally removes many of the challenges and limitations mentioned earlier in this article.

LLMs can be used to understand both queries and documents by creating their vector representation. This is a semantic representation of the entire content, which removes the points of failure listed earlier that occur because of the word-by-word approach of traditional keyword search.

Stemming is also no longer needed, since the vector representation of the word “cat” will be very close to that of “cats.” Even better, the vector representation of “when was Barack Obama born” will be very close to that of “how old is the 44th President of the US.” Vector search engines are naturally much better at assessing semantic similarity; with keyword search, a lot of manual work is needed to reach this level.
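Real embeddings come from a trained model; with hand-made toy vectors, the nearness argument looks like this:

```python
import math

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, lower as vectors diverge.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d vectors standing in for real (much higher-dimensional) embeddings.
vec = {
    "cat":  [0.90, 0.10, 0.00],
    "cats": [0.85, 0.15, 0.05],
    "car":  [0.10, 0.90, 0.20],
}

cosine(vec["cat"], vec["cats"])  # close to 1.0: near-identical meaning
cosine(vec["cat"], vec["car"])   # much lower: unrelated concepts
```

A vector search engine ranks documents by exactly this kind of similarity between the query vector and each document vector, so “cat” and “cats” need no stemming rule to end up near each other.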

Cross-language engines that are resilient to user misspellings are also much more attainable with vector search. First, these models support more and more languages out of the box. Second, the data on which these LLMs are trained typically comes from the Internet, so misspellings are present in the training data and the models learn representations for them. Furthermore, recent models are also designed to handle word variations much better.

Hence, building a search engine that works in all situations and truly understands queries and documents is today far more accessible and low-maintenance. Important challenges remain, but they have shifted from how to be relevant to how to make these models scale. Even if models are more scalable, running them in real time and serving predictions at high throughput remains a complex problem.

Luckily, Algolia has a proven track record in scalability, availability, and robustness, bringing the best of both worlds to our customers. We’ve also designed a proprietary NeuralHashing solution that compresses vectors to a fraction of their size while retaining up to 99% of the information. This allows us to deliver vector-based results as fast as keyword results, and even to combine the two in a single API call. With vector search, you will spend more time on what is specific to your use case and less on fine-tuning relevance, optimizing individual queries, or wondering how to build reliable search in your app.

Vector search definitely solves many problems that you can experience with standard keyword search. But it is important to remember that keyword search still has many advantages outside of the edge cases presented above. Someone looking for an iPhone, for example, will simply type “iphone,” and there is no need to rely on vector search for such a straightforward query. And because keyword search is a mature, standard technology, it is extremely fast and very cost-effective.

So, an ideal system benefits from the best of both types of search functionality: keyword search when the query is simple and well known, and vector search for the long tail of unique and rare queries. Such a system ensures that every user query retrieves results that are fast, relevant, and accurate.
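One simplistic way to sketch such routing (illustrative only — not how Algolia NeuralSearch actually decides):

```python
def keyword_engine(query):
    return f"keyword results for {query!r}"

def vector_engine(query):
    return f"vector results for {query!r}"

def hybrid_search(query, head_queries):
    # Popular, well-understood "head" queries go to the cheap, fast keyword
    # engine; the long tail of unique, natural-language queries falls back
    # to the semantic vector engine.
    if query.lower().strip() in head_queries:
        return keyword_engine(query)
    return vector_engine(query)

head = {"iphone", "ipad"}
hybrid_search("iPhone", head)            # served by the keyword engine
hybrid_search("gift for my aunt", head)  # served by the vector engine
```

In practice the routing signal would be richer than a lookup set — query frequency, match quality, or a blended score — but the division of labor is the same.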

Where do we go from here?

At Algolia, we have managed to build out such a solution to combine these functionalities. Learn more about our newly launched Algolia NeuralSearch solution.

Contact our team to learn more about how you can start leveraging Algolia NeuralSearch today.

About the author
Nicolas Fiorini

Senior Machine Learning Engineer
