Search has been around for a while, to the point that it is now considered a standard requirement in many applications. Decades ago, we managed to make search engines scale by leveraging inverted indexes, a data structure that allows a very quick lookup of documents containing certain words. We’ve come a long way since then and have moved to vector indexes.

To understand why vector search represents such a significant advancement in search capabilities, we need to look back and see how search functionality has developed over time. 

The early stage of search

Information retrieval is a pretty old (and still active) research area. Every year, conferences like ECIR or SIGIR attract considerable interest from researchers and engineers around the world to discuss progress made in this field. The Algolia team attended the recently held ECIR conference, and you can check out our key highlights from the event.

The development of search functionality as we know it can probably be traced as far back as the 1950s, as researchers tried to solve the problem of efficient information retrieval across large databases. Progress was made steadily with the increase in computing power and the introduction of new concepts, such as the inverted index to scale better.

This data structure lets us fetch documents matching specific words with great speed and accuracy. It can also be distributed across machines, which lets the architecture scale horizontally. In the late 1990s, this was the kind of structure that allowed Google to scale to the size of the Internet at the time.
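To make the idea concrete, here is a minimal sketch of an inverted index. The function names, the whitespace tokenization, and the sample documents are assumptions made for illustration, not how any production engine is built.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # One posting set per query word; a missing word yields an empty set.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

# Made-up sample documents.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "a jaguar is a big cat",
}
index = build_inverted_index(docs)
print(search(index, "the cat"))
```

The lookup never scans the documents themselves: it only intersects the small sets stored under each query word, which is why this structure stays fast as the corpus grows.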

For search functionality that is truly performant, fetching is just one part of the equation; the other is ranking. A major breakthrough in this area was TF-IDF (term frequency, inverse document frequency), a dominant term-weighting scheme that assigns a score to each piece of content, ranking it by relevance to each user query. The score combines the number of matches of query words in a document (TF) with a weight that shrinks as those words become more frequent across the whole corpus (IDF).

The combination of the two makes sure that the documents ranked first match the query well, ideally on words that are rare and specific. That is, matching on a very common word such as “the” is less important than matching on a less frequent but more informative term such as “cat.”
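As a rough illustration of that weighting, the sketch below scores documents with one common TF-IDF variant (relative term frequency multiplied by the log of inverse document frequency). The exact formula, the function name, and the sample corpus are assumptions; real engines use many refinements.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document as the sum of tf * idf over the query words."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # the word never appears in the corpus
            tf = counts[term] / max(len(words), 1)
            idf = math.log(n_docs / df[term])
            score += tf * idf
        scores[doc_id] = score
    return scores

# Made-up corpus: "the" is in every document, "cat" in only one.
docs = {
    1: "the cat sat on the mat",
    2: "the dog chased the ball",
    3: "the bird sang",
}
scores = tf_idf_scores(docs, "the cat")
print(scores)
```

Because “the” appears in every document, its IDF is log(1) = 0 and it contributes nothing to the score; only the match on “cat” lifts document 1 above the others.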

Limitations

While innovations like inverted indexes and term-weighting schemes are central to making search work and scale, these early search engines also had some serious inherent limitations. Since the 1990s, the industry has been working to address these challenges in a variety of ways:

Text processing

Because search engines stored keywords and the corresponding documents containing those keywords in inverted indexes, there was a need to formalize what each keyword was. This is difficult for a variety of reasons: hyphenated words may be segmented differently, languages work differently, uppercase and lowercase words may be treated differently, and so on.

In each instance, it may seem easy to solve these challenges, but there will always be exceptions that create edge cases. For example, we may decide to lowercase all text, but then someone looking for “Bush” may find documents about bushes rather than about “George Bush.” Coming up with these rules for each language is an extremely tedious process and can result in a suboptimal search experience.

Exact matches

If someone looks for the word “cats,” a strict keyword engine will not return documents that contain only the word “cat.” The general solution to this challenge is stemming, where we strip the suffixes of words in both queries and documents, so that someone looking for “cats” will retrieve documents containing “cat” and “cats.”

The limitation of this approach is, once again, the resulting exceptions and edge cases. For example, “universal” and “university” are stemmed to the same token.
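The following sketch shows both the benefit and the failure mode with a deliberately naive suffix stripper. The suffix list and the `naive_stem` helper are hypothetical and far simpler than a real stemmer such as Porter's, but they produce the same kind of collision.

```python
# Illustrative suffix rules only, checked in order.
SUFFIXES = ("ity", "al", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ("cat", "cats", "universal", "university"):
    print(word, "->", naive_stem(word))
```

“cat” and “cats” collapse to the same token, which is the desired effect, but so do “universal” and “university,” which a user would rarely consider interchangeable.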

Word ambiguity and synonyms

Some words are ambiguous, such as “jaguar.” If the query is just this word, it is impossible to guess the intent (do they mean the American professional football team, the automobile, the animal, or something entirely different?). But if the query is “jaguar zoo” or “jaguar price,” the intents should be easy to tell apart from the context. However, search engines that rely on words alone cannot understand the context by default.

On the other hand, some words are different but refer to the same concept, for example “e-mail” and “email.” Similarly, it would be relevant for the query “chicken” to retrieve documents containing “hen.” The general solution is to come up with a list of synonyms so the engine maps words together.
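A synonym list of this kind can be sketched as a simple query expansion step. The `SYNONYMS` table and both helper functions below are hypothetical; the point is only to show that every mapping has to be written and maintained by hand.

```python
# Hand-maintained synonym table (illustrative entries only).
SYNONYMS = {
    "e-mail": {"email"},
    "email": {"e-mail"},
    "chicken": {"hen"},  # one-way: "hen" does not expand to "chicken"
}

def expand_query(query):
    """Return, for each query word, the set of terms accepted as a match."""
    expanded = []
    for word in query.lower().split():
        expanded.append({word} | SYNONYMS.get(word, set()))
    return expanded

def matches(document, query):
    """True if every query word, or one of its synonyms, is in the document."""
    doc_words = set(document.lower().split())
    return all(terms & doc_words for terms in expand_query(query))

print(matches("a hen laid an egg", "chicken"))
print(matches("fried chicken recipe", "hen"))
```

Note that the table is directional: the query “chicken” finds the document about a hen, but the query “hen” does not find the chicken recipe unless someone adds that mapping too.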

Misspelling

Because spelling needs to be accurate in keyword search engines, misspelled words will not allow the user to retrieve the right results unless an autocorrect feature is in place. Autocorrecting queries needs some specific development, since the types and impact of misspellings will differ depending on whether you are an ecommerce platform, a publisher, or a legal document library.
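One common building block for autocorrect is edit distance. The sketch below uses the classic Levenshtein distance to map a misspelled query word to the closest word in a known vocabulary; the `correct` helper, the vocabulary, and the distance threshold are assumptions chosen for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def correct(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word within max_distance, else the input."""
    best = min(vocabulary, key=lambda candidate: edit_distance(word, candidate))
    return best if edit_distance(word, best) <= max_distance else word

# Made-up vocabulary; a real one would come from the indexed content.
VOCABULARY = ["iphone", "jaguar", "university"]
print(correct("ipohne", VOCABULARY))
```

The right threshold depends on the domain: a tolerance that helps shoppers on an ecommerce site could silently change the meaning of a query in a legal document library.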

Language

In addition to the challenges presented so far, language support adds another layer of complexity. Each of the limitations above has to be resolved again for every language supported by the system. Assuming you have fine-tuned your engine with proper tokenization, stemming, synonyms, and autocorrect for English, you will need to develop the same for every other supported language, adding tremendous complexity to your project.

New breakthroughs and expanded horizons 

Recent years have been marked by the development of more and more powerful large language models (LLMs). Their power comes from different sources such as:

  • Multitasking, e.g., ELMo was trained to perform multiple tasks with equal or better quality than independent models
  • Transformer models, which scale much better
  • Multilingual models, such as M-BERT
  • Massive models trained on massive datasets, such as GPT-4

The rapid pace of innovation over the last five years has led us to where we are today: language models that are much more scalable and powerful than they were back in 2017. This breakthrough is significant for search engines because it naturally removes many of the challenges and limitations mentioned earlier in this article.

LLMs can be used to understand both queries and documents by creating their vector representation. This is a semantic representation of the entire content, which removes the points of failure listed earlier that occur because of the word-by-word approach of traditional keyword search.

Stemming is also no longer needed, since the vector representation of the word “cat” will be very close to that of “cats.” Even better, the vector representation of “when was Barack Obama born” will be very close to that of “how old is the 44th President of the US.” Vector search engines are naturally much better at assessing semantic similarity, whereas keyword search needs a lot of manual work to reach this level.

Cross-language engines that are resilient to user misspellings are also much more accessible with vector search. First, these models support more and more languages out of the box. Second, the data on which these LLMs are trained typically comes from the Internet, so misspellings are present at training time and the models learn representations for them. Furthermore, recent models are designed to handle word variations much better.
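“Very close” is usually measured with cosine similarity between vectors. The sketch below computes it on tiny hand-written vectors; these numbers are invented purely for illustration and were not produced by any real model, whose vectors would have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hand-written 3-dimensional vectors, NOT the output of a real model.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "cats": [0.85, 0.15, 0.05],
    "price": [0.0, 0.2, 0.9],
}

print(cosine_similarity(TOY_VECTORS["cat"], TOY_VECTORS["cats"]))
print(cosine_similarity(TOY_VECTORS["cat"], TOY_VECTORS["price"]))
```

With these toy values, “cat” and “cats” score close to 1 while “cat” and “price” score close to 0. In a real system the same comparison happens between the query vector and every candidate document vector, with no stemming rules involved.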

Hence today, building a search engine that works in all situations and that really understands the query and documents is much more accessible and low maintenance. There are still important challenges but they have shifted from how to be relevant to how to make these models scale. Even if models are more scalable, running them in real-time and making predictions with high throughput remains a complex problem.

Luckily, Algolia has a proven track record in scalability, availability, and robustness to bring the best of both worlds to our customers. We’ve also designed a proprietary NeuralHashing solution to compress vectors to a fraction of their size while retaining up to 99% of the information. This allows us to deliver vector-based results as fast as keyword results, and even to combine them in a single API call. With vector search, you will spend time on things that are specific to your use case, and less on fine-tuning relevance, optimizing individual queries, or wondering how to build reliable search into your app.

Vector search definitely solves a lot of the problems you can experience with more standard keyword search. But it is important to remember that keyword search still has many advantages outside of the edge cases presented above. Someone looking for an iPhone, for example, will simply type “iphone,” and there is no need to rely on vector search for this straightforward query. A great advantage of keyword search is that, because it’s a pretty standard technology, it’s extremely fast and very cost effective.

So, an ideal system would benefit from the best of both types of search functionality, i.e., using keyword search when the query is simple and known, and vector search for the long tail of unique and rare queries. Such a system can make sure that every query generated by users will always retrieve results that are fast, relevant, and accurate. 
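One very simple way to picture such a system is a router that tries keyword search first and falls back to vector search when keywords find nothing. The sketch below is only an assumption about how the idea could be wired together, not a description of how Algolia NeuralSearch works; the catalog, both search functions, and the `min_results` threshold are hypothetical, and the vector search is a stub.

```python
# Made-up catalog for the example.
CATALOG = {1: "iphone 15 case", 2: "android phone charger"}

def keyword_search(query):
    """Return ids of items that contain every query word."""
    words = set(query.lower().split())
    return [doc_id for doc_id, text in CATALOG.items()
            if words <= set(text.split())]

def vector_search(query):
    """Stub: a real system would embed the query and find nearest neighbors."""
    return list(CATALOG)

def hybrid_search(query, min_results=1):
    """Use keyword results when there are enough of them, else fall back."""
    results = keyword_search(query)
    if len(results) >= min_results:
        return "keyword", results
    return "vector", vector_search(query)

print(hybrid_search("iphone"))
print(hybrid_search("something to protect my apple phone"))
```

The short, well-known query is answered by the cheap keyword path, while the long, unusual one falls through to the vector path. A production system could also blend and re-rank both result sets rather than choosing one.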

Where do we go from here?

At Algolia, we have managed to build out such a solution to combine these functionalities. Learn more about our newly launched Algolia NeuralSearch solution.

Contact our team to learn more about how you can start leveraging Algolia NeuralSearch today.

About the author
Nicolas Fiorini

Director, AI Engineering
