Vector vs Keyword Search: Why You Should Care
Search has been around for a while, to the point that it is now considered a standard requirement in many ...
Senior Machine Learning Engineer
Jan 17th 2023 ai
For many years search engines have predominantly relied on keywords, much like the indexes you find in the back of books. Unless a query matches a keyword in your index, the search engine can come up empty-handed. While the concept of “matching” has traditionally powered search engines, a major shift from “matching” to “understanding” is currently underway. This is being driven by AI, which is used to represent text mathematically such that it can be conceptually understood by machines. Concepts are taking over from keywords and it’s great news for everyone.
In this article, I’ll explain a bit about what concept search is and how the semantic machine learning technology around it is changing. It’s helpful to first understand the limitations of traditional keyword-based models.
Around 80% of all data applicable to business is unstructured (as opposed to structured data like age, weight, price, addresses, etc.). In order to find things in unstructured information, search engines have been the tool of choice. The main methodology behind this has been the tokenization of keywords, which splits text into common pieces (essentially lookup keys) which are then used to build indexes.
Each token (word, phrase, n-gram, stem, lemma, etc.) is linked to the records where it occurs. The same tokenization process is then applied to queries, and the resulting tokens are used to find matching items. This “matching” process forms the basis of keyword search retrieval. In this context, retrieval means finding the relevant matches for a query. Ranking is then typically used to put the results in the most useful order.
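To make the tokenize-then-index idea concrete, here is a minimal sketch of an inverted index in Python. The function names (tokenize, build_index, search) and the sample records are made up for illustration, and the tokenizer is deliberately naive (lowercased word tokens only, no stemming or n-grams).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records):
    """Map each token to the set of record ids where it occurs."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return index

def search(index, query):
    """Return ids of records that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical sample records, for illustration only.
records = {
    1: "Stainless steel water bottle",
    2: "Insulated steel travel mug",
    3: "Glass water jug",
}
index = build_index(records)
print(search(index, "steel bottle"))  # {1}
print(search(index, "flask"))         # set(), no token matches
```

The second query shows the core weakness: a record may be exactly what the searcher wants, but if the query token never occurs in it, nothing comes back.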
For some time TF-IDF was the standard for keyword search. This formula combines Term Frequency (TF), which is the number of occurrences of a keyword in a matching document (more is better), with Inverse Document Frequency (IDF), which measures how common the keyword is across the document corpus (less common is better, hence the “inverse”).
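As a rough illustration, one common textbook form of TF-IDF multiplies the raw term count by the logarithm of (number of documents / number of documents containing the term). The sketch below assumes that form; real engines use various smoothed variants, and the toy corpus is invented.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Score one term for one document using raw TF times log IDF."""
    tf = doc_tokens.count(term)                    # occurrences in this document
    df = sum(1 for doc in corpus if term in doc)   # documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)               # rarer terms get a larger weight
    return tf * idf

# Hypothetical tokenized corpus, for illustration only.
corpus = [
    ["cold", "beer", "cooler"],
    ["beer", "glass"],
    ["beer", "beer", "fridge"],
]

# "beer" occurs in every document, so its IDF (and score) is zero.
print(tf_idf("beer", corpus[2], corpus))
# "fridge" occurs in only one document, so it scores higher.
print(tf_idf("fridge", corpus[2], corpus))
```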
TF-IDF worked OK, but the gold standard today is the family of BM25 variants. BM25 set out to solve some of the deficiencies of TF-IDF, mostly that TF is very susceptible to spamming. It dampens the TF component so that each additional match counts for less than the one before. It also normalizes by document length to correct for longer documents containing more keywords.
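The dampening and length correction can be seen in a small sketch of a per-term BM25 score. This assumes the widely used formulation with parameters k1 and b and a smoothed IDF; the defaults (k1=1.2, b=0.75) are conventional choices rather than anything specified in this article.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score a single query term against a single document."""
    # Smoothed IDF: rarer terms get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturating TF: approaches idf * (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + norm)

# Repeating a keyword helps less and less (made-up corpus statistics).
for tf in (1, 2, 4, 8):
    score = bm25_term_score(tf, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=50)
    print(tf, round(score, 3))
```

Because the TF part can never exceed k1 + 1, stuffing a document with the same keyword quickly stops paying off.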
Today the most important variant is BM25F, which includes relative field importance in the calculation. This allows a title match, for example, to count for more than a match in the middle of the document text.
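Field weighting can be sketched as follows. This is a simplified form of BM25F that combines per-field term frequencies using field weights before applying the usual saturation; it leaves out per-field length normalization, and the field names and weights are assumptions chosen for illustration.

```python
def bm25f_term_score(field_tfs, field_weights, idf, k1=1.2):
    """Simplified BM25F: weight term counts per field, then saturate.

    field_tfs maps field name -> term frequency in that field.
    field_weights maps field name -> relative importance of that field.
    Per-field length normalization is omitted to keep the sketch short.
    """
    weighted_tf = sum(field_weights[f] * tf for f, tf in field_tfs.items())
    return idf * weighted_tf / (k1 + weighted_tf)

# Hypothetical weights: a title match counts three times as much as a body match.
weights = {"title": 3.0, "body": 1.0}

title_hit = bm25f_term_score({"title": 1, "body": 0}, weights, idf=1.0)
body_hit = bm25f_term_score({"title": 0, "body": 1}, weights, idf=1.0)
print(title_hit > body_hit)  # True
```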
Before pointing out the issues with BM25F, keep in mind this is still the gold standard in 2023 for keyword search. This is the benchmark technique to beat in academia.
Everything mentioned above works on a “bag of words” approach. The sequence of words is ignored and only their individual intersection with the target document is important. This causes many problems in real-world search scenarios, particularly with shorter-form structured data.
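A quick way to see what “bag of words” throws away is to compare two phrases that use the same words in a different order. The helper name below is made up for this sketch.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, discarding word order entirely."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Opposite meanings, identical bags.
print(a == b)  # True
```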
As organizations moved online, enterprise search became a key requirement for knowledge management. As data and information assets exploded in volume, the importance of enterprise search has only increased. Yet without document enrichment with intelligent metadata, auto-classification, taxonomy management, and other methods of adding structure, relevance has typically been poor. The result is that people at work cannot find relevant documents, and this is a big problem.
Keywords are hard for search engines. You have synonymy (multiple words with shared meaning), polysemy (words with multiple meanings), sequence (order is sometimes important but not always), abbreviations, asymmetry (query words not expected to appear in target results), and more.
In general, keyword search implies you already know the answer to what you’re looking for and how it will be explicitly described. For example:
There are workarounds to these problems, but they can be time-consuming and never-ending.
In a traditional sense the goal of search is to take a query and try to find occurrences of it in a set of items, much like the index in the back of a book. This assumes a symmetrical relationship between the query and result text, i.e., you search with the answer, not with the question. Symmetry assumes you already know the answer.
The context of keywords is typically not useful enough to determine the searcher’s intent. Take the simple example of “bank”. When someone types this, they could mean:
Above is a good example of polysemy. This can also be extended to asymmetry. For example, if someone searches for “plane turning” this may not return a result that says “plane banked”, yet the meaning is similar. “Plane” itself is also an example of polysemy and an abbreviation of “aeroplane”!
Compound term processing combines terms into groups whose meaning differs from that of the individual terms. One example is “new jersey”, which has a totally different meaning from “new” AND “jersey” as individual terms. In practice, keyword search usually handles compound queries well: it typically requires all terms to match and scores records containing the exact sequence higher than records that merely contain all the individual terms. However, it struggles with partially compounded terms, “bank” being a great example. It will match all contextual occurrences of “bank” as there is no way to determine which context is correct.
Note: the above also assumes queries are treated as AND (all terms are required to match). In practice, some keyword search uses OR, which can match any of the query terms and is thus far more likely to return contextually irrelevant results. Some search technologies also use a hybrid approach that treats some text as AND and other text as OR, which can be smart or naive in nature. Boolean search gives the searcher control over how things are matched by allowing syntax in the query such as quotes and the “AND”, “OR”, and “NOT” operators. This can be useful but is generally beyond comprehension for the average person searching.
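The difference between AND and OR matching is easy to sketch with sets. The match function and the sample records below are hypothetical; a real engine would also score exact sequences higher, which this sketch does not attempt.

```python
def match(records, query, mode="AND"):
    """Return ids of records matching all (AND) or any (OR) query terms."""
    terms = set(query.lower().split())
    hits = []
    for record_id, text in records.items():
        words = set(text.lower().split())
        if mode == "AND":
            matched = terms <= words        # every term must be present
        else:
            matched = bool(terms & words)   # any shared term is enough
        if matched:
            hits.append(record_id)
    return hits

# Hypothetical records, for illustration only.
records = {
    1: "new jersey devils jersey",
    2: "new cycling jersey",
    3: "new york bank",
    4: "river bank erosion",
}

print(match(records, "new jersey", "AND"))  # [1, 2]
print(match(records, "new jersey", "OR"))   # [1, 2, 3]
```

OR pulls in the “new york bank” record purely because it shares the word “new”. Even AND returns the cycling jersey, since term matching alone cannot tell that “new jersey” was meant as a place.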
Keyword search works well for the “fat head” queries that represent the most popular searches. However, “long tail” searches frequently fail, and they can be 50% or more of the queries you receive. The ways that keywords fail when searching are endless. People have spent massive amounts of time writing rules, dictionaries, synonym libraries, and more. As I’ll show, keywords are still quite useful, but they’re even better when paired with AI.
Keywords (and their associated tokens) are relatively binary with respect to search: particular words either exist or they do not. Concept search is based on vectors. The mathematics of vectors allows closeness to be measured, so the relationship between pieces of text is no longer binary but rather a distribution.
Text is represented as vectors, and pieces of text with close conceptual meanings have very similar vectors. Typically the vector orientation is used rather than the magnitude, so the angle between the vectors becomes a measure of similarity. The cosine of that angle is called cosine similarity, and it would be very familiar to anyone who has done high school math! The only difference is that vectors representing text use hundreds of dimensions, so they are much harder to picture than a two-dimensional diagram.
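Cosine similarity itself is only a few lines of code. The sketch below uses plain Python and made-up three-dimensional vectors; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", invented for illustration only.
cooler = [0.9, 0.1, 0.2]
fridge = [0.8, 0.2, 0.1]
laptop = [0.1, 0.9, 0.3]

print(cosine_similarity(cooler, fridge))  # close to 1: similar concepts
print(cosine_similarity(cooler, laptop))  # much lower: unrelated concepts
```

Scaling a vector does not change the result, which is what it means to use orientation rather than magnitude.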
How is text turned into vectors? Neural networks are used to look at word sequences and build vector-based models that can convert text to vectors, called embeddings. There are many examples of these, with more appearing all the time. For example, Airbnb uses embeddings to help power its similar listings feature.
Concepts are great, but they also blur out the query meaning, so keywords are actually still useful. Thus, state of the art search is actually built on what is called “hybrid retrieval”, which is a combination of keyword and concept-based search.
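One generic way to combine a keyword ranking with a vector ranking is reciprocal rank fusion, where each result earns 1 / (k + rank) from every list it appears in. The sketch below is an illustration of that general technique under stated assumptions (k=60 is a conventional constant, and the document ids are invented); it is not a description of how any particular engine, including the one discussed in this article, merges results.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute the most.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the two retrievers.
keyword_ranking = ["A", "B", "C"]
vector_ranking = ["B", "D", "A"]

print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
# ['B', 'A', 'D', 'C']
```

Items found by both retrievers rise to the top, while items found by only one still appear further down.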
Here are some of the ways we designed hybrid retrieval in our new AI engine.
With the addition of neural hashing, Algolia is the only company with a scalable hybrid offering capable of working for many different use cases right out of the box. We can now offer search that is just as fast (and often faster) and more accurate than keywords-only. One of my favorite examples is running a query on a Best Buy dataset for the phrase “something to keep my beer cold.” If someone walked into your store and asked for “something to keep my beer cold,” you would know exactly what they mean. A keyword-only search engine would have a rough time. However, a hybrid retrieval engine is able to understand the concepts to deliver incredible results in 0.001043 seconds!
Our demo site doesn’t contain any additional metadata. The terms “cold” and “beer” don’t appear on any of the records on the site, but the site understands the concepts!
Stay tuned. The new Algolia search experience is coming soon! Or, sign up to be notified when it’s available.