What is query understanding?

Query understanding is the process of analyzing search queries and translating them into an enhanced query that can produce better search results. It’s one of the most important keys to great search experiences.

What are some examples of query understanding?

Query rewriting, synonyms, spelling corrections, classification, NLP, vectorization, bigram and trigram detection for query segmentation, semantic query understanding, personalization, localization, and query scoping (attribute mapping).

How important is query understanding to great search results?

Absolutely critical. People make a lot of spelling mistakes, and language is inherently ambiguous (“test”, “mobile”, “apps”, “summer”, and “north” are all ambiguous; “Jaguar” is a car, an operating system, an animal, etc.). People also use slang and mention things that may never appear in the result text at all (e.g. “size 14 shoes”, “near a park”, “next Thursday evening”). All of this needs to be translated into something meaningful that can better query the underlying data structure.

In addition, query understanding can also be used to prioritize results. For example, historically in the US or Australia, when people searched a government website for “license,” they were most often looking for the driver’s license renewal page — not the hundreds of other pages mentioning “license”. Historical performance data can automatically elevate the preferred destination, even if nothing in the query itself indicates that it is the most important result.

Query understanding is often a great test of search technology. What if you need to map “size 14” to a size attribute, “next Thursday evening” to a time filter, or automatically spell-correct queries? Can your current implementation do this? If not, you are frustrating your customers and falling behind.

Where should I begin with query understanding techniques?

Many highly useful query understanding techniques, such as synonyms, spell correction, and semantics, belong to the family of query rewriting, which aims to modify the query to represent the query intent better and thus improve overall precision, recall, or both. The order in which these techniques are applied can significantly impact the query outcomes as well as the processing complexity. Below are some of those techniques most commonly used.

Synonyms

Synonyms are fairly straightforward. They are typically run as alternatives, so if someone searches for “gigantic shoes,” the query may expand to “(gigantic OR enormous) shoes”. This is pretty simple, but it gets more complex when working out which alternative to prioritize. It can be even more complex if synonyms are generated using machine learning, e.g., via word embeddings. Newer semantic search capabilities, such as vector search, have a built-in understanding of similar, common terms, so there’s no need to create additional synonym libraries. We’ll touch more on this below.
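
To make the expansion concrete, here is a minimal sketch of query-time synonym expansion, assuming a hand-maintained synonym table and a simple OR syntax; the table and helper function are illustrative only, not Algolia’s synonyms API.

```python
# Minimal sketch of synonym expansion at query time.
# The synonym table and expand_query() helper are illustrative only.
SYNONYMS = {
    "gigantic": ["enormous", "huge"],
    "sneakers": ["trainers"],
}

def expand_query(query: str) -> str:
    """Rewrite each token into an OR-group of itself plus its synonyms."""
    parts = []
    for token in query.lower().split():
        alternatives = [token] + SYNONYMS.get(token, [])
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(token)
    return " ".join(parts)

print(expand_query("gigantic shoes"))  # -> "(gigantic OR enormous OR huge) shoes"
```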

Spelling correction

Spell correction can work in different ways. One approach is to correct words by choosing the single likeliest replacement and then allowing users to override it. Another is to be tolerant and allow up to two-letter spelling errors; this is called typo tolerance. However, correcting is not always the best option (language is ambiguous), and the penalty for being wrong is very high, as people may lose confidence and give up. Like synonyms, vector search solutions can help with many common misspellings.
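
As a rough illustration of typo tolerance, the sketch below accepts index terms within an edit distance of two from the typed word; the word list and threshold are assumptions made for the example.

```python
# Rough sketch of typo tolerance: suggest index terms within an
# edit distance of 2 from the typed word. The word list is illustrative.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

INDEX_TERMS = ["basketball", "baseball", "sneakers", "jacket"]

def correct(word: str, max_typos: int = 2) -> str:
    """Return the closest index term if it is within max_typos edits."""
    best = min(INDEX_TERMS, key=lambda t: levenshtein(word, t))
    return best if levenshtein(word, best) <= max_typos else word

print(correct("basktball"))  # -> "basketball"
```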

Spell correction and typo tolerance are powerful query understanding tools that are deceptively difficult to execute well.

Another approach to spell correction is to look at every possible variant and run all of them. This can explode very quickly and slow down queries. To retain query speed, a search engine would have to bake the prioritization of the ranking algorithm into the data structure itself at the expense of query flexibility. Depending on your application, this may or may not be a good tradeoff.

Our spell correction (as of this writing) combines several techniques. It looks at alternatives but weighs them efficiently, using distance, by the likelihood that they match the searcher’s intent. This is very effective and is trained not only on query data, but also on the result items themselves — product names, titles, etc. It will understand your data immediately — in any language — and then continue to improve over time with more queries.

Query categorization

Query categorization helps the search engine to predict the most relevant category or selection of items. Search models will use previous searches and search data from your site to predict the likelihood that a query belongs to one category or another. In the case of Algolia’s Query Categorization, we use a vector-based semantic model to predict which classifications a search query is most likely to align with.
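 
As a simplified stand-in for a vector-based semantic model, the toy sketch below scores a query embedding against per-category centroids with cosine similarity; the three-dimensional vectors are made up purely for illustration and a real system would use a trained encoder.

```python
# Toy sketch of query categorization with cosine similarity against
# category centroids. The 3-dimensional "embeddings" below are made up;
# a real system would use a trained semantic model.
import math

CATEGORY_CENTROIDS = {
    "shoes":       [0.9, 0.1, 0.0],
    "electronics": [0.1, 0.9, 0.2],
}

# Pretend embeddings for a few queries (stand-in for a real encoder).
QUERY_EMBEDDINGS = {
    "air jordan 11": [0.8, 0.2, 0.1],
    "usb-c charger": [0.2, 0.8, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def categorize(query: str) -> str:
    vec = QUERY_EMBEDDINGS[query]
    return max(CATEGORY_CENTROIDS, key=lambda c: cosine(vec, CATEGORY_CENTROIDS[c]))

print(categorize("air jordan 11"))  # -> "shoes"
```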

Semantic query understanding

Semantic query understanding is the process of actually trying to understand the intent of queries. Language is very ambiguous by nature and the notion of polysemy aptly describes why this is such an issue for search: “poly” means multiple and “semy” in this case means senses, or meanings. “Bank” is a classic example of this; does this mean a financial institution or the side of a river? Without added context, it is difficult to know. English, in particular, is littered with examples. Luckily there are methods to handle this.

For queries with multiple words, the context is typically more obvious. For single search terms, it is more difficult, but in these cases, the past query sequence history can be useful. For example, if someone searched for “atm” and then searched for “bank”, it would be unlikely that the second query had anything to do with the side of a river!

Classification can also be used to understand the type of query intent. This is most useful for query boxes that search datasets with multiple distinct data types. An example is LinkedIn, where you can search for companies, people, jobs, etc. The historical patterns for each query can assist with predicting the most likely result type. Ecommerce is another example, where a support query is different from a product search. For example, searching for “return shoes” should include results for returns, not display shoes for purchasing.
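
A trivial way to picture this routing is a keyword-based intent check, as sketched below; a real system would learn these patterns from historical behavior rather than a hand-written list.

```python
# Toy intent classifier: route support-style queries ("return shoes")
# away from the product index. Keyword lists are illustrative; a real
# system would learn these patterns from historical behavior.
SUPPORT_TERMS = {"return", "refund", "warranty", "shipping", "track"}

def detect_intent(query: str) -> str:
    tokens = set(query.lower().split())
    return "support" if tokens & SUPPORT_TERMS else "product"

print(detect_intent("return shoes"))   # -> "support"
print(detect_intent("running shoes"))  # -> "product"
```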

LinkedIn is a good example where the search intent can be highly ambiguous. Data is key to getting the intent right more often than wrong.

Natural Language Processing (NLP)

NLP is the process of analyzing unstructured text to infer structure and meaning. Structure, in this case, is referring to information that is highly defined, for example, a category or a number, much like fields in a database. It can also represent relationships between things. Common examples include sizes, colors, places, names, times, entities, and intent, but there are many more. NLP is most valuable when the underlying data has a lot of structure that can be mapped from the queries.

NLP will also parse the query to make it simpler to understand. A common example is the use of stemming to revert terms to their roots — “running” and “ran” become “run”. By mapping words to their stems, we can identify synonyms to improve lookup. NLP will also have a list of stop words — words such as “the”, “is”, “and”, and other commonly used terms — to reduce noise from the query.
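
To illustrate, here is a deliberately crude suffix-stripping stemmer and stop-word filter; a production pipeline would use a proper stemmer (such as Porter) and curated stop-word lists per language.

```python
# Deliberately simplified stemming and stop-word removal.
# A real pipeline would use a proper stemmer (e.g. Porter) and a
# curated stop-word list per language.
STOP_WORDS = {"the", "is", "and", "a", "of"}

def crude_stem(token: str) -> str:
    """Strip a few common English suffixes (toy example only)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(query: str) -> list[str]:
    return [crude_stem(t) for t in query.lower().split() if t not in STOP_WORDS]

print(normalize("the running shoes"))  # -> ['runn', 'shoe'] (crude, but variants now collide)
```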

Natural language processing (NLP) is getting closer to human-level interpretation of written language.

Query scoping

NLP offers user-friendly search capabilities.

Query scoping can make ordinary text search appear highly intelligent. This technique attempts to find structure within the query that doesn’t necessarily match unstructured text (reverse indexes) but instead maps directly to structured data attributes.

Take the query “black size 14 basketball shoes”. Product information is highly unlikely to mention the text “size 14”, but the sizing is highly likely to appear in a list of sizes on relevant products. “Black” as a color may be in the description, but may also map to a more specific color attribute. So this query may actually remove that text and query only for “basketball shoes”, while at the same time filtering the result set to size=14 and color=black. This dramatically increases precision, and in the case of a strict AND-based search can also increase recall.
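
Here is a sketch of that exact example, pulling the size and color out of the text and turning them into structured filters; the attribute names and patterns are assumptions made for illustration.

```python
# Sketch of query scoping for "black size 14 basketball shoes":
# extract size and color into structured filters and keep the rest
# as the text query. Attribute names and patterns are illustrative.
import re

KNOWN_COLORS = {"black", "white", "red", "blue"}

def scope_query(query: str):
    filters = {}
    text = query.lower()

    size_match = re.search(r"\bsize (\d+)\b", text)
    if size_match:
        filters["size"] = int(size_match.group(1))
        text = text.replace(size_match.group(0), " ")

    remaining = []
    for token in text.split():
        if token in KNOWN_COLORS:
            filters["color"] = token
        else:
            remaining.append(token)

    return " ".join(remaining), filters

print(scope_query("black size 14 basketball shoes"))
# -> ('basketball shoes', {'size': 14, 'color': 'black'})
```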

The downside of the above technique appears with queries that have potentially mixed meanings. For the example query “size 6 nike”, the sizing may refer to a shoe size, which could be men’s, women’s, kids’, US, UK, or other, a shirt size, a sports bra size, or even a basketball size! Over-filtering (or over-boosting certain results) can cause a significant loss of precision.

Despite the potential issues, the upside far outweighs the downsides. The typical approach to implementing this technique is to remove the specific text identified in the query and convert it into a structured operation (i.e., a filter or a boost). This allows people to use natural language to describe what they want and have it translated into something much more meaningful than the original unstructured query.

Lastly, if structure is not present in your records, that doesn’t mean you can’t generate it yourself. Index-time analysis and data extraction can be a powerful tool for adding structure to your information. Clustering, classification, topic modeling, tagging, and entity extraction are just some of the powerful techniques available.

Word embeddings

Vectorization is the process of converting words into vectors (numbers), which allows their meaning to be encoded and processed mathematically. This underpins language translation and many other amazing applications. The magic of vectors is that they can be added and subtracted, so the meaning of the text can also be added and subtracted. In practice, vectors are used for automating synonyms, clustering documents, detecting specific meanings and intents in queries, and ranking results.
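
The classic “king - man + woman ≈ queen” example shows the idea; the three-dimensional vectors below are hand-crafted so the arithmetic works out, whereas real embeddings are learned from large corpora.

```python
# Toy illustration of word-vector arithmetic with made-up 3-d vectors.
# Real embeddings (word2vec, GloVe, etc.) have hundreds of dimensions
# learned from large corpora; the relationships below are hand-crafted.
import math

VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.8, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.1, 0.9],
}

def add_sub(a, b, c):
    """Compute a - b + c component-wise."""
    return [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]

def nearest(vec, exclude):
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.dist(u, [0] * len(u)) * math.dist(v, [0] * len(v)))
    return max((w for w in VECTORS if w not in exclude),
               key=lambda w: cosine(vec, VECTORS[w]))

# "king" - "man" + "woman" should land closest to "queen".
print(nearest(add_sub("king", "man", "woman"), exclude={"king", "man", "woman"}))
```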

The well-known word-embedding technique word2vec illustrates how language can be converted into mathematical vectors that retain meaning. Even more surprisingly, context can be added and subtracted mathematically while still retaining meaning.

Query segmentation

Query segmentation is the process of identifying sequences of tokens that mean much more together than they do individually. This improves precision by not returning results for partial segments. Two- and three-word phrases (bigrams and trigrams in this context) are useful because they can carry much more meaning than individual words (unigrams).

For example, take the phrase “new york”. On their own, these two words don’t mean much, but together they obviously do! So instead of treating them separately, they can be combined into a single term which can even have its own reverse index. We could look up the indexes of both terms and work out when they appear in sequence to derive the phrase matches, but a) this is more complex, and b) we probably don’t want people searching for “new” to get results containing “new york”. This is also problematic for reinforcement learning and other techniques, where we don’t want machine learning to unfairly penalize partial matches. Hopefully this hasn’t lost you, but the key point is maximizing information context.
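
A minimal sketch of that idea, assuming a known-phrase dictionary, looks like this:

```python
# Sketch of query segmentation: merge known bigrams such as "new york"
# into single tokens so they are matched as a unit. The phrase
# dictionary is illustrative; real systems mine phrases from data.
KNOWN_PHRASES = {("new", "york"), ("san", "francisco"), ("ice", "cream")}

def segment(query: str) -> list[str]:
    tokens = query.lower().split()
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in KNOWN_PHRASES:
            out.append(tokens[i] + "_" + tokens[i + 1])  # one token, one posting list
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(segment("new york pizza"))  # -> ['new_york', 'pizza']
```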

Languages like English and French use a space to separate words, but languages without separators, such as Chinese, Japanese, and Korean, need an additional segmentation algorithm. At Algolia, we also build dictionaries into our libraries. Starting from the left of the query, Algolia will attempt to break down the query using known dictionary terms. In the event of uncertainty, we prioritize the solution that maximizes the length of the words while minimizing the number of characters that do not belong to a known word.
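
The sketch below shows that greedy, longest-match idea with a tiny illustrative dictionary; it is not Algolia’s actual implementation.

```python
# Greedy left-to-right, longest-match segmentation against a dictionary,
# in the spirit described above. The dictionary is a tiny illustration.
DICTIONARY = {"東京", "京都", "大学", "東京大学"}

def segment_cjk(text: str) -> list[str]:
    out, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in DICTIONARY:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # unknown character stands alone
            i += 1
    return out

print(segment_cjk("東京大学"))  # -> ['東京大学'] (longest match wins over 東京 + 大学)
```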

Personalization

Personalization is the process of adding information to a query based on the individual who is searching, so that the results become more relevant to them. This can be as simple as a clothing size preference, gender, location, or any other individual characteristic that can influence their results.

Typically, the first stage of personalization is sending the information through with the queries to be recorded with other analytics. The performance impact can then be analyzed offline to determine if this is valuable. It should be noted that this is useful in its own right from a business intelligence perspective, even if it is not used for personalization.
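
One way to picture this first stage is enriching each query with profile attributes as soft boosts plus analytics tags; the profile fields and payload shape below are illustrative, not a specific search API.

```python
# Sketch of personalization as query enrichment: attach known user
# preferences as soft boosts alongside the text query, and tag the
# query for offline analysis. Fields and format are illustrative.
USER_PROFILES = {
    "user_42": {"clothing_size": "M", "preferred_brand": "nike"},
}

def personalize(query: str, user_id: str) -> dict:
    profile = USER_PROFILES.get(user_id, {})
    return {
        "query": query,
        # Soft preferences: boost matching items rather than filtering
        # everything else out, so results are personalized but not narrowed.
        "boosts": [f"{attr}:{value}" for attr, value in profile.items()],
        "analytics_tags": [f"user:{user_id}"],  # recorded for offline analysis
    }

print(personalize("running jacket", "user_42"))
```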

Queries can often be layered with personal information to augment and influence them, producing highly personalized results.

Localization

Localization is the process of using the searcher’s location to enhance results. Meetup is a good example of this in action. Google also often prompts users to enable location services so it can deliver local results. It can also be extracted from the query text itself (see NLP above) as is typically done for real estate and job search engines.

Be warned that it can cause issues. For example, geolocation in Australia is less than 80% accurate, so you will put a significant number of people in the wrong city, possibly even one over 1000 miles away. These quirks are usually navigable, but design with caution and always account for a UX flow that assumes the location is incorrect. Meetup.com is an example where location is so critical to the search that it’s now deeply integrated into the search UX. It’s often worth thinking about the cost of a wrong guess. Query understanding is never going to be perfect, so often it’s better to make it easy for the user to self-select.
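
One simple way to degrade gracefully is to rank by distance rather than hard-filter, so a wrong geolocation guess still surfaces reasonable results; the haversine sketch below uses example coordinates.

```python
# Sketch of localization: rank results by haversine distance from the
# searcher's (possibly inaccurate) location instead of hard-filtering,
# so a wrong geolocation guess degrades gracefully. Coordinates are examples.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

RESULTS = [
    {"name": "Sydney meetup", "lat": -33.87, "lon": 151.21},
    {"name": "Melbourne meetup", "lat": -37.81, "lon": 144.96},
]

user_lat, user_lon = -33.9, 151.2  # example: searcher near Sydney
for hit in sorted(RESULTS, key=lambda h: haversine_km(user_lat, user_lon, h["lat"], h["lon"])):
    print(hit["name"])  # Sydney meetup first, Melbourne second
```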

Meetup.com integrates location selection directly into the search bar.

Voice search

Voice-based queries are on the rise, and that has only further increased the need for advanced query understanding techniques, mainly because web search originally trained us to shorten queries (more text = fewer results). Voice search has since increased query length and brought much more structure into the query. Ironically, voice transcription may get words wrong, but the spelling of what it transcribes is perfect, which raises other challenges.

Query understanding tips

Query logs are your friend and should help you to set the priorities for query understanding. For example:

  • Zero-result searches with high volume point to opportunities. Why are they failing? (See the sketch after this list for one way to mine them from query logs.)
  • If no one is using natural language and/or you don’t have nicely structured information, then query scoping may not be that useful.
  • If business language is not translating well to customer speak, synonyms may be useful.
  • Do your customers make spelling mistakes and how frequently? If they make lots of mistakes then corrections will undoubtedly help.
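
As suggested in the first tip, a minimal pass over query logs can surface high-volume, zero-result searches; the log format below (query, hit count) is an assumption made for illustration.

```python
# Sketch of mining query logs for high-volume, zero-result searches.
# The log format (query, nb_hits) is illustrative.
from collections import Counter

QUERY_LOG = [
    ("iphone 15 case", 124),
    ("ipone 15 case", 0),
    ("ipone 15 case", 0),
    ("blue runing shoes", 0),
]

zero_result_counts = Counter(q for q, nb_hits in QUERY_LOG if nb_hits == 0)

# Highest-volume failing queries first: prime candidates for synonyms,
# typo tolerance, or content fixes.
for query, count in zero_result_counts.most_common(5):
    print(f"{count:>4}  {query}")
```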

Advanced techniques like vectorization and personalization should come last, as they require much more effort. While personalized search seems like the holy grail, more often than not there is much more value to be extracted by getting the basics right. After all, personalizing bad results is not going to be useful.

About the author
Hamish Ogilvy

VP, Artificial Intelligence

