Query understanding is the process of analyzing search queries and translating them into an enhanced query that can produce better search results. It’s one of the most important keys to great search experiences.
Query rewriting, synonyms, spelling corrections, classification, NLP, vectorization, bigram, and trigram detection for query segmentation, semantic query understanding, personalization, localization, and query scoping (attribute mapping).
Absolutely critical. People make a lot of spelling mistakes and language is also inherently ambiguous (“test”, “mobile”, “apps”, “summer”, “north” are all nouns; “Jaguar” is a car, an operating system, an animal, etc.). People also use slang and mention things that aren’t necessarily ever mentioned in result item text at all (e.g. “size 14 shoes”, “near a park”, “next Thursday evening”, etc.). All of this needs to be translated into something meaningful that can better query the underlying data structure.
In addition, query understanding can also be used to prioritize results. For example, historically, in the US or Australia, when people search a government website for “license,” they were most often looking for a driver’s license renewal page — not the hundreds of other pages mentioning “license”. Historical performance data can automatically elevate the value of the preferred destination, even if there is no indication otherwise in the query itself that it is most important.
Query understanding is often a great test of search technology. What if you need to map “size 14” to a size attribute, or “next Thursday” evening to a time filter, or automatically spell correct queries? Can your current implementation do this? If not, then you are frustrating your customers and falling behind.
Many highly useful query understanding techniques, such as synonyms, spell correction, and semantics, belong to the family of query rewriting, which aims to modify the query to represent the query intent better and thus improve overall precision, recall, or both. The order in which these techniques are applied can significantly impact the query outcomes as well as the processing complexity. Below are some of those techniques most commonly used.
Synonyms are fairly straightforward. They are typically run as an alternate reality, so if someone searched for “gigantic shoes,” this may expand to “(gigantic OR enormous) shoes”. This is pretty simple but can be more complex if working out which to prioritize. It can also be even more complex if synonyms are generated using machine learning, e.g., via word embeddings. Newer semantic search capabilities, such as vector search, have a built-in understanding of similar, common terms, so there’s no need to create additional synonym libraries. We’ll touch more on this below.
Spell correction can work in different ways. One approach is to correct words by coming up with a single likeliest replacement and then allowing users to override it. Another approach is to be tolerant and allow up to two-letter spelling errors. This is called typo tolerance. However, the best alternative is not always to correct (language is ambiguous) and the penalty for being wrong is very high, as people may lose confidence and give up. Like synonyms, vector search solutions can help with many common misspellings.
Another approach to spell correction is to look at every possible variant and run all of them. This can explode very quickly and slow down queries. To retain query speed, a search engine would have to bake the prioritization of the ranking algorithm into the data structure itself at the expense of query flexibility. Depending on your application, this may or may not be a good tradeoff.
Our spell correction technique (as of this writing) is a combination of techniques. It looks at alternatives but can weigh them efficiently in terms of likelihood of the searcher’s intent using distance. This is very effective and is trained from not only query data, but also result items themselves — product names, titles, etc. It will understand your data immediately — in any language — and then continue to improve over time with more queries.
Query categorization helps the search engine to predict the most relevant category or selection of items. Search models will use previous searches and search data from your site to predict the likelihood that a query belongs to one category or another. In the case of Algolia’s Query Categorization, we use a vector-based semantic model to predict which classifications a search query is most likely to align with.
Semantic query understanding is the process of actually trying to understand the intent of queries. Language is very ambiguous by nature and the notion of polysemy aptly describes why this is such an issue for search: “poly” means multiple and “semy” in this case means senses, or meanings. “Bank” is a classic example of this; does this mean a financial institution or the side of a river? Without added context, it is difficult to know. English, in particular, is littered with examples. Luckily there are methods to handle this.
For queries with multiple words, the context is typically more obvious. For single search terms, it is more difficult, but in these cases, the past query sequence history can be useful. For example, if someone searched for “atm” and then searched for “bank”, it would be unlikely the second query was to do with the side of a river!
Classification can also be used to understand the type of query intent. This is most useful for query boxes that search datasets with multiple distinct data types. An example is LinkedIn, where you can search for companies, people, jobs, etc. The historical patterns for each query can assist with predicting the most likely result type. Ecommerce is another example, where a support query is different from a product search. For example, searching for “return shoes” should include results for returns, not display shoes for purchasing.
NLP is the process of analyzing unstructured text to infer structure and meaning. Structure, in this case, is referring to information that is highly defined, for example, a category or a number, much like fields in a database. It can also represent relationships between things. Common examples include sizes, colors, places, names, times, entities, and intent, but there are many more. NLP is most valuable when the underlying data has a lot of structure that can be mapped from the queries.
NLP will also parse the query to make it simpler to understand. A common example is the use of stemming to revert terms to their roots — “running” and “ran” become “run”. By mapping words to their stems, we can identify synonyms to improve lookup. NLP will also have a list of stop words — words such as “the”, “is”, “and”, and other commonly used terms — to reduce noise from the query.
Query scoping can make ordinary text search appear highly intelligent. This technique attempts to find structure within the query that doesn’t necessarily match unstructured text (reverse indexes) but instead maps directly to structured data attributes.
An example may be the query “black size 14 basketball shoes”. Product information is highly unlikely to mention the text “size 14”, but the sizing is highly likely to appear in a list of sizes on relevant products. “Black” as color may be in the description, but may also map to a more specific color attribute. So this query may actually remove this text and query only for “basketball shoes”, but at the same time filter the result set to size=14 and color=black. This dramatically increases precision, and in the case of a strict AND based search can also increase recall.
The downside of the above technique is for understanding queries with potentially mixed meaning. For the example query “size 6 nike”, the sizing may refer to shoe size which could be men’s, women’s, kid’s, US, UK or other, a shirt size, sports bra size, or even a basketball size! Over-filtering (or over-boosting certain results) can cause a significant loss of precision.
Despite the potential issues, the upside far outweighs any downsides. The typical approach to implementing this technique is to remove specific text identified from the query and convert it into a structured operation (i.e., filter or boost). This allows people to use natural language to describe what they want and have it transitioned into something much more meaningful than the original unstructured query.
Lastly, if a structure is not present on your records, that doesn’t mean you can’t generate it yourself. Index time analysis and data extraction can be a powerful tool to add structure to your information. Clustering, classification, topic modeling, tagging, and entity extraction are just some of the powerful techniques available.
Vectorization is the process of converting words into vectors (numbers) which allows their meaning to be encoded and processed mathematically. This underpins language translation and many other amazing applications. The magic in the vectors is they can also be added and subtracted, so the meaning of the text can also be added and subtracted. In practice, vectors are used for automating synonyms, clustering documents, detecting specific meanings, and intents in queries and ranking results.
Query segmentation is the process of identifying sequences of tokens that mean much more together than they do individually. This assists in improving precision by not returning results for partial segments. Two and three-word phrases (bigram and trigrams in this context) are useful in that they can potentially have much more meaning than individual words (unigrams).
For example, take the phrase “new york”. On their own, these two words don’t mean much, but together they obviously do! So instead of treating them on their own, they can be combined into a single term which can even have its reverse index. We could look up the indexes of both terms and work out when they are in sequence to derive the phrase matches but a) this is more complex and b) we possibly don’t want people searching for “new” to get results with “new york”. Also, if people search for “new” it is possible to match results containing “new york”. This is also problematic for reinforcement learning and other techniques where we don’t want machine learning to penalize partial matches unfairly. Hopefully, this hasn’t lost you, but the key point is maximizing information context.
Languages like English and French use a space to separate words, but languages without separators such as Chinese, Japanese, and Korean need an additional segmentation algorithm. At Algolia, we also build dictionaries into our libraries. Starting from the left of the question, Algolia will attempt to break down the query using known dictionary terms. In the event of uncertainty, we prioritize the solution that maximizes the length of the words while minimizing the number of characters that do not belong to a known word.
Personalization is the process of adding additional information to a query based on the individual that is searching to change the results to be more relevant to the individual. This can be as simple as a clothing size preference, gender, location, or any other individual characteristic that can influence their results.
Typically, the first stage of personalization is sending the information through with the queries to be recorded with other analytics. The performance impact can then be analyzed offline to determine if this is valuable. It should be noted that this is useful in its own right from a business intelligence perspective, even if it is not used for personalization.
Localization is the process of using the searcher’s location to enhance results. Meetup is a good example of this in action. Google also often prompts users to enable location services so it can deliver local results. It can also be extracted from the query text itself (see NLP above) as is typically done for real estate and job search engines.
Be warned it can cause issues. For example, geolocation in Australia is less than 80% accurate, so you will put a big number of people in the wrong random city, possibly even over 1000 miles away. These quirks are usually navigable, but design with caution and always account for the UX flow assuming the location is incorrect. Meetup.com is an example where the location is so critical to the search that it’s now deeply integrated into the search UX. It’s often worth thinking about the cost of a wrong guess. Query understanding is never going to be perfect, so often it’s better to make it easy for the user to self-select.
Voice-based queries are on the rise, and that’s only further increased the need for advanced query understanding techniques, mainly because web search originally trained us to shorten queries (more text = fewer results). But voice search has since increased the query length and included much more structure in the query. Ironically, voice translation may get words wrong, but the spelling is perfect, so this raises other challenges.
Query logs are your friend and should help you to set the priorities for query understanding. For example:
Advanced techniques like vectorization and personalization should come last as they require much more effort. While personalized search seems like the holy grail more often than not, there is much more value to be extracted by getting the basics right. After all, personalizing bad results is not going to be useful.
Hamish Ogilvy
VP, Artificial IntelligencePowered by Algolia AI Recommendations