NLP and NLU are two (often confused) technologies that make search more intelligent. They ensure that people can search and find what they want without having to type the exact words as they appear on a page or in a product.
NLP and NLU are why you can type “dresses” and find that long sought-after “NYE Party Dress”, and why you can type “Matthew McConnahey” and get Mr. McConnaughey back.
NLP stands for “natural language processing.” It’s one of those terms that has built up such a large meaning that it’s easy to look past the fact that it tells you exactly what it is: NLP processes natural language, specifically into a format that computers can understand. This processing can include tasks such as normalization, spelling correction, or stemming, each of which we’ll look at in more detail.
NLU stands for “natural language understanding.” This technology aims to “understand” what a block of natural language is communicating. It performs tasks that can, for example, identify verbs and nouns in sentences or important items within a text. People or programs can then use this information to complete other tasks.
Computers seem advanced because they can do a lot of actions in a short period of time. However, in a lot of ways, computers are quite daft. They need information to be structured in specific ways to build upon it. For natural language data, that’s where NLP comes in, because it takes messy data (and natural language can be very messy) and processes it into something that computers can work with.
When searchers type text into a search bar, they are trying to find a good match, not play “guess the format.” It would be unfair and unproductive, for example, to require a user to type a query in exactly the same format as the matching words in a record. We use text normalization to do away with this requirement so that the text will be in a standard format no matter where it’s coming from.
What we’ll see as we go through different normalization steps is that there is no approach that everyone follows. Each normalization step generally increases recall and decreases precision.
A quick aside: “recall” means that a search engine finds results that are known to be good. Precision means that a search engine finds only good results. Search results could have 100% recall by returning every single document in an index, but precision would be poor. Conversely, a search engine could have 100% precision by only returning documents that it knows for sure to be a perfect fit, but some good results will likely be missed.
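To make the two measures concrete, here is a minimal sketch of how they could be computed for a single query. The document IDs and the `precision_recall` helper are made up for illustration, and returning 0.0 for an empty result set is just a convention chosen here.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: IDs the engine returned; relevant: IDs known to be good.
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: what share of the returned results are good?
    precision = len(hits) / len(returned) if returned else 0.0
    # Recall: what share of the good results were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ten-document index with two known-good documents.
index = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
relevant = {"d1", "d2"}

# Returning the whole index: perfect recall, poor precision.
print(precision_recall(index, relevant))   # (0.2, 1.0)
# Returning one sure hit: perfect precision, but half the recall.
print(precision_recall({"d1"}, relevant))  # (1.0, 0.5)
```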
Again, normalization generally increases recall and decreases precision. Whether that movement towards one end of the recall-precision spectrum is valuable depends on the use case and the search technology, so it isn’t a question of applying all normalization techniques, but instead deciding which ones provide the best balance of precision and recall.
The simplest normalization you could imagine would be the handling of letter case. In English, at least, words are generally capitalized at the beginning of sentences, occasionally in titles, and when they are proper nouns. (There are other rules, too, depending on whom you ask.) But in German, all nouns are capitalized. Other languages have their own rules.
These rules are useful; otherwise, we wouldn’t follow them. For example, capitalizing the first words of sentences helps us quickly see where sentences begin. That usefulness, however, is diminished in an information retrieval context. The meanings of words don’t change simply because they are in a title and have their first letter capitalized.
Even trickier is that there are rules, and then there is how people actually write. If I text my wife, “SOMEONE HIT OUR CAR!”, we all know that what I’m talking about is a car, and not something different because the word is capitalized. We can see this clearly by reflecting on how many people don’t use any capitalization at all when communicating informally. That is, incidentally, how most case normalization works: everything is converted to lowercase.
Of course, we know that sometimes capitalization does change the meaning of a word or phrase. We can see that “cats” are animals, and “Cats” is a musical. In most cases, though, the precision gained by not normalizing case is outweighed by the recall that is lost. The difference between the two is usually easy to tell from context, too, which we’ll be able to leverage through natural language understanding.
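As a rough illustration, case normalization can be as small as one call applied to both the query and the document text. This sketch uses Python's built-in `str.casefold()`; the `normalize_case` name is only for this example.

```python
def normalize_case(text):
    # casefold() is a more aggressive lower() intended for caseless
    # matching; for example, it maps the German "ß" to "ss".
    return text.casefold()


print(normalize_case("SOMEONE HIT OUR CAR!"))                 # someone hit our car!
print(normalize_case("Cats") == normalize_case("cats"))       # True
print(normalize_case("Straße") == normalize_case("STRASSE"))  # True
```

Note that the second comparison shows the precision cost as well: after normalization, the musical and the animal are indistinguishable.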
While less common in English, handling diacritics is also a form of letter normalization. Diacritics are the marks, or “glyphs,” attached to letters, as in á, ë, or ç. Words can be otherwise spelled the same, but added diacritics can change the meaning. In French, “élève” means “student,” while “élevé” means “elevated.” Nonetheless, many people will not include the diacritics when searching, and so another form of normalization is to strip all diacritics, leaving behind the simple (and now ambiguous) “eleve.”
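One common way to strip diacritics, sketched here with Python's standard `unicodedata` module, is to decompose each character and drop the combining marks. The `strip_diacritics` name is an assumption for this example, and letters that have no decomposition (such as “ø”) pass through unchanged.

```python
import unicodedata


def strip_diacritics(text):
    # NFD splits each accented letter into a base letter followed by
    # one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("élève"))  # eleve
print(strip_diacritics("élevé"))  # eleve
```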
The next normalization challenge is how to break down the text the searcher has typed in the search bar and the text in the document. This step is necessary because word order does not need to be exactly the same between the query and the document text, except when a searcher wraps the query in quotes.
Breaking queries, phrases, and sentences into words may seem like a simple task: just break up the text at each space. Problems show up quickly with this approach. Again, let’s start with English. Separating on spaces alone means that the phrase “Let’s break up this phrase!” yields us let’s, break, up, this, and phrase! as words.
For search, we almost surely don’t want the exclamation point at the end of the word “phrase.” Whether we want to keep the contracted word “let’s” together is not as clear. Some software will break the word down even further (“let” and “’s”) and some won’t. Some will not break down “let’s” while breaking down “don’t” into two pieces.
This process is called “tokenization.” We call it tokenization for reasons that should now be clear: what we end up with are not words but discrete groups of characters. This is even more true for languages other than English.
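The difference between splitting on spaces and a slightly smarter tokenizer can be sketched as follows. The regular expression is one possible choice, not a standard: it keeps contractions such as “let’s” together and drops surrounding punctuation, which is only one of the behaviors described above.

```python
import re

# Runs of word characters, optionally joined by a straight or curly
# apostrophe, so "let's" stays one token and "phrase!" loses its "!".
TOKEN_RE = re.compile(r"\w+(?:['’]\w+)*")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


sentence = "Let’s break up this phrase!"
print(sentence.split())    # ['Let’s', 'break', 'up', 'this', 'phrase!']
print(tokenize(sentence))  # ['let’s', 'break', 'up', 'this', 'phrase']
```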
German speakers, for example, can merge words (more accurately “morphemes,” but close enough) together to form a larger word. The German word for “dog house” is “Hundehütte,” which contains the words for both “dog” (“Hund”) and “house” (“Hütte”).
Nearly all search engines tokenize text, but there are further steps an engine can take to normalize the tokens. Two related approaches are stemming and lemmatization.
Stemming and lemmatization take different forms of tokens and break them down so that they can be compared. For example, take the words “calculator” and “calculation,” or “slowing” and “slowly.” We can see there are some clear similarities.
Stemming breaks a word down to its “stem,” or what other variants of the word are based off of. Stemming is fairly straightforward; you could do it on your own. What’s the stem of “stemming?” You can probably guess that it’s “stem.” Often stemming means removing prefixes or suffixes, as in this case.
There are multiple stemming algorithms, and the most popular is the Porter Stemming Algorithm, which has been around since the 1980s. It is a series of steps applied to a token to get to the stem.
Stemming can sometimes lead to results that you wouldn’t foresee. Looking at the words “carry” and “carries,” you might expect that the stem of each of these is “carry.” The actual stem, at least according to the Porter Stemming Algorithm, is “carri.” This is because stemming attempts to be able to compare related words, and breaks down words to their smallest possible parts in order to do so, even if that part is not a word itself.
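To see where “carri” comes from, here is a sketch of just two of the algorithm's steps (1a and 1c, as described in Porter's original paper). It is deliberately partial: the full algorithm has several more steps, and library implementations sometimes adjust these rules, so results for other words can differ. The function names are made up for this example.

```python
def _has_vowel(stem):
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            return True
        # Porter treats "y" as a vowel when it follows a consonant.
        if ch == "y" and i > 0 and stem[i - 1] not in "aeiou":
            return True
    return False


def porter_step_1a(word):
    # Plural-like endings, checked from longest suffix to shortest.
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # carries -> carri
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


def porter_step_1c(word):
    # A final "y" becomes "i" when the rest of the word contains a vowel.
    if word.endswith("y") and _has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


def partial_stem(word):
    return porter_step_1c(porter_step_1a(word.lower()))


print(partial_stem("carry"))    # carri
print(partial_stem("carries"))  # carri
```

Both words end up as the same token, which is the point of stemming, even though that token is not a word.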
On the other hand, if you want an output that will always be a recognizable word, then you want lemmatization. Again, there are different lemmatizers, such as NLTK using WordNet.
Lemmatization breaks a token down to its “lemma,” or the word which is considered the base for its derivations. The lemma from WordNet for “carry” and “carries,” then, is what we expected before: “carry.”
Lemmatization will generally not break down words as much as stemming, nor will as many different word forms be considered the same after the operation. The stems for “say,” “says,” and “saying” are all “say,” while the lemmas from WordNet are “say,” “say,” and “saying.” In order to get these lemmas, lemmatizers are generally corpus based.
If you want the broadest recall possible, you’ll want to use stemming. If you want the best possible precision, use neither stemming nor lemmatization. Which you go with ultimately depends on your goals, but most searches can generally perform very well with neither stemming nor lemmatization, retrieving the right results and not introducing noise.
If you decide not to include lemmatization or stemming in your search engine, there is still one normalization technique that you should consider. That is the normalization of plurals to their singular form.
Generally, ignoring plurals is done through the use of dictionaries. Even if “de-pluralization” seems as simple as chopping off an “-s,” that’s not always the case. The first problem is with irregular plurals, such as “deer,” “oxen,” and “mice.” A second problem is with pluralization that happens with an “-es” suffix, such as “potatoes.” Finally, there are simply the words that end in an “s” but aren’t plural, like “always.” A dictionary-based approach ensures that you increase recall without introducing incorrect matches.
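A minimal sketch of the dictionary approach, next to the naive rule it replaces, might look like this. The `SINGULARS` entries are hand-picked for illustration; a real dictionary would be far larger.

```python
# Hypothetical, hand-picked entries for illustration only.
SINGULARS = {
    "dresses": "dress",
    "potatoes": "potato",
    "mice": "mouse",
    "oxen": "ox",
    "shoes": "shoe",
}


def naive_singular(token):
    # The tempting shortcut: chop off a trailing "s".
    return token[:-1] if token.endswith("s") else token


def dictionary_singular(token):
    # Tokens missing from the dictionary are left untouched, so words
    # like "always" and "deer" are never mangled.
    return SINGULARS.get(token, token)


for word in ["potatoes", "mice", "always"]:
    print(word, naive_singular(word), dictionary_singular(word))
# potatoes potatoe potato
# mice mice mouse
# always alway always
```

Removing a pair that causes problems is then a one-line change to the dictionary.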
Just as with lemmatization and stemming, whether you normalize plurals is dependent on your goals. Cast a wider net by normalizing plurals, a more precise one by avoiding normalization. Usually, normalizing plurals is the right choice, and you can remove normalization pairs from your dictionary when you find them causing problems.
One area, however, where you will almost always want to introduce increased recall is when handling typos.
We have all encountered typo tolerance and spell check within search, but it’s useful to think about why it’s present. Sometimes, there are typos because fingers slip and hit the wrong key. Other times, the searcher thinks a word is spelled differently than it is. Increasingly, “typos” can also be a result of poor speech-to-text understanding. Finally, words can seem like they have typos but really don’t, such as in comparing “scream” and “cream.”
The simplest way to handle these typos, misspellings, and variations is to avoid trying to correct them at all. Instead, there are algorithms that compare tokens and measure how similar they are. One of these is the Damerau-Levenshtein distance algorithm.
This measure looks at how many edits would be needed to go from one token to another. You can then filter out all tokens with a distance that is too high. (Two is generally a good threshold, but you will probably want to adjust this based on the length of the token.) After filtering, you can use the distance for sorting results, or to feed into a ranking algorithm.
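The filter-then-sort idea can be sketched as below. The distance function implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), which counts insertions, deletions, substitutions, and swaps of adjacent characters. The length cutoffs in `max_edits` are assumptions picked for this example, not recommended values.

```python
def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def max_edits(token):
    # Hypothetical thresholds: shorter tokens tolerate fewer edits.
    if len(token) < 4:
        return 0
    if len(token) < 8:
        return 1
    return 2


def fuzzy_matches(query_token, index_tokens):
    limit = max_edits(query_token)
    scored = [(osa_distance(query_token, t), t) for t in index_tokens]
    # Filter out distant tokens, then sort the closest first.
    return sorted((dist, t) for dist, t in scored if dist <= limit)


print(osa_distance("teh", "the"))  # 1
print(fuzzy_matches("mcconnahey", ["mcconnaughey", "mcdonald", "mcconnell"]))
# [(2, 'mcconnaughey')]
```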
Many times, context can matter when determining if a word is misspelled or not. The word “scream” is probably correct after “I,” but not after “ice.” Machine learning can be a solution for this, by bringing context to this NLP task. This spell check software can use the context around a word to identify whether it is likely to be misspelled, and what its most likely correction is.
One important thing that we skipped over before is that words are not only misspelled when a user types them into a search bar. Words may also be misspelled inside a document. This is especially true when the documents are made up of user-generated content.
This detail is relevant because it means that if a search engine is only looking at the query for typos, it is missing half of the information. The best typo tolerance should work across both query and document, and this is why edit distance generally works best for retrieving and ranking results. Spell check can be used to craft a better query or provide feedback to the searcher, but it is often unnecessary, and should never stand alone.
While NLP is all about processing text and natural language, NLU is about understanding that text.
A task that can aid in search is that of named entity recognition, or NER. NER identifies key items, or “entities,” inside of text. While some people will call NER natural language processing and others will call it natural language understanding, what’s clear is that it can find what’s important within a text.
For the query “NYE party dress” you would perhaps get back an entity of “dress” that is mapped to a type of “category.” NER will always map an entity to a type, from as generic as “place” or “person,” to as specific as your own facets.
NER can also use context to identify entities. A query of “white house” may refer to a place, while “white house paint” might refer to a color of “white” and a product category of “paint.”
Named entity recognition is valuable in search because it can be used in conjunction with facet values to provide better search results.
Recalling the “white house paint” example, you can use the “white” color and the “paint” product category to filter down your results to only show those that match those two values. This would give you high precision. If you don’t want to go that far, you can simply boost all products that match one of the two values.
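Both options, filtering and boosting, can be sketched in a few lines. The catalog, the facet names, and the `entities` dictionary (standing in for what an NER step might return for “white house paint”) are all hypothetical.

```python
# Hypothetical catalog, for illustration only.
PRODUCTS = [
    {"name": "Exterior House Paint", "color": "white", "category": "paint"},
    {"name": "Eggshell Interior Paint", "color": "eggshell", "category": "paint"},
    {"name": "White Picket Fence Panel", "color": "white", "category": "fencing"},
]

# Assumed NER output for the query "white house paint".
entities = {"color": "white", "category": "paint"}


def matched_facets(product, entities):
    # Count how many recognized entities match the product's facet values.
    return sum(1 for facet, value in entities.items() if product.get(facet) == value)


def filter_by_entities(products, entities):
    # High precision: keep only products matching every entity.
    return [p for p in products if matched_facets(p, entities) == len(entities)]


def boost_by_entities(products, entities):
    # Softer option: keep everything, but rank better matches first.
    return sorted(products, key=lambda p: matched_facets(p, entities), reverse=True)


print([p["name"] for p in filter_by_entities(PRODUCTS, entities)])
# ['Exterior House Paint']
print([p["name"] for p in boost_by_entities(PRODUCTS, entities)])
# ['Exterior House Paint', 'Eggshell Interior Paint', 'White Picket Fence Panel']
```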
Query categorization can also help with recall. For searches where there are a low number of results, you can use the entities to include related products. Imagine that there are no products that match the keywords “white house paint.” In this case, leveraging the product category of “paint” can return other paints that might be a decent alternative, such as that nice eggshell color.
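A sketch of that fallback, assuming a naive keyword matcher and an already-extracted set of entities, could look like the following. The catalog and function names are made up for this example.

```python
# Hypothetical catalog, for illustration only.
PRODUCTS = [
    {"name": "Eggshell Interior Paint", "category": "paint"},
    {"name": "Navy Exterior Paint", "category": "paint"},
    {"name": "White Picket Fence Panel", "category": "fencing"},
]


def keyword_search(products, keywords):
    # Naive matching: every keyword must appear in the product name.
    return [p for p in products if all(k in p["name"].lower() for k in keywords)]


def search_with_fallback(products, keywords, entities):
    results = keyword_search(products, keywords)
    if results:
        return results
    # No keyword matches: widen the search to the detected category.
    category = entities.get("category")
    return [p for p in products if p["category"] == category]


results = search_with_fallback(
    PRODUCTS,
    ["white", "house", "paint"],
    {"color": "white", "category": "paint"},  # assumed NER output
)
print([p["name"] for p in results])
# ['Eggshell Interior Paint', 'Navy Exterior Paint']
```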
Another way that named entity recognition can help with search quality is by moving the task from query time to ingestion time (when the document is added to the search index). When ingesting documents, NER can use the text to tag those documents automatically.
These documents will then be easier to find for the searchers. Either the searchers use explicit filtering, or the search engine applies automatic query-categorization filtering, to enable searchers to go directly to the right products using facet values.
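Ingestion-time tagging might be sketched like this. A simple vocabulary lookup stands in for a trained NER model here, and the facet names and values in `FACET_VALUES` are assumptions for the example.

```python
import re

# Hypothetical facet vocabularies; a trained NER model would replace
# this lookup in a real pipeline.
FACET_VALUES = {
    "color": {"white", "red", "eggshell"},
    "category": {"paint", "dress", "shoes"},
}


def tag_document(doc):
    # Run once when the document is added to the index.
    tokens = set(re.findall(r"\w+", doc["text"].lower()))
    tags = {}
    for facet, values in FACET_VALUES.items():
        found = sorted(tokens & values)
        if found:
            tags[facet] = found
    # Return a copy of the document with the tags attached.
    return {**doc, "tags": tags}


doc = tag_document({"id": 1, "text": "Red NYE Party Dress"})
print(doc["tags"])  # {'color': ['red'], 'category': ['dress']}
```

Because the tags are stored with the document, both explicit filters and automatic query categorization can use them at query time without re-reading the text.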
Related to entity recognition is intent detection, or determining the action that a user wants to take.
This is not the same as what we talk about when we say identifying searcher intent. Identifying searcher intent is getting people to the right content that they want at the right time.
Intent detection maps a request to a specific, pre-defined intent and then takes an action based on that intent. A user searching for “how to make returns” might trigger the “help” intent, while “red shoes” might trigger the “product” intent. In the first case, you could route the search to your help desk search, and in the second one, to the product search. This isn’t so different from what you see when you search for the weather on Google and you get a weather box at the very top of the page. (Newly launched web search engine Andi takes this concept to the extreme, bundling search in a chatbot.)
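The routing described here can be sketched with a small rule-based detector. Production systems often use a trained classifier instead; the trigger phrases, intent names, and route names below are all hypothetical.

```python
# Hypothetical trigger phrases for the "help" intent.
HELP_PHRASES = ("how to", "how do i", "return", "refund", "shipping")

# Hypothetical mapping from intent to the search backend that handles it.
ROUTES = {"help": "help_desk_search", "product": "product_search"}


def detect_intent(query):
    q = query.lower()
    # Substring check, so "return" also matches "returns".
    if any(phrase in q for phrase in HELP_PHRASES):
        return "help"
    # Anything else falls back to the product intent.
    return "product"


def route(query):
    return ROUTES[detect_intent(query)]


print(route("how to make returns"))  # help_desk_search
print(route("red shoes"))            # product_search
```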
For most search engines, intent detection as outlined here isn’t necessary. Most search engines only have a single content type on which to search at a time. When there are multiple content types, federated search can perform admirably by showing multiple search results in a single UI at the same time.
There are plenty of other NLP and NLU tasks, but these are usually less relevant to search. Tasks like sentiment analysis can be useful in some contexts, but search isn’t one of them. You could imagine using translation to search multi-language corpuses, but it rarely happens in practice, and is just as rarely needed.
Question answering is an NLU task that is increasingly implemented into search, especially search engines that expect natural language searches. Once again, you can see this on major web search engines. Google, Bing, and Kagi will all immediately answer the question “how old is the Queen of England?” without needing to click through to any results.
Some search engine technologies have explored implementing question answering for more limited search indices, but outside of help desks or long, action-oriented content, the usage is limited. Few searchers are going to an online clothing store and asking questions to a search bar.
Summarization is an NLU task that is more useful for search. Much like with the use of NER for document tagging, automatic summarization can enrich documents. Summaries can be used to match documents to queries, or to provide a better display of the search results. This better display can help searchers be confident that they have gotten good results, and get them to the right answers more quickly.
Even including newer search technologies using images and audio, the vast, vast majority of searches happen with text. To get the right results, it’s important to make sure that the search is processing and understanding both the query and the documents. NLP and NLU tasks like tokenization, normalization, tagging, typo tolerance, and others can help make sure that searchers don’t need to be search experts, but instead can go from need to solution “naturally” and quickly.
Dustin Coates
Product and GTM Manager