Feb 17th 2022 engineering
Language is a funny thing. For example, we take for granted that Cinderella wore glass slippers. Only in a fairy tale can people walk in glass shoes. But maybe it’s metaphorical – the fragility of being Cinderella. Or maybe it’s simply a mistranslation of the Old French word “vair” (squirrel fur) as the modern French “verre” (glass).
Language is also hard to pin down. Especially when lost in translation. But no need to despair – sometimes what gets lost can have surprising and moving results.
But we don’t always want to be surprised. Like when we ask direct questions or search for specific items that match our queries. At that point, we aspire to be crystal-clear.
That’s where dictionaries come in. Dictionaries allow us to be clear, by reinforcing the clarity of each word within the context of the larger phrase. We use dictionaries to reinforce our natural language processing (NLP). Here’s how.
Many users still type questions like “What is the best search engine?” instead of the shorter “best search engine”. It’s natural for them to type the way they speak. But other people prefer to use shorter, incomplete phrases to return the same results. With advances in search technology, even nonsensical queries, such as “engine best search”, return great results.
Nevertheless, the full phrase is still in fashion – even more so with voice. The success of voice search depends on allowing people to speak naturally. And stop words are key to that. Stop words reduce a natural phrase to its bare essence: keywords. By dropping such words as “what”, “is”, and “the” from the above query, and leaving only the keywords “best”, “search”, and “engine”, the search engine can match the query to the underlying data in a more reliable and relevant way.
Granted, all words are important – “What” and “Why” are indeed meaningful distinctions – but if a search algorithm relies on textual matching (as opposed to matching on meaning or semantics), its only job is to compare characters and words. By removing stop words, therefore, you remove the false positives that match on the word “the”.
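As a rough illustration of stop-word removal, here is a minimal Python sketch. The three-word stop list and the function name `remove_stop_words` are assumptions made for this example; a real stop-word dictionary is far larger, and this is not the engine’s actual implementation.

```python
# Hypothetical, hand-picked stop-word list, for illustration only.
STOP_WORDS = frozenset({"what", "is", "the"})


def remove_stop_words(query: str, stop_words=STOP_WORDS) -> list:
    """Lowercase the query, strip basic punctuation, and drop stop words."""
    tokens = [token.strip("?!.,").lower() for token in query.split()]
    return [token for token in tokens if token and token not in stop_words]


print(remove_stop_words("What is the best search engine?"))
# → ['best', 'search', 'engine']
```

Only the keywords survive, so a purely textual matcher no longer produces false positives on words like “the”.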
We can say the same about normalization (e.g., removing accents) and plurals. Any search algorithm that focuses on text, and not on the meaning of text, should ignore textual variations (like plurals) to make word matching more relevant and less ambiguous.
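A small Python sketch of both ideas, under stated assumptions: accent removal uses the standard `unicodedata` module, and the plural groups are a tiny hypothetical table in which the first form of each group is arbitrarily chosen as the canonical one.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical plural groups; the first form acts as the canonical one.
PLURALS = [["slipper", "slippers"], ["foot", "feet"]]
CANONICAL = {form: forms[0] for forms in PLURALS for form in forms}


def normalize_token(token: str) -> str:
    """Lowercase, remove accents, then map plural variants to one form."""
    token = strip_accents(token.lower())
    return CANONICAL.get(token, token)


print(strip_accents("voilà"))       # → voila
print(normalize_token("Slippers"))  # → slipper
```

With this in place, “feet” and “foot” normalize to the same token, so a textual comparison treats them as a match.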
Lastly, textual matching also needs to separate words into useful parts. A “houseboat” is not just a house or a boat, but a boat specially made to be used as a house. To help reach that level of precision, a textual search algorithm needs to break a word into its constituent parts (atoms), using techniques like segmentation and decompounding.
The goal of segmentation or decompounding is not to understand the meaning of words, but to find out what a complex word can be decomposed into. We’re trying to find the “atoms” of the word. We don’t use decompounding in English because most words are already separated by spaces; it’s in the language’s DNA. The same goes for French. German is different: for example, “Hundehütte”, meaning “dog kennel”, is composed of “Hund” (dog) and “Hütte” (kennel/house). The space that English already puts between the two words is why it doesn’t need decompounding. Segmentation is essentially the same thing, but for languages that put no spaces between words at all (e.g., Chinese and Japanese).
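To make the idea of “atoms” concrete, here is a simplified dictionary-based decompounding sketch in Python. The atom list, the set of linking letters, and the longest-match-first strategy are all assumptions made for illustration; production decompounders rely on much larger dictionaries and more careful rules.

```python
# Hypothetical atom dictionary and linking letters, for illustration only.
ATOMS = frozenset({"hund", "hütte"})
LINKERS = ("", "e", "s")  # optional letters that may join two atoms


def _split(rest: str, atoms):
    """Return a list of atoms covering `rest`, or None if there is none."""
    for end in range(len(rest), 0, -1):  # try the longest prefix first
        head, tail = rest[:end], rest[end:]
        if head not in atoms:
            continue
        if not tail:
            return [head]
        for linker in LINKERS:
            if tail.startswith(linker) and len(tail) > len(linker):
                parts = _split(tail[len(linker):], atoms)
                if parts is not None:
                    return [head] + parts
    return None


def decompound(word: str, atoms=ATOMS) -> list:
    """Split a compound into known atoms; fall back to the whole word."""
    word = word.lower()
    return _split(word, atoms) or [word]


print(decompound("Hundehütte"))  # → ['hund', 'hütte']
```

A word that cannot be fully covered by known atoms is returned unchanged, which keeps the matcher from inventing splits.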
That’s where dictionaries come in.
One approach to natural language processing is to use dictionaries, such as a stop-word dictionary, plurals dictionary, and a compound-word dictionary. For example, you can parse a downloaded list of stop words from Wiktionary, not only in English, but in many other languages.
Here’s the process we used:
We do this for every language, and it works fairly well for most use cases. But when it doesn’t work, it breaks relevance – which is a showstopper for search engines. Here are some problems we encountered:
“Down” is a reasonable stop word, except when you’re searching for “down jackets”. Companies that sell “leather”, “suede”, and “down” jackets cannot remove “down” from the query.
Languages that use accents, like French and Spanish, fare well when normalized with accent removal. For example, “voilà” to “voila” causes no loss in meaning. In fact, it’s rare in French that removing an accent would create an ambiguity. German is not so lucky. For example, the accented “ä”, when normalized to “a”, will change the meaning of some words.
A curious example of this is the German word “wählen”, meaning “to choose” in English. If you remove the umlaut, most people will not object – except for the 1,500 residents of the small German-speaking Swiss town of Wahlen. It might be hard to find the town “Wahlen” among the many results that match on “wählen”, which could hurt tourism in that part of the world.
The solution is to apply a custom normalization for German. In this case, normalize “ä” to “ae”. Here’s a complete list:
ä → ae
ö → oe
ü → ue
Ä → Ae
Ö → Oe
Ü → Ue
ß → ss (or SZ for capital)
But this leads to a second problem, which illustrates the gymnastics search engines go through when dealing with languages. (Remember, language is funny…) We now normalize “für” to “fuer”, but in doing so we lose the stop word “fur”, because the normalized “fuer” no longer matches it.
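The German mapping and its side effect can be sketched in a few lines of Python. The function name `normalize_german` and the one-word stop list are assumptions for this example; the mapping itself is the one listed above, using the lowercase “ss” variant for “ß”.

```python
# German-specific normalization table, taken from the list above.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def normalize_german(text: str) -> str:
    """Replace umlauts and eszett with their two-letter equivalents."""
    return text.translate(GERMAN_MAP)


# Hypothetical stop list holding the accent-stripped form.
STOP_WORDS = {"fur"}

print(normalize_german("wählen"))  # → waehlen
print(normalize_german("Wahlen"))  # → Wahlen
print(normalize_german("für") in STOP_WORDS)  # → False
```

The verb and the town now stay distinct, but the last line shows the new problem: “fuer” no longer hits the stop-word entry.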
That’s where custom dictionaries come in.
We realized that one dictionary per category wasn’t enough: we needed an additional dictionary per customer, which they could use to override the Wiktionary defaults or add their own words. So now we have two dictionaries per category (stop words, plurals, etc.): one per language, which we ship with our software, and one custom dictionary per customer, which they can add words to. Adding custom dictionaries required some refactoring of how we dealt with our standard dictionaries, because each dictionary-retrieval function had a different interface and each dictionary dataset had a different format. So the first step was to normalize our code and data.
We examined the current dictionaries that we shipped out to our customers as part of our base product. We wanted to abstract the similarity of every dictionary. Since they all had the same kind of data and goals, we were able to do the following:
To put these dictionaries into a single interface, the main tasks included (in this order):
In the end, our new format for the plurals dictionary file is:
[2-letter language code]=[word1,word2,..]
In keeping with our introduction, here’s a good example of plurals:
en=feet,feets,foot,foots
en=slipper,slippers
en=squirrel,squirrels
en=fur,furs
en=Cinderella,Cinderellas
en=Cinderfella,Cinderfellas
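A file in this format is straightforward to load. The following Python sketch is one possible reader, not the parser we ship; the function name `parse_plurals` and the choice to skip malformed lines silently are assumptions.

```python
from collections import defaultdict


def parse_plurals(lines) -> dict:
    """Group plural forms by language code from 'code=word1,word2' lines."""
    groups = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        lang, _, words = line.partition("=")
        forms = [word.strip() for word in words.split(",") if word.strip()]
        if forms:
            groups[lang.strip()].append(forms)
    return dict(groups)


sample = ["en=feet,feets,foot,foots", "en=slipper,slippers"]
print(parse_plurals(sample))
# → {'en': [['feet', 'feets', 'foot', 'foots'], ['slipper', 'slippers']]}
```

Each line becomes one group of equivalent forms, keyed by its language code.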
That’s the first part: unifying both the interface and structure of the data.
With that, we achieved the following goals:
Now that we had a single interface for every dictionary, we were able to integrate customer-defined words for every NLP technique, for example, customer-specific stop words (see “down” example above), customer-specific normalization (see “für” example above), and so on.
These custom dictionaries are added to the index on top of the static dictionaries. We prioritized the dictionary lookups: a query first consults the custom dictionary before the static one. If the word is found, then the engine doesn’t need to look at the static dictionary.
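The lookup order can be sketched as a small layered structure in Python. The class name `LayeredDictionary` and the representation of a stop-word dictionary as a word-to-boolean mapping are assumptions for illustration, not the engine’s actual data model.

```python
from typing import Dict, Optional


class LayeredDictionary:
    """Consult the customer's custom entries before the static defaults."""

    def __init__(self, static: Dict[str, bool],
                 custom: Optional[Dict[str, bool]] = None):
        self.static = static
        self.custom = custom or {}

    def lookup(self, word: str) -> Optional[bool]:
        if word in self.custom:
            # Found in the custom layer: the static layer is never consulted.
            return self.custom[word]
        return self.static.get(word)


# Hypothetical stop-word data: True means "treat as a stop word".
stop_words = LayeredDictionary(
    static={"down": True, "the": True},
    custom={"down": False},  # a jacket retailer keeps "down" searchable
)

print(stop_words.lookup("down"))  # → False
print(stop_words.lookup("the"))   # → True
```

A word that appears in neither layer returns `None`, meaning no dictionary has an opinion about it.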
And that’s it: our customers can now slip on one slipper and help their own customers find the other slipper(s) – in fur or glass.