Language is a funny thing. For example, we take for granted that Cinderella wore glass slippers. Only in a fairy tale can people walk in glass shoes. But maybe it’s metaphorical – the fragility of being Cinderella. Or maybe it’s simply a mix-up between the Old French word “vair” (squirrel fur) and the French “verre” (glass).

Language is also hard to pin down. Especially when lost in translation. But no need to despair – sometimes what gets lost can have surprising and moving results.

But we don’t always want to be surprised. Like when we ask direct questions or search for specific items that match our queries. At that point, we aspire to be crystal-clear. 

That’s where dictionaries come in. Dictionaries allow us to be clear, by reinforcing the clarity of each word within the context of the larger phrase. We use dictionaries to reinforce our natural language processing (NLP). Here’s how.

Stop words and plurals, and compounds and segments – these are a few of my favorite things

Many users still type questions like “What is the best search engine?” instead of the shorter “best search engine”. It’s natural for them to type the way they speak. But other people prefer to use shorter, incomplete phrases to return the same results. With advances in search technology, even nonsensical queries, such as “engine best search”, return great results.

Nevertheless, the full phrase is still in fashion – even more so with voice. The success of voice search depends on allowing people to speak naturally. And stop words are key to that. Stop words reduce a natural phrase to its bare essence: keywords. By dropping such words as “what”, “is”, and “the” from the above query, and leaving only the keywords “best”, “search”, and “engine”, the search engine can match the query to the underlying data in a more reliable and relevant way. 

Granted, all words are important – “What” and “Why” are indeed meaningful distinctions – but if a search algorithm relies on textual matching (as opposed to matching on meaning or semantics), its only job is to compare characters and words. By removing stop words, therefore, you remove the false positives that match on the word “the”. 
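As a rough illustration of the idea above, here is a minimal Python sketch of stop-word removal for a purely textual matcher. The `STOP_WORDS` set and the `remove_stop_words` helper are made up for this example; they are not the dictionary or the API the engine actually uses.

```python
import re

# Hypothetical, tiny stop-word list; a real dictionary is much larger.
STOP_WORDS = {"what", "is", "the", "a", "an", "of"}

def remove_stop_words(query: str) -> list:
    """Lowercase the query, split it into words, and drop the stop words."""
    words = re.findall(r"\w+", query.lower())
    return [word for word in words if word not in STOP_WORDS]

print(remove_stop_words("What is the best search engine?"))
# ['best', 'search', 'engine']
```

Both the long question and the short “best search engine” reduce to the same three keywords, which is what lets a textual matcher treat them alike.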

The same holds for normalization (e.g., removing accents) and plurals. Any search algorithm that focuses on text, and not the meaning of text, should ignore textual variations (like plurals) to enable more relevant and unambiguous word matching.

Lastly, textual matching also needs to separate words into useful parts. A “houseboat” is not just a house or a boat but a boat specially made to be used as a house. To help reach that level of precision, a textual search algorithm needs to break a word into its constituent parts (atoms), using techniques like segmentation and decompounding.

The goal of segmentation or decompounding is not to understand the meaning of words, but to find out what a complex word can be decomposed into. We’re trying to find the “atoms” of the word. We don’t use it in English because most words are already decompounded; it’s in the language’s DNA. The same goes for French. But in German, for example, Hundehütte, meaning “dog kennel”, is composed of “Hund” (dog) and “Hütte” (kennel/house). The space we already have between the two words in English is why we don’t need decompounding. Segmentation is essentially the same thing, but for languages written with no spaces between words at all (e.g., Chinese and Japanese).
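To make decompounding concrete, here is a small dictionary-based sketch in Python. It is only an illustration under stated assumptions: the `VOCABULARY` set, the `LINKERS` tuple (German linking elements such as the “e” in Hundehütte), and the `decompound` function are all hypothetical, and a production engine would use a far larger dictionary and a more careful strategy.

```python
from typing import List, Optional

# Hypothetical mini-vocabulary of German "atoms".
VOCABULARY = {"hund", "hütte", "haus", "tür"}
# Assumed, non-exhaustive list of linking elements that may sit between atoms.
LINKERS = ("", "e", "s", "n", "es", "en")

def decompound(word: str) -> Optional[List[str]]:
    """Return the atoms of `word`, or None if it cannot be fully split."""
    word = word.lower()
    if word in VOCABULARY:
        return [word]
    for end in range(1, len(word)):
        head = word[:end]
        if head not in VOCABULARY:
            continue
        rest = word[end:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            # Skip the linking element and try to split what remains.
            tail = decompound(rest[len(linker):])
            if tail is not None:
                return [head] + tail
    return None

print(decompound("Hundehütte"))  # ['hund', 'hütte']
```

The function only looks for known atoms; it makes no attempt to understand what the compound means, which matches the goal described above.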

That’s where dictionaries come in.

Using dictionaries for text-based matching

One approach to natural language processing is to use dictionaries, such as a stop-word dictionary, plurals dictionary, and a compound-word dictionary. For example, you can parse a downloaded list of stop words from Wiktionary, not only in English, but in many other languages.

Here’s the process we used:

  • Download the full Wiktionary dictionary – words, definitions, and much more
  • Extract the words
  • Store them in text files
  • Compile them into a binary format
  • Optimize code for performance

We do this for every language and it works fairly well for most use cases. But when it doesn’t work, it breaks relevance – which is a critical show-stopper for search engines. Here are some problems we encountered:

I’m “down” with relevance

“Down” is a reasonable stop word, except when you’re searching for “down jackets”. Companies who sell “leather”, “suede”, and “down” jackets cannot remove “down” from the query.

Ambiguity cloaked in “fur” (not squirrel fur)

Languages that use accents, like French and Spanish, fare well when normalized with accent removal. For example, normalizing “voilà” to “voila” causes no loss in meaning. In fact, it’s rare in French that removing an accent would create an ambiguity. German is not so lucky. For example, the accented “ä”, when normalized to “a”, will change the meaning of some words.

A curious example of this is the German word wählen, meaning “to choose” in English. If you remove the accent, most people will not object – except for the 1500 residents of the small German-speaking Swiss town, Wahlen. It might be hard to find the town “Wahlen” among the many results that match on “wählen” – thus, hurting tourism in that part of the world. 
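The collision is easy to reproduce. The sketch below strips accents with Python’s standard `unicodedata` module; the `strip_accents` helper is a name chosen for this example and stands in for whatever generic accent removal an engine applies.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("voilà"))  # voila

# After generic accent removal and lowercasing, the verb and the town collide.
print(strip_accents("wählen").lower() == strip_accents("Wahlen").lower())  # True
```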

The solution is to apply a special custom normalization for German. In this case, normalize “ä” to “ae”. Here’s a complete list:

ä → ae
ö → oe
ü → ue
Ä → Ae
Ö → Oe
Ü → Ue
ß → ss (or SZ for capital)
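The mapping above can be sketched in a few lines of Python. This is an illustration, not the engine’s implementation: `GERMAN_MAP` and `normalize_german` are hypothetical names, and the capital form of ß is left out for brevity.

```python
import unicodedata

# Translation table built from the list above (capital ß omitted in this sketch).
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def normalize_german(text: str) -> str:
    """Apply the German-specific replacements instead of dropping the umlauts."""
    # Compose first (NFC) so that "a" + combining diaeresis becomes a single "ä".
    return unicodedata.normalize("NFC", text).translate(GERMAN_MAP)

print(normalize_german("wählen"))  # waehlen
print(normalize_german("Wahlen"))  # Wahlen
```

With this mapping, the verb and the town name no longer normalize to the same string.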

But this leads to a second problem, which illustrates the gymnastics search engines go through when dealing with languages. (Remember, language is funny…) So we normalize “für” to “fuer”, but now we lose “für” as a stop word, because the normalized “fuer” is not in the stop-word dictionary.

That’s where custom dictionaries come in.

The solution – Giving customers control with custom dictionaries

We realized that one dictionary per category wasn’t enough: we needed an additional dictionary per customer that they could use to override the Wiktionary defaults or add their own words. So now we have two dictionaries per category (stop words, plurals, etc.): one per language, which we ship with our software, and one custom dictionary per customer, which they can add words to.

Adding custom dictionaries – meaning, allowing each customer to override and add their own words to our dictionaries – required a bit of refactoring in how we dealt with our standard dictionaries: each dictionary-retrieval function had a different interface, and each dictionary dataset had a different format. So the first step was to normalize our code and data.

Normalizing our codebase and standardizing our dictionaries interface

We examined the dictionaries that we shipped to our customers as part of our base product. We wanted to abstract what the dictionaries had in common. Since they all had the same kind of data and goals, we were able to do the following:

  • Create similar data = a list of words
  • Code the same goal = the ability to retrieve words

To put these dictionaries into a single interface, the main tasks included (in this order):

  1. Transforming all the dictionary datasets to have the same data structure: the trie
  2. Migrating existing dictionaries to the new format
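For readers unfamiliar with the trie, here is a minimal Python sketch of the data structure: each node holds one character’s worth of branching, so words that share a prefix share nodes. The `Trie` class below is a generic textbook version written for illustration, not the engine’s actual implementation.

```python
class Trie:
    """A minimal prefix tree storing a set of words."""

    def __init__(self) -> None:
        self.children = {}    # maps a character to the child Trie node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def __contains__(self, word: str) -> bool:
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

stop_words = Trie()
for w in ("the", "then", "is"):
    stop_words.insert(w)

print("the" in stop_words)  # True
print("th" in stop_words)   # False: a prefix only, not a stored word
```

Because every dictionary boils down to a list of words to retrieve, one structure like this can back stop words, plurals, and compounds alike.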

In the end, our new format for the plurals dictionary file is:

[2-letter language code]=[word1,word2,..]

In keeping with our introduction, here’s a good example of plurals:

en=feet,feets,foot,foots
en=slipper,slippers
en=squirrel,squirrels
en=fur,furs
en=Cinderella,Cinderellas
en=Cinderfella,Cinderfellas
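Reading such a file is straightforward. The following Python sketch parses lines of this shape into groups of equivalent word forms, keyed by the two-letter code before the `=`. The `parse_plurals` and `build_lookup` helpers are hypothetical names for this example and say nothing about how the engine compiles the file internally.

```python
from collections import defaultdict

def parse_plurals(lines):
    """Map each two-letter code to its list of word groups."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        code, _, words = line.partition("=")
        groups[code].append([w for w in words.split(",") if w])
    return dict(groups)

def build_lookup(word_groups):
    """Map every word form to the full group it belongs to."""
    return {word: group for group in word_groups for word in group}

SAMPLE = [
    "en=feet,feets,foot,foots",
    "en=slipper,slippers",
]

plurals = parse_plurals(SAMPLE)
lookup = build_lookup(plurals["en"])
print(lookup["slippers"])  # ['slipper', 'slippers']
```

With the lookup in place, a query for “slippers” can be expanded to match records containing “slipper”, and vice versa.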

That’s the first part: unifying both the interface and structure of the data.

With that, we achieved the following goals:

  • A simpler dictionary interface for all dictionaries
  • Shared tooling and tests
  • Easier maintenance

Plugging in customer-created, custom dictionaries

Now that we had a single interface for every dictionary, we were able to integrate customer-defined words for every NLP technique, for example, customer-specific stop words (see “down” example above), customer-specific normalization (see “für” example above), and so on.

These custom dictionaries are added to the index on top of the static dictionaries. We prioritized the dictionary lookups: a query consults the custom dictionary first. If the word is found there, the engine doesn’t need to look at the static dictionary.
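The lookup order can be sketched as follows in Python. Everything here is an assumption made for illustration: the dictionary contents, the convention that a custom entry maps a word to `True` (treat as stop word) or `False` (override the default), and the `is_stop_word` and `filter_query` helpers.

```python
# Hypothetical static (shipped) stop words for one language.
STATIC_STOP_WORDS = {"the", "is", "what", "down"}

# Hypothetical customer overrides: True adds a stop word, False disables one.
CUSTOM_STOP_WORDS = {"down": False, "jacket": True}

def is_stop_word(word: str, custom: dict, static: set) -> bool:
    """Consult the custom dictionary first; fall back to the static one."""
    if word in custom:
        return custom[word]  # found: no need to look at the static dictionary
    return word in static

def filter_query(query: str, custom: dict, static: set) -> list:
    """Keep only the words of the query that are not stop words."""
    return [w for w in query.lower().split() if not is_stop_word(w, custom, static)]

print(filter_query("the best down jackets", CUSTOM_STOP_WORDS, STATIC_STOP_WORDS))
# ['best', 'down', 'jackets']
```

A store that sells down jackets keeps “down” in its queries, while customers without the override still get the default behavior.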

And that’s it: our customers can now slip on one slipper and help their own customers find the other slipper(s) – in fur or glass.

About the author
Joris Valette

C++ Software Engineer
