Handling Natural Languages
- The engine supports all languages: It simply matches the text typed in the search bar with the text found in the index.
- If you want the engine to recognize plurals and stop words like “the”, “a”, “of”, etc., you’ll need to specify the language used in your index.
- The engine uses language-specific dictionaries to detect words and combined words, and to handle Asian logograms (CJK), Arabic vowels, and diacritics.
Algolia supports all languages
The Algolia engine supports all languages out of the box. It is language agnostic. It simply matches the text typed in the search bar with the text in the index. This is what we call textual matching.
For example, if you have an index with Japanese text, and someone starts typing in English, they will not see anything because English characters do not match Japanese characters. If your users are searching in Japanese, your index should probably contain Japanese text. Additionally, if you want to support multiple languages, the most common solution is to create one index for each language.
To make textual matching work across languages, the engine uses a wide array of natural language techniques, ranging from the most general (finding words, using Unicode) to the most specific (distinguishing letters from logograms, breaking down compound words, and using single-language dictionaries for vocabulary).
We’ve organized this page around four natural language understanding strategies:
- Engine-level processing (normalization)
- Query Processing
- Natural Language Processing (NLP) with Rules
- Configuring Typo Tolerance
Some language-based techniques (such as normalization) play an integral role in the engine, and are performed with every indexing and search operation. These are generally not configurable. Other techniques rely on specialized dictionaries, which facilitate word and word-root detection. These come with several configurable options. Finally, Algolia offers many other techniques (like typo tolerance, Rules) that can be enabled, disabled, or fine-tuned according to the use case. These are also configurable using Algolia’s API settings and suggested best practices.
How the engine normalizes data
Normalization is done at both indexing and query time, ensuring consistency in how your data is represented as well as matched.
You cannot disable normalization. Additionally, there is no language-specific normalization: what we do for one language we do for all. Our normalization process is therefore language-agnostic.
What does normalization mean?
- Switch all characters to lower case
- Remove all diacritics (e.g., accents)
- Remove punctuation within words (e.g., apostrophes)
- Manage punctuation between words
- Use word separators (such as spaces, but not only)
- Include or exclude non-alphanumeric characters (separatorsToIndex)
- Transform traditional Chinese to modern
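To make a few of these steps concrete, here is a minimal Python sketch that lower-cases text, strips diacritics, removes apostrophes inside words, and treats other punctuation as a word separator. It is only an approximation for illustration: the function name `normalize` is hypothetical, and the engine's actual normalization (including its handling of separators and Chinese characters) is not shown here.

```python
import unicodedata


def normalize(text: str) -> str:
    """Approximate a few of the normalization steps listed above (illustrative only)."""
    # Switch all characters to lower case.
    text = text.lower()
    # Remove diacritics: decompose each character (NFD), then drop combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove apostrophes inside words (straight and curly).
    text = text.replace("'", "").replace("\u2019", "")
    # Assumption: every remaining non-alphanumeric character acts as a separator.
    words = "".join(ch if ch.isalnum() else " " for ch in text).split()
    return " ".join(words)


print(normalize("Café L'Été, déjà-vu!"))
```

Because the same function would run on both the indexed text and the query, “Café” and “cafe” end up as the same token on both sides.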
Adding Language Specificity using Dictionaries
As already suggested, some of our automated techniques do not work in all languages. Here we discuss how we add language-specific functionality to the Engine.
No automatic language detection
Algolia does not attempt to detect the language of your data nor the language of your end-users as they type in queries.
Therefore, to benefit from our language-specific algorithms, you will need to tell the engine in what language you want your data to be interpreted.
- If you do not designate a language, we will consider that you want to support every language. The drawback here is that you will create ambiguities by mixing every language’s peculiarities. For example, plurals in Italian will be applied to plurals in English, causing problems such as the following: “paste”, the plural of “pasta” in Italian, will also be considered the plural of “pasta” in English, which is not the case, as “paste” in English is a word in its own right (to spread).
- If you designate more than one language (say 2 or 3) - because your index contains more than one language - this is fine, as long as you prepare your data appropriately and carefully. For more on this, check out our multiple languages tutorial.
Even though the engine can do most tasks without knowing the language of an index, some tasks require knowledge of the language. For example, the engine can only compare plural to singular forms by knowing the language. The same applies to removing small words like “to” and “the” (stop words).
Because the default language of an index is all languages, enabling removeStopWords or ignorePlurals without setting an index’s language will ignore the wrong plurals and remove the wrong stop words. It is therefore very important to set the query languages of all your indices.
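As a sketch of what this looks like in practice, the settings for an English-only index might resemble the payload below. The parameter names `removeStopWords` and `ignorePlurals` come from this page, and `queryLanguages` is the setting that carries the query languages; the variable name `index_settings` is hypothetical, and how you send the payload depends on the API client you use, which is not shown here.

```python
import json

# Hypothetical settings payload for an index that contains only English text.
index_settings = {
    # Tell the engine which language to use for language-specific processing.
    "queryLanguages": ["en"],
    # With a language set, only English stop words ("the", "a", "of") are removed.
    "removeStopWords": True,
    # With a language set, only English singular/plural forms are matched.
    "ignorePlurals": True,
}

print(json.dumps(index_settings, indent=2))
```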
Several of our language-related methods require the use of dictionaries. With dictionaries, the engine can apply language-specific, word-based logic to your data and end-user queries. Each method below, whether it handles stop words or alternative forms, uses a different dictionary. The success of this approach relies on the accuracy and completeness of the dictionary we use. We regularly update these dictionaries.
The engine uses these language-specific dictionaries to detect words, separate out combined words, and handle Asian logograms (CJK), Arabic vowels, and diacritics.
Disabling typoTolerance and prefix search on specific words
The advancedSyntax boolean parameter provides a way to disable typo tolerance on specific words in a query by using double quotes:
foot problems would be typo tolerant on both query words, while
"foot" problems would only be typo tolerant on problems.
This parameter also disables prefix searching on words inside the double quotes.
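The effect of the double quotes can be pictured with a small Python sketch. It is not how the engine parses queries; the function name `split_query` and the boolean flag are hypothetical, and the flag simply marks whether typo tolerance and prefix search would still apply to a word.

```python
import re
from typing import List, Tuple


def split_query(query: str) -> List[Tuple[str, bool]]:
    """Return (word, tolerant) pairs; tolerant is False for words inside double quotes."""
    result: List[Tuple[str, bool]] = []
    # Match either a double-quoted phrase or a run of non-space characters.
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', query):
        if quoted:
            # Quoted words: no typo tolerance, no prefix search.
            result.extend((word, False) for word in quoted.split())
        elif bare:
            # Unquoted words keep the default behavior.
            result.append((bare, True))
    return result


print(split_query('foot problems'))
print(split_query('"foot" problems'))
```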
Typo Tolerance and Languages
What is a typo?
- A missing letter in a word, “hllo” → “hello”
- An extraneous letter, “heello” → “hello”
- Inverted letters, “hlelo” → “hello”
Typo tolerance allows users to make mistakes while typing and still find the words they are looking for. This is done by matching words that are close in spelling.
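One common way to measure how “close in spelling” two words are is an edit distance that counts insertions, deletions, substitutions, and swaps of adjacent letters. The Python sketch below uses the optimal string alignment variant of that distance. This is an assumption made for illustration: this page does not state which algorithm the engine uses, and the function name `typo_distance` is hypothetical. Note that the sketch also counts a substituted letter as one typo, which the list above does not mention.

```python
def typo_distance(a: str, b: str) -> int:
    """Count single-character edits (insert, delete, substitute, adjacent swap) from a to b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the first i chars of a and the first j chars of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # extraneous letter in a
                d[i][j - 1] + 1,         # missing letter in a
                d[i - 1][j - 1] + cost,  # same or substituted letter
            )
            # Two adjacent letters typed in the wrong order.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for typed in ("hllo", "heello", "hlelo"):
    print(typed, typo_distance(typed, "hello"))
```

Under this measure, each of the three examples above is a single typo away from “hello”.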
Typos as language-dependent
As you can see, we use English to illustrate the principle. That works because English is phonemic: it uses single characters to represent sounds, and combines them to form words. Spelling errors are possible in that case.
However, we don’t support typo-tolerance for logogram-based languages (such as Chinese, Japanese, Korean, and Vietnamese), as these languages use pictures to represent partial or full words instead of single letters to represent sounds.
For alphabet-based and phonemic languages (English, French, Russian, and so on), we offer many ways to configure the engine to improve typo tolerance.
Natural Language Processing with Rules
You can set up Rules that tell the engine to look for specific words or phrases in a query and, if it finds them, to take a specific action or change its default behavior.
For example, some query terms can be converted into a filter. If a user types in a filter value - say, “red” - that term can be used as a filter instead of as a search term. If the full query were “red dress”, then the engine would only look at the “red” records for the word “dress”. This is called dynamic filtering, where filter values are removed from the search and used directly as filters.
Dynamic filtering is only one way that Rules can understand and detect the intent of the user.
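The “red dress” example can be sketched in a few lines of Python. This only illustrates the idea of dynamic filtering and is not the Rules API: the `color` attribute, the `KNOWN_COLORS` set, and the function name `apply_color_rule` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical filter values for a "color" attribute.
KNOWN_COLORS = {"red", "blue", "green"}


def apply_color_rule(query: str) -> Tuple[str, List[str]]:
    """Split a query into the remaining search text and the color filters found in it."""
    remaining: List[str] = []
    filters: List[str] = []
    for word in query.split():
        if word.lower() in KNOWN_COLORS:
            # The term is removed from the search and used as a filter instead.
            filters.append(f"color:{word.lower()}")
        else:
            remaining.append(word)
    return " ".join(remaining), filters


print(apply_color_rule("red dress"))
```

Here the query “red dress” becomes a search for “dress” restricted to records whose color is red.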