Natural languages

Algolia supports multiple languages by matching the text typed in the search box with the text found in the index. If you want Algolia to recognize plurals and stop words like “the”, “a”, “of”, you’ll need to specify the language used in your index. Algolia uses language-specific dictionaries to remove stop words, detect declined or pluralized forms, separate compound words, and handle Asian logograms (CJK), Arabic vowels, and diacritics.

Algolia supports multiple languages

Algolia is language agnostic: it matches the text typed in the search box with the text in the index. This is called textual matching. For example, suppose you have an index with only English text, and a user searches in Japanese. In that case, Algolia won’t return any results because Japanese characters don’t match text written in the Latin alphabet. If your users search in Japanese, your index should contain Japanese text. If you want to support multiple languages, the most common solution is to create one index per language.

Algolia uses a wide array of natural language techniques, ranging from general, such as finding words or using Unicode, to specific, including distinguishing letters from logograms, breaking down compound words, and using single-language dictionaries for vocabulary. This page covers the following natural language understanding strategies:

Normalization
Dictionaries
Configuring typo tolerance
Natural language processing (NLP) with Rules

Some language-based techniques (such as normalization) play an integral role and are performed with every indexing and search operation. You can’t turn these off. Other techniques rely on specialized dictionaries, which facilitate word and word-root detection, and come with several configurable options. Depending on the use case, Algolia offers additional techniques (like typo tolerance and Rules) that you can turn on, turn off, or fine-tune using Algolia’s API settings.

Normalization

Algolia performs normalization during indexing and at query time, ensuring consistency in how your data is represented and matched. You can’t globally turn off normalization, but you can turn it off for certain special characters. The normalization process is language-agnostic and applies to all supported languages.

What does normalization mean?

Turns all characters to lowercase
Removes special characters (diacritics) such as accents, umlauts, and Arabic short vowels. However, you can keep diacritics with the keepDiacriticsOnCharacters index setting
Removes punctuation within words, for example, apostrophes
Manages punctuation between words
Uses word separators, such as spaces or other characters
Includes or excludes non-alphanumeric characters (separatorsToIndex)
Transforms traditional Chinese into modern

Some of these actions are part of the tokenization process. Understanding this process will help you understand how Algolia concatenates and splits words.
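The first two steps in the list above (lowercasing and removing diacritics) can be sketched in a few lines of Python. This is a minimal illustration using the standard library, not Algolia's implementation. The function name normalize and its keep_diacritics_on argument are hypothetical and only mimic the idea behind the keepDiacriticsOnCharacters setting.

```python
import unicodedata


def normalize(text: str, keep_diacritics_on: str = "") -> str:
    """Lowercase text and strip diacritics (toy sketch, not Algolia's code).

    Characters listed in keep_diacritics_on are left untouched, loosely
    mirroring what keepDiacriticsOnCharacters is described as doing.
    """
    out = []
    for ch in text.lower():
        if ch in keep_diacritics_on:
            out.append(ch)
            continue
        # NFD splits a letter from its combining marks (e.g. "é" -> "e" + accent).
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop the combining marks and keep the base character.
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


print(normalize("Cr\u00e8me Br\u00fbl\u00e9e"))                   # creme brulee
print(normalize("\u00dcber", keep_diacritics_on="\u00fc"))  # über
```

Because the same function would run on both records and queries, a search for “creme” and a record containing “Crème” end up with identical text, which is the consistency that normalization is meant to provide.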

Add language specificity using dictionaries

Some of these automated techniques only work in some languages.

No automatic language detection

Algolia doesn’t attempt to detect the language of your records or of the queries your users type. Therefore, to benefit from language-specific algorithms, you need to tell Algolia in which language your records should be interpreted.

If you don’t pick a language, Algolia assumes you want to cover all supported languages. The drawback here is that you create ambiguities by mixing every language’s peculiarities. For example, Italian plural forms are then also applied to English text, causing problems such as the following: “paste”, the plural of “pasta” in Italian, is also treated as the plural of “pasta” in English. That’s wrong, because “paste” in English is a word in its own right (to spread).
It’s okay to mix two or three languages in a single index and specify them in your settings. However, you should prepare your indices and records appropriately. For more on this, refer to the multiple languages tutorial.
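The “paste” ambiguity described above can be reproduced with two tiny plural tables. This is a sketch under stated assumptions: the PLURALS data and the singular_candidates function are invented for illustration and are far smaller and simpler than any real dictionary.

```python
# Toy singular -> plural tables, invented for this example.
PLURALS = {
    "it": {"pasta": "paste", "pizza": "pizze"},
    "en": {"pasta": "pastas", "pizza": "pizzas"},
}


def singular_candidates(word, languages):
    """Return the singular forms that `word` could be a plural of."""
    found = set()
    for lang in languages:
        for singular, plural in PLURALS[lang].items():
            if plural == word:
                found.add(singular)
    return found


# With English only, "paste" is not treated as a plural of anything.
print(singular_candidates("paste", ["en"]))        # set()
# Mixing in Italian makes "paste" match "pasta".
print(singular_candidates("paste", ["en", "it"]))  # {'pasta'}
```

The more languages are consulted at once, the more of these accidental matches appear, which is why restricting an index to the languages it actually contains gives cleaner results.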

Even though Algolia can do most tasks without knowing the language of an index, some tasks require knowledge of the language. For example, Algolia can only compare plural to singular forms by knowing the language. The same applies to removing small words like “to” and “the” (stop words).

Because the default language of an index is all supported languages, enabling the removeStopWords or ignorePlurals parameters without setting an index’s language can ignore the wrong plurals and remove the wrong stop words. It’s therefore essential to set the query languages of all your indices.
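As a rough sketch, the settings for a single-language index might be assembled like this. removeStopWords and ignorePlurals are named in the text above. The queryLanguages key is assumed here to be the setting that holds the query languages, and the build_language_settings helper is hypothetical. Check the exact names and accepted values against the API reference before using them.

```python
import json


def build_language_settings(languages):
    """Build an index-settings payload for the given language codes (sketch)."""
    return {
        # Assumed setting name for "the query languages" of the index.
        "queryLanguages": list(languages),
        # Language-dependent features mentioned in the text above.
        "removeStopWords": True,
        "ignorePlurals": True,
    }


settings = build_language_settings(["en"])
print(json.dumps(settings, indent=2))
```

The point of building the payload in one place is that the language list and the language-dependent features are always set together, so stop words and plurals are never enabled against the default of all languages.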

Dictionaries

Several language-related methods require the use of dictionaries. With dictionaries, Algolia can apply language-specific, word-based logic to your data and your user’s queries. Algolia maintains separate, language-specific dictionaries for:

Removing stop words
Detecting pluralized and other declined forms (alternative forms of words due to number, case, or gender)
Splitting compound words (also known as decompounding)
Handling Asian logograms (CJK)

Algolia provides default dictionaries for all supported languages and updates them over time. You can customize the stop words, declensions, and decompounding dictionaries for your needs.
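To make the role of a stop word dictionary concrete, here is a small sketch of dictionary-based stop word removal with optional custom entries. The word lists, the function name, and the fallback of keeping the original words when everything would be removed are all assumptions made for this example, not a description of Algolia's behavior.

```python
# Tiny per-language stop word lists, invented for this example.
STOP_WORDS = {
    "en": {"the", "a", "of", "to"},
    "fr": {"le", "la", "de", "des"},
}


def remove_stop_words(query, language, custom=()):
    """Drop stop words for one language; `custom` adds user-defined entries."""
    stop_words = STOP_WORDS[language] | set(custom)
    words = query.lower().split()
    kept = [w for w in words if w not in stop_words]
    # Design choice of this sketch: never return an empty query.
    return kept or words


print(remove_stop_words("The history of pasta", "en"))
# ['history', 'pasta']
print(remove_stop_words("buy the pasta", "en", custom={"buy"}))
# ['pasta']
```

Note that the same query processed with the "fr" list would keep “the” and “of”, which shows why the dictionary has to match the language of the text.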

Typo tolerance and languages

What’s a typo?

A missing letter: “hllo” → “hello”
An extraneous letter: “heello” → “hello”
Inverted letters: “hlelo” → “hello”
A substituted letter: “heilo” → “hello”

Typo tolerance allows users to make mistakes while typing and still find the words they’re looking for. This is done by matching words that are close in spelling.
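One common way to measure how “close in spelling” two words are is an edit distance that counts the four kinds of typo listed above. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance. It's offered as an illustration of the idea only; the source doesn't state which algorithm Algolia uses, and the function name typo_distance is hypothetical.

```python
def typo_distance(a: str, b: str) -> int:
    """Count insertions, deletions, substitutions, and adjacent swaps
    needed to turn `a` into `b` (optimal string alignment distance)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # extraneous letter in a
                d[i][j - 1] + 1,         # missing letter in a
                d[i - 1][j - 1] + cost,  # substituted letter (or match)
            )
            # Inverted letters: two adjacent characters swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for typo in ["hllo", "heello", "hlelo", "heilo"]:
    print(typo, "->", typo_distance(typo, "hello"))  # each prints 1
```

Each of the four example typos is exactly one edit away from “hello”, so a matcher that accepts words within a small distance would still find the intended word.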

Other spelling errors

Extra or missing spaces and punctuation don’t count as typos. Algolia only handles them if typoTolerance is enabled (set to true, min, or strict). For example:

A missing space between two words is handled by splitting: “helloworld” → “hello world”
Extra spaces or punctuation are handled by concatenation: “hel lo” → “hello”
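Splitting and concatenation can be pictured as lookups against the words known to the index. The following sketch assumes a tiny in-memory vocabulary, and both helper functions are hypothetical. It shows the principle, not how Algolia implements it.

```python
# Words assumed to exist in the index, invented for this example.
VOCABULARY = {"hello", "world"}


def split_word(word, vocabulary):
    """Return (left, right) if `word` is two known words glued together."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in vocabulary and right in vocabulary:
            return left, right
    return None


def concatenate_pairs(words, vocabulary):
    """Join adjacent query words when the joined form is a known word."""
    result = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i] + words[i + 1] in vocabulary:
            result.append(words[i] + words[i + 1])
            i += 2  # both words were consumed
        else:
            result.append(words[i])
            i += 1
    return result


print(split_word("helloworld", VOCABULARY))           # ('hello', 'world')
print(concatenate_pairs(["hel", "lo"], VOCABULARY))   # ['hello']
```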

Typos as language-dependent

English illustrates the principle well because it’s phonemic: it combines single characters that represent sounds to form a word, which is what makes spelling errors possible. Algolia doesn’t support typo tolerance for logogram-based languages (like Chinese and Japanese), as these languages use pictorial characters to represent partial or complete words instead of single letters to represent sounds. For alphabet-based and phonemic languages (like English, French, and Russian), you can configure how Algolia applies typo tolerance.

Turn off typo tolerance and prefix search on specific words

The advancedSyntax parameter lets you turn off typo tolerance on specific words in a query by using double quotes. For example, the query “foot problems” is typo tolerant on both query words, while “"foot" problems” is only typo tolerant on “problems”. This parameter also disables prefix searching on words inside the double quotes.
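The effect of the double quotes can be sketched as a small parser that separates quoted words (matched exactly) from the rest (typo tolerant). The function below is a hypothetical illustration of how such a query is read, not Algolia's parser. The search_params dictionary shows the parameter named in the text alongside the query, with the surrounding structure assumed.

```python
import re


def split_exact_and_tolerant(query):
    """Return (exact, tolerant): words inside double quotes vs. the rest.

    Unbalanced quotes are left in place and treated as ordinary text.
    """
    exact = []
    for phrase in re.findall(r'"([^"]*)"', query):
        exact.extend(phrase.split())
    # Remove the quoted parts, then split what remains into words.
    remainder = re.sub(r'"[^"]*"', " ", query)
    return exact, remainder.split()


print(split_exact_and_tolerant('foot problems'))
# ([], ['foot', 'problems'])
print(split_exact_and_tolerant('"foot" problems'))
# (['foot'], ['problems'])

# Assumed shape of the search parameters sent with the query.
search_params = {"query": '"foot" problems', "advancedSyntax": True}
```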

Natural language processing with Rules

You can set up Rules to tell Algolia to look for specific words or phrases in a query and take a specific action or change its default behavior when it finds them. For example, Algolia can convert some query terms into filters. If a user types in a filter value (for example, “red”), you can use this term as a filter instead of a search term. With the query “red dress”, Algolia could then look for the word “dress” only in the “red” records (based on a filter attribute). The process of removing filter values from the query string and using them directly as filters is called dynamic filtering. Dynamic filtering is only one way that Rules can understand and detect the user’s intent.
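The idea behind dynamic filtering can be sketched in plain Python. In practice this behavior is configured through Rules rather than written by hand. The color attribute, the list of known values, and the extract_color_filter function are all assumptions made for this illustration.

```python
# Filter values assumed to exist for a hypothetical "color" attribute.
KNOWN_COLORS = {"red", "blue", "green"}


def extract_color_filter(query):
    """Move known color words out of the query and into a filter string."""
    remaining, colors = [], []
    for word in query.lower().split():
        if word in KNOWN_COLORS:
            colors.append(word)
        else:
            remaining.append(word)
    filters = " OR ".join(f"color:{c}" for c in colors)
    return " ".join(remaining), filters


print(extract_color_filter("red dress"))  # ('dress', 'color:red')
print(extract_color_filter("dress"))      # ('dress', '')
```

With the query “red dress”, only “dress” remains as a search term, and the “red” value narrows the records to search, which matches the behavior described above.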
