
Language-specific configurations

When Algolia knows the language of your data and your users, the engine can apply word-based processing techniques, such as:

  • Removing common (stop) words like “the” and “a”
  • Making singulars and plurals equivalent
  • Detecting word roots
  • Separating or combining compound words.

Setting the search language

Algolia doesn’t attempt to detect the language of an index automatically. If you want language-based settings like typo tolerance, stop words, and plurals to work correctly, you should tell the engine which language you want these settings to use.

If you don’t, the engine will use the default setting (all languages), which may result in anomalies such as applying French spellings to English words.

You can set the language individually for each setting or globally, with one setting per index.
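The two approaches can be sketched as settings payloads. This is a minimal illustration, assuming English content: queryLanguages, indexLanguages, and ignorePlurals are named on this page, while removeStopWords and the exact value shapes (a boolean or a list of language codes) are assumptions about the settings API that you should check against the API reference. Sending the payload to an index is left out.

```python
import json

# Option 1: declare the language once for the whole index.
# Assumption: boolean-enabled settings then follow queryLanguages.
global_settings = {
    "indexLanguages": ["en"],  # language of the records
    "queryLanguages": ["en"],  # language of the queries
    "removeStopWords": True,
    "ignorePlurals": True,
}

# Option 2: name the language on each setting individually.
per_setting_settings = {
    "removeStopWords": ["en"],
    "ignorePlurals": ["en"],
}

print(json.dumps(global_settings, indent=2))
print(json.dumps(per_setting_settings, indent=2))
```

The first form keeps every language-based feature aligned on one declaration; the second is useful when only one feature needs a specific language.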

Removing stop words

To separate a query’s key terms from its common words (such as “the”, “on”, and “it”), you can instruct the engine to ignore these common words. This helps the engine focus on the essentials of what people are looking for: nouns and adjectives.

Algolia references several sources (including Wiktionary and ranks.nl) to create a list of stop words in all supported languages.
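The idea behind stop-word removal can be shown with a toy filter. This is a sketch only: the word list and the remove_stop_words helper are hypothetical, and the engine's own lists and tokenization are far more complete.

```python
# Tiny illustrative stop-word list (hypothetical, not Algolia's list).
ENGLISH_STOP_WORDS = {"the", "a", "an", "on", "it", "of", "for"}


def remove_stop_words(query: str, stop_words: set[str]) -> list[str]:
    """Return the query's tokens with stop words dropped (case-insensitive)."""
    tokens = query.lower().split()
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("The cat on the mat", ENGLISH_STOP_WORDS))
# ['cat', 'mat']
```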

Ignoring plurals (and other alternative forms)

Algolia’s ignorePlurals parameter, if enabled, tells the engine to consider a word’s plural and singular forms as equivalent.

For example, in English, “cars” = “car” and “feet” = “foot”. To ensure completeness and support multiple languages, Algolia uses Wiktionary templates to declare alternative forms of a word. For example, the template {en-noun|s} shows up like this on Wiktionary’s “car” page:

car (plural cars)

With Wiktionary templates, Algolia builds a dictionary of alternative forms. Almost every language has its own template syntax, and many languages have multiple templates.

Wiktionary templates also support other alternative forms:

  • German declension. A German noun changes form depending on its gender, number, and case (nominative, accusative, dative, or genitive), which reflects its role in the sentence. German nouns can have numerous endings: -er, -e, -es, -e (nominative), -en, -e, -es, -e (accusative), -em, -er, -em, -en (dative), -es, -er, -es, -er (genitive).
  • Dutch diminutive endings. A Dutch noun changes its ending based on whether it’s small, countable, and other such nuances. For example, huisje is a small huis, and colaatje is a glass of cola.
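A dictionary of alternative forms can be pictured as a lookup from every known form to one canonical form. The sketch below is an illustration under stated assumptions: the entries are hand-written stand-ins for data that would come from Wiktionary, and the helper names are hypothetical.

```python
# Hypothetical, hand-written entries standing in for Wiktionary-derived data.
ALTERNATIVE_FORMS = {
    "car": ["cars"],
    "foot": ["feet"],
    "huis": ["huisje"],  # Dutch diminutive
}

# Invert the table so that every known form points to its canonical form.
CANONICAL = {}
for base, forms in ALTERNATIVE_FORMS.items():
    CANONICAL[base] = base
    for form in forms:
        CANONICAL[form] = base


def normalize(word: str) -> str:
    """Map a word to its canonical form; unknown words are only lowercased."""
    lowered = word.lower()
    return CANONICAL.get(lowered, lowered)


def same_word(a: str, b: str) -> bool:
    """Treat two words as equivalent when they share a canonical form."""
    return normalize(a) == normalize(b)


print(same_word("cars", "car"))    # True
print(same_word("feet", "foot"))   # True
print(same_word("car", "foot"))    # False
```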

Splitting compound words

Compound words refer to noun phrases (or nominal groups) that combine, without spaces, several words to form a single entity or idea.

An example is the German word “Hundehütte” (“dog house”).

The goal of decompounding is to index and search the individual words “Hund” and “Hütte” (“dog” and “house”) separately, thus improving the chance of a match.

For example, suppose a user searches for “Hütte für große Hunde” (“house for big dogs”), but your records only contain the term “Hundehütte”. Without decompounding, Algolia can’t match the query to these records. The query and records can only match if the compound word “Hundehütte” is also indexed in its split form.

This setting supports six languages:

  • Dutch (nl)
  • German (de)
  • Finnish (fi)
  • Danish (da)
  • Swedish (sv)
  • Norwegian Bokmål (no).
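As a hedged illustration, a settings payload that turns on decompounding for German might look like the following. The decompoundedAttributes name and its shape (language code mapped to a list of attributes) are assumptions about the settings API, and the attribute names are hypothetical. The check only confirms that the chosen language is one of the six listed above.

```python
# The six language codes listed above.
SUPPORTED_DECOMPOUND_LANGUAGES = {"nl", "de", "fi", "da", "sv", "no"}

settings = {
    "indexLanguages": ["de"],
    "queryLanguages": ["de"],
    # Assumed setting name and shape; "name" and "description" are hypothetical attributes.
    "decompoundedAttributes": {"de": ["name", "description"]},
}

# Flag any language code that falls outside the supported list.
unsupported = set(settings["decompoundedAttributes"]) - SUPPORTED_DECOMPOUND_LANGUAGES
print(sorted(unsupported))  # []
```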

Compound words are automatically split within:

Splitting compound words doesn’t alter the records sent to Algolia. Compound words aren’t replaced by the segmented version but indexed in two formats: as the full word and as the atoms.
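The "full word plus atoms" behavior can be sketched with a toy dictionary-based splitter. This isn't Algolia's algorithm: the lexicon, the list of linking elements, and the helper names are hypothetical, and only two-part compounds are handled.

```python
# Hypothetical mini-lexicon of German atoms (lowercased).
LEXICON = {"hund", "hütte", "haus", "tür"}
# A few common German linking elements; "" means no linker.
LINKERS = ("", "e", "s", "es", "n", "en")


def split_compound(word: str) -> list[str]:
    """Split a two-part compound into lexicon atoms, or return [] if none fit."""
    lowered = word.lower()
    for i in range(1, len(lowered)):
        head, tail = lowered[:i], lowered[i:]
        if tail not in LEXICON:
            continue
        for linker in LINKERS:
            if not head.endswith(linker):
                continue
            stem = head[: len(head) - len(linker)]
            if stem in LEXICON:
                return [stem, tail]
    return []


def index_tokens(word: str) -> set[str]:
    """Keep the full word and add its atoms, as described above."""
    return {word.lower(), *split_compound(word)}


record_tokens = index_tokens("Hundehütte")
query_tokens = set("Hütte für große Hunde".lower().split())
print(record_tokens & query_tokens)  # {'hütte'}
```

In this toy version only “hütte” matches. Matching “Hunde” with “Hund” would additionally rely on plural handling.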

Word segmentation

In some logographic languages, words in queries or sentences aren’t separated by spaces, as they are in Latin-script languages. The reader distinguishes each word based on the context. Since Algolia’s relevance relies on matching words in the query with words in the records, the engine must identify which characters make up each word in a given query.

For example, “長い赤いドレス” in Japanese means “long red dress”. When receiving this query, Algolia segments it into its composing words “長い” (long), “赤い” (red), and “ドレス” (dress). The same segmentation happens on the records, ensuring a great match and relevance for Japanese queries.

Algolia supports segmentation in Chinese (zh) and Korean (only at query time), and in Japanese (ja) at both query and indexing time. You must set the queryLanguages and indexLanguages settings to the relevant language code to ensure this segmentation applies.
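To make segmentation concrete, here is a minimal sketch using greedy longest-match against a dictionary. This is one simple, well-known technique and not necessarily what the engine does. The dictionary and the segment helper are hypothetical, and the settings values assume Japanese content.

```python
# Hypothetical mini-dictionary; real segmenters rely on large lexicons or statistical models.
DICTIONARY = {"長い", "赤い", "ドレス", "長", "赤"}


def segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(map(len, dictionary))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            if candidate in dictionary or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("長い赤いドレス", DICTIONARY))
# ['長い', '赤い', 'ドレス']

# Settings that declare Japanese for both records and queries.
ja_settings = {"indexLanguages": ["ja"], "queryLanguages": ["ja"]}
```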

Japanese transliteration and type-ahead

The Japanese language uses three writing systems: Kanji, Hiragana, and Katakana. When typing a query in Japanese, users first type its pronunciation in Hiragana and then convert it to Katakana or Kanji if relevant.

To ensure relevant results as soon as users start typing, not just when the query is complete, Algolia indexes Japanese words in both their original form and in Hiragana.

Transliteration is only available in Japanese (ja). To apply it, set the indexLanguages setting to ja. You can limit transliteration to some attributes or turn it off with the attributesToTransliterate setting.
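The effect of indexing both forms can be sketched for the simplest case, Katakana, where each letter maps to Hiragana by a fixed Unicode offset. Kanji would need a dictionary of readings, which this sketch doesn't attempt. The helper names and the "name" attribute are hypothetical.

```python
def katakana_to_hiragana(text: str) -> str:
    """Shift Katakana letters (U+30A1 to U+30F6) down by 0x60 to their Hiragana twins."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def index_forms(word: str) -> set[str]:
    """Keep the original form and its Hiragana form, mirroring the dual indexing described above."""
    return {word, katakana_to_hiragana(word)}


def prefix_match(typed: str, word: str) -> bool:
    """True if what the user has typed so far is a prefix of any indexed form."""
    return any(form.startswith(typed) for form in index_forms(word))


print(katakana_to_hiragana("ドレス"))   # どれす
print(prefix_match("どれ", "ドレス"))   # True

# Settings sketch: limit transliteration to one (hypothetical) attribute.
settings = {"indexLanguages": ["ja"], "attributesToTransliterate": ["name"]}
```

Because the Hiragana form is indexed alongside the original, a query typed as “どれ” can already match a record containing “ドレス” before the user converts the input.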

Multiple conjugations can end up with the same transliteration.

You can use this feature with Query Suggestions to ensure Japanese users start seeing suggestions from the first keystrokes.
