What is normalization?
By normalization, we mean the following transformations:
- Switch all characters to lower case
- Remove all diacritics (e.g., accents)
- Remove punctuation within words (e.g., apostrophes)
- Manage punctuation between words
- Handle word separators (such as spaces, but not only spaces)
- Convert Traditional Chinese characters to their modern counterparts
When do we normalize?
Normalization is done at both indexing and query time, ensuring consistency in how your data is represented as well as matched.
You cannot disable normalization. Additionally, there is no language-specific normalization: what we do for one language, we do for all. Our normalization process is therefore language-agnostic.
Character-based (UTF) normalization
We use Unicode (UTF-16), which handles every known language. So when we say we normalize your data, we are referring to the process of reducing the full UTF-16 character set to a smaller, more consistent subset of Unicode characters.
For more detail, check out how we process Unicode.
By default, the Algolia search engine normalizes characters by transforming them to their lowercase counterparts and stripping their diacritics. For example, é becomes e, and で becomes て. This default behavior, however, is a problem for several languages; therefore, you can use the keepDiacriticsOnCharacters setting to disable automatic normalization for a given set of characters.
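The lowercasing and diacritic-stripping steps can be approximated with Unicode decomposition. This is a minimal sketch for illustration, not the engine's actual implementation; the `keep_diacritics_on` parameter only mimics the spirit of keepDiacriticsOnCharacters.

```python
import unicodedata

def normalize(text, keep_diacritics_on=frozenset()):
    """Lowercase and strip combining marks, sparing exempted characters."""
    out = []
    for ch in text.lower():
        if ch in keep_diacritics_on:
            out.append(ch)  # exempted, like keepDiacriticsOnCharacters
            continue
        # Decompose (NFD), drop combining marks, then recompose (NFC).
        decomposed = unicodedata.normalize("NFD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(unicodedata.normalize("NFC", stripped))
    return "".join(out)

print(normalize("Crème Brûlée"))  # creme brulee
print(normalize("で"))             # て (the dakuten is a combining mark)
print(normalize("Crème", {"è"}))  # crème
```

Note that some characters (such as ø) have no Unicode decomposition, so a real implementation needs an explicit mapping on top of this.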
The engine uses spaces (among other techniques) to spot words. However, not every language relies exclusively on spacing to separate words.
We have learned that spacing is a fairly reliable method of word detection (tokenization). For many languages, it is a near-perfect tokenizer. Where it falls short, the problem is often that it does not go far enough: while the majority of words are indeed detected, some words within compound words are not.
As you will see, we improve word detection with dictionaries. Some languages concatenate and compound words (agglutination), and others string words together without using spaces (CJK). With the use of dictionaries, we can spot the words within words. Each case is different, and we discuss these differences below.
The following techniques, like normalization, are engine-based and therefore cannot be disabled.
- Splitting: We split words when they are combined: “jamesbrown” will match with “James Brown”. We do this only if there is no typo.
- Concatenation: We combine words that are separated by a space: “entert ainment” will match with “entertainment”. We do this only if there is no typo.
- Acronyms & Hyphenated words
- For acronyms: we consider D.N.A. the same as DNA.
- For hyphenated words:
- “off-campus” will be transformed into “off” “campus” and “offcampus”.
- “a.to_json” will be transformed into “ato_json” and “to_json”.
- These techniques follow two simple rules:
- If letters are separated by separators, consider the concatenation of those letters without the separators in the middle (D.N.A. → DNA).
- If each separated component is three or more letters, also consider each component as a standalone word (off-campus → off + campus + offcampus).
Normalization for Logogram-based languages (CJK)
Some languages do not use spaces to delimit words. Without words, search is limited to sequential, character-based matching. This is a serious limitation, as it rules out some important and basic search features, such as inverse word matching (“red shirt” / “shirt red”), non-contiguous word matching (“chocolate cookies” finding “chocolate chip cookies”), the use of ANDs and ORs, and Query Rules.
For CJK logograms, therefore, we needed to come up with a solution. We essentially follow a two-part process, the first of which relies on the ICU (International Components for Unicode) library, which is based on the MeCab dictionary, enriched with data from Wiktionary.
The process is as follows:
- We use the dictionary to find the words.
- If that doesn’t work, we fall back to a sequential, character-based search. While far from perfect, it’s an acceptable fallback.
As this logic is part of the engine, it cannot be disabled. The logic is only triggered when the engine detects a CJK character.
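The two-step process above can be illustrated with a greedy longest-match segmentation that falls back to single characters. The real engine relies on ICU; the dictionary, window size, and function name below are made up for the example.

```python
def segment(text: str, dictionary: set[str], max_len: int = 8) -> list[str]:
    """Greedy longest-match segmentation with a single-character fallback."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a 1-character slice always matches,
        # which is the sequential character-based fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

words = {"東京", "大学"}
print(segment("東京大学", words))  # ['東京', '大学']
print(segment("東京犬", words))    # ['東京', '犬'] (犬 via the fallback)
```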
Using a language-specific dictionary for CJK words
The engine can detect when a user is entering CJK characters, but it cannot detect the exact CJK language.
This means that, whenever CJK is detected, the engine will apply a generic CJK logic to separate logograms.
This is fine in many cases, but if you want the engine to go further and apply language-specific dictionaries, you’ll need to use the queryLanguages setting.
For example, with queryLanguages, you can specify Chinese (“zh”) in the first position to ensure the use of a Chinese dictionary in finding words.
Note that you can change the dictionary dynamically with each search, enabling multi-lingual support.
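As a sketch, a search request with the Python API client might look like the following. The index name and query are hypothetical, and the live call is commented out because it requires Algolia credentials.

```python
# Assumes the algoliasearch Python client; "products" is a hypothetical index.
params = {
    # "zh" first: prefer the Chinese dictionary when finding words.
    "queryLanguages": ["zh", "en"],
}
# results = index.search("連身裙", params)  # requires a live Algolia index
print(params["queryLanguages"][0])  # zh
```

Because the parameter is sent with each query, a different language can be put first per request, which is what enables the multi-lingual support mentioned above.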
Converting Traditional Chinese characters to Standard Chinese
As part of the normalization process, all Traditional characters are converted into their modern Unicode counterparts.
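Conceptually, this conversion is a character-to-character mapping, as in the sketch below. The three entries are just for illustration; the engine's table covers the full repertoire.

```python
# Tiny illustrative subset of a Traditional -> modern mapping table.
TRAD_TO_MODERN = {"學": "学", "習": "习", "體": "体"}

def to_modern(text: str) -> str:
    # Characters without an entry (already modern) pass through unchanged.
    return "".join(TRAD_TO_MODERN.get(ch, ch) for ch in text)

print(to_modern("學習"))  # 学习
```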
Normalization for Arabic Languages
Arabic languages make extensive use of diacritics to hint at pronunciation. Yet it’s not uncommon to omit them when typing, which can hurt searches against text that contains them. While we usually ignore diacritics by default, these are different: the Unicode Standard considers them full-fledged characters.
We do advanced processing to ignore these diacritics in both indices and queries, so that searching with or without them yields the same results. So far, only the most common of these diacritics are handled this way.
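As an illustration, stripping the common tashkeel marks can be sketched as below. The exact set the engine ignores isn't reproduced here; the U+064B–U+0652 range is an assumption for the example.

```python
# Assumed set for illustration: the tashkeel combining marks U+064B..U+0652
# (fathatan through sukun). The engine's actual list may differ.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_diacritics(text: str) -> str:
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)

print(strip_diacritics("كِتَاب"))  # كتاب
```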
Equivalence between Arabic and Persian
Arabic and Persian characters share many similarities, so much so that it makes sense for some users to search inside an Arabic index using Persian letters. Yet two pairs of characters are considered different by the Unicode Standard: ك and ک, as well as ي and ی. By considering those letters as the same (as we now do), we allow users to search a Persian index while typing on an Arabic keyboard layout, and vice versa.
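Treating these pairs as equivalent amounts to mapping one variant onto the other before matching, as in this sketch (the function name and the chosen direction are illustrative):

```python
# Map the Persian variants onto their Arabic counterparts (U+06A9 -> U+0643,
# U+06CC -> U+064A) so both keyboard layouts produce the same normalized form.
PERSIAN_TO_ARABIC = {"ک": "ك", "ی": "ي"}

def unify(text: str) -> str:
    return "".join(PERSIAN_TO_ARABIC.get(ch, ch) for ch in text)

print(unify("کتاب") == unify("كتاب"))  # True
```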