Splitting and concatenation
Splitting and concatenation are key relevance features we use to help your users find what they are looking for. You can enable these features via the
typoTolerance parameter, and like typo tolerance, they let your users find results, even if what they are searching for does not match your records exactly.
If a user searches for parkbench, splitting allows for a match on park bench. Concatenation is when we combine words that are separated by a space: it allows nano second to match with nanosecond.
To get a deep understanding of these features, we recommend reading about tokenization first.
Splitting is a technique we apply only at query time. For each non-separator token in a query, we try to split the token into two parts at each possible position. We do this up until the twelfth character, meaning that the first part can be up to 12 characters long. The second part can be any length.
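For example, we split the query Katherinejohnson into the following pairs of tokens: k and atherinejohnson, ka and therinejohnson, and so on, up to katherinejoh and nson, where the first part reaches 12 characters.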
For both parts, we search the indexed tokens for matches. If a “valid” split is found, that is, one whose parts both match tokens in the index, it is kept as an alternative to the original query. In this example, since an index may have the tokens katherine and johnson, but not katherinejohnson, these two parts are kept as terms to search on.
We only split query words into two parts, never more. For example, the query jamesearljones is split into jamesearl and jones, but not into the three tokens james, earl, and jones. We limit splitting to one split per query word for performance reasons.
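The splitting rules above can be sketched in a few lines of Python. This is only an illustration of the description, not the engine's implementation: the function names, the in-memory set standing in for the index, and the lowercasing of the query are all assumptions made for the example.

```python
MAX_PREFIX_LEN = 12  # the first part of a split is at most 12 characters long


def candidate_splits(token, max_prefix_len=MAX_PREFIX_LEN):
    """Return every (first, second) pair obtained by cutting the token once."""
    last_cut = min(max_prefix_len, len(token) - 1)
    return [(token[:i], token[i:]) for i in range(1, last_cut + 1)]


def valid_splits(token, indexed_tokens):
    """Keep only the splits whose two parts both exist in the index."""
    # Lowercasing is an assumption of this sketch, not a documented rule.
    return [
        (first, second)
        for first, second in candidate_splits(token.lower())
        if first in indexed_tokens and second in indexed_tokens
    ]


# Hypothetical index contents, chosen to mirror the examples in the text.
indexed = {"katherine", "johnson", "james", "earl", "jones", "jamesearl"}

print(valid_splits("Katherinejohnson", indexed))  # [('katherine', 'johnson')]
print(valid_splits("jamesearljones", indexed))    # [('jamesearl', 'jones')]
```

Because each token is cut only once, the three-way split james, earl, jones is never produced, even though all three tokens are in the hypothetical index.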
At indexing time
The engine performs some concatenation at indexing time. This happens during tokenization.
We use the following separators to concatenate at indexing time: period (.), apostrophe ('), and the registered (®) and copyright (©) symbols. This covers the most typical concatenation use cases, such as acronyms (e.g., B.C.E.) and contractions (e.g., wasn't).
For example, the text hello.world forms the tokens hello, ., and world, and due to concatenation, helloworld. Because . is a separator, we do not index it by default.
If non-separator tokens created from concatenation are less than three characters long, we also do not index them. For example, wasn't yields the tokens wasn, ', t, and wasnt, but we only index wasn and wasnt. Likewise, for B.C.E. we index BCE but not B, C, or E.
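As a rough illustration of the indexing-time rules above, the following Python sketch splits a single word on the concatenation separators, adds the concatenated form, and drops anything shorter than three characters. The function names are hypothetical, case normalization is left out, and the special handling of numbers described later on this page is deliberately ignored.

```python
CONCAT_SEPARATORS = ".'®©"  # separators used for indexing-time concatenation
MIN_INDEXED_LEN = 3         # shorter non-separator tokens are not indexed


def split_on_separators(word):
    """Split a word into its non-separator parts."""
    parts, current = [], ""
    for ch in word:
        if ch in CONCAT_SEPARATORS:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def alphabetic_index_tokens(word):
    """Return the tokens that would be indexed for a purely alphabetic word."""
    parts = split_on_separators(word)
    tokens = list(parts)
    if len(parts) > 1:
        tokens.append("".join(parts))  # the concatenated form
    # Separators are never included; short tokens are filtered out here.
    return [t for t in tokens if len(t) >= MIN_INDEXED_LEN]


print(alphabetic_index_tokens("hello.world"))  # ['hello', 'world', 'helloworld']
print(alphabetic_index_tokens("wasn't"))       # ['wasn', 'wasnt']
print(alphabetic_index_tokens("B.C.E."))       # ['BCE']
```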
At query time
At query time, we apply the same concatenation to queries as we do to records at indexing time. We also perform some additional concatenation on the query:
- Bi-gram concatenation: We concatenate adjacent pairs of tokens in the query string for the first 5 words.
- All-word concatenation: We concatenate all the words in the query, if the query has three or more words.
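We form these tokens from the query a wonderful day in the neighborhood:
- From initial tokenization: a, wonderful, day, in, the, neighborhood
- From bi-gram concatenation: awonderful, wonderfulday, dayin, inthe
- From all-word concatenation: awonderfuldayintheneighborhood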
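The two query-time rules can be sketched as follows. The function names are hypothetical, and the sketch reads "the first 5 words" as "adjacent pairs taken from the first five words", which is an interpretation rather than a documented detail. The restriction on digits described later on this page is not modeled here.

```python
BIGRAM_WORD_LIMIT = 5    # bi-grams are built from the first 5 words only
ALL_WORD_MIN_WORDS = 3   # all-word concatenation needs three or more words


def bigram_concatenations(words, limit=BIGRAM_WORD_LIMIT):
    """Concatenate each adjacent pair among the first `limit` words."""
    head = words[:limit]
    return [a + b for a, b in zip(head, head[1:])]


def all_word_concatenation(words):
    """Concatenate the whole query, if it has enough words."""
    return ["".join(words)] if len(words) >= ALL_WORD_MIN_WORDS else []


words = "a wonderful day in the neighborhood".split()
print(bigram_concatenations(words))
# ['awonderful', 'wonderfulday', 'dayin', 'inthe']
print(all_word_concatenation(words))
# ['awonderfuldayintheneighborhood']
```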
Special considerations for numeric characters
In most cases, the engine does not distinguish between alphabetic and numeric characters. For example, a query such as 5mm is tokenized as a single word.
We introduce some special behavior when numbers, separators, and concatenation interact.
Short-form concatenation with numbers
Short-form concatenation, for example, turning B.C.E. into only the token BCE, is handled differently with numbers: if the first character of a token is numeric, we do not concatenate it with adjacent tokens. So m.55 forms the token m55, while 5.mm forms the tokens 5 and mm.
The reason for this special behavior is to handle floating point numbers correctly. For example, you wouldn’t want 1.3GB to be tokenized as 13GB.
Because periods (.) denote a decimal point in numerical text, even short (1 or 2 character) non-separator tokens are indexed when numbers are involved. For example, 1.5 yields the tokens 1, ., and 5. Because . is a separator, we don’t index it by default.
Additionally, the phrase 3.GB yields the tokens 3 and GB, even though GB is not numerical. As long as one of the characters surrounding a separator is numeric, we index any surrounding, non-separator tokens, even if they are short (1 or 2 characters), or alphabetic. We do not index the concatenated token 3GB, because 3 is a number and we do not concatenate tokens beginning with numbers.
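The numeric exceptions can be added to a tokenization sketch like this. It is a simplified model under stated assumptions: the function names are hypothetical, and "numbers are involved" is approximated as "the part before the first separator starts with a digit". Cases where only a later part is numeric fall back to the alphabetic rule in this sketch, which may not match the engine in every situation.

```python
CONCAT_SEPARATORS = ".'®©"
MIN_INDEXED_LEN = 3


def split_on_separators(word):
    """Split a word into its non-separator parts."""
    parts, current = [], ""
    for ch in word:
        if ch in CONCAT_SEPARATORS:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def index_tokens(word):
    """Tokens indexed for one word, including the simplified numeric case."""
    parts = split_on_separators(word)
    if len(parts) > 1 and parts[0][0].isdigit():
        # Simplified numeric case: keep every part, however short,
        # and do not build the concatenated token.
        return parts
    tokens = parts + (["".join(parts)] if len(parts) > 1 else [])
    return [t for t in tokens if len(t) >= MIN_INDEXED_LEN]


print(index_tokens("1.5"))     # ['1', '5']
print(index_tokens("3.GB"))    # ['3', 'GB']
print(index_tokens("1.3GB"))   # ['1', '3GB']
print(index_tokens("B.C.E."))  # ['BCE']
```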
Bi-gram concatenation with numbers
We do not perform bi-gram concatenation on adjacent tokens when the first token ends with a digit and the second starts with a digit. This is because for queries like XC90 2020 Volvo, you wouldn’t want to search for XC902020.
This restriction on concatenation can lead to some unexpected behavior while searching for an ISBN (International Standard Book Number) or other hyphenated numbers. An ISBN is a hyphenated 13-digit identifier for books, for example 978-3-16-148410-0.
If you have indexed this as
9783161484100, even when a user searches for
978 3 16 148410 0, we return the appropriate record because of all-word concatenation. However, the query
978316148410 0 doesn’t return this record because we do not apply bi-gram concatenation for adjacent tokens ending and starting with numbers.
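The ISBN behavior described above can be reproduced with a small sketch of the query-time alternatives. The function names are hypothetical, and the sketch assumes that the digit restriction applies to bi-grams only and not to all-word concatenation, which is what the ISBN example suggests.

```python
def bigram_concatenations(words, limit=5):
    """Adjacent pairs from the first `limit` words, skipping digit-to-digit joins."""
    head = words[:limit]
    return [
        a + b
        for a, b in zip(head, head[1:])
        if not (a[-1].isdigit() and b[0].isdigit())
    ]


def all_word_concatenation(words):
    """Concatenate the whole query when it has three or more words."""
    return ["".join(words)] if len(words) >= 3 else []


def query_alternatives(query):
    """Every token this sketch would search for, for a given query."""
    words = query.split()
    return set(words) | set(bigram_concatenations(words)) | set(all_word_concatenation(words))


indexed_isbn = "9783161484100"

# Five words: bi-grams are blocked, but all-word concatenation rebuilds the ISBN.
print(indexed_isbn in query_alternatives("978 3 16 148410 0"))  # True

# Two words: bi-gram blocked, and too few words for all-word concatenation.
print(indexed_isbn in query_alternatives("978316148410 0"))     # False
```

Indexing several formats of the identifier avoids depending on which concatenation rule happens to apply to a given query.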
This is why we recommend indexing all possible formats of such identifiers. This allows users to find the desired record, regardless of the spacing and special characters they use when querying. You can read our guide on searching hyphenated attributes, such as SKUs, ISBNs, phone numbers, and serial numbers, to learn best practices for this use case.