Why develop our own Unicode Library? – The Algolia Blog

At one time or another, most developers come across bugs or problems with Unicode (about 3,720,000 results on Google for the query "unicode bug developer" at the time of this writing). Let me tell you about my experience over the last decade and why we have now implemented our own Unicode library to produce exactly the same results across devices/languages.

I first started to use Unicode in 2004, when I was developing text mining software specializing in information extraction. This software was fully implemented in C++, and I used IBM's ICU library to be Unicode compliant (all strings were stored in UTF-16). I also used some of ICU's normalization functions based on decomposition, but I did not notice any major problem at that time. I started to understand the dark side of Unicode later, when I used it in other languages like Java, Python, and later Objective-C. My first surprise came when I understood that a simple isAlpha(unicodechar c) method can return different results depending on the library!

I started to look at the standard in detail and downloaded UnicodeData.txt (the file that contains most of the information about the standard; the latest version is published by the Unicode Consortium).

This file contains descriptions of all Unicode characters. The third column represents the “General Category” and is documented as follows:

 

General Categories

The values in this field are abbreviations for the following. Some of the values are normative, and some are informative. For more information, see the Unicode Standard.

Normative Categories

    • Lu: Letter, Uppercase
    • Ll: Letter, Lowercase
    • Lt: Letter, Titlecase
    • Mn: Mark, Non-Spacing
    • Mc: Mark, Spacing Combining
    • Me: Mark, Enclosing
    • Nd: Number, Decimal Digit
    • Nl: Number, Letter
    • No: Number, Other
    • Zs: Separator, Space
    • Zl: Separator, Line
    • Zp: Separator, Paragraph
    • Cc: Other, Control
    • Cf: Other, Format
    • Cs: Other, Surrogate
    • Co: Other, Private Use
    • Cn: Other, Not Assigned (no characters in the file have this property)

Informative Categories

    • Lm: Letter, Modifier
    • Lo: Letter, Other
    • Pc: Punctuation, Connector
    • Pd: Punctuation, Dash
    • Ps: Punctuation, Open
    • Pe: Punctuation, Close
    • Pi: Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
    • Pf: Punctuation, Final quote (may behave like Ps or Pe depending on usage)
    • Po: Punctuation, Other
    • Sm: Symbol, Math
    • Sc: Symbol, Currency
    • Sk: Symbol, Modifier
    • So: Symbol, Other

As you can see, there are quite a lot of categories. Some of them are very easy to understand, like “Lu” (Letter, Uppercase) and “Ll” (Letter, Lowercase), but others are more complex, like “Lo” (Letter, Other) and “No” (Number, Other), and this is exactly where the first problem begins.

Let’s take the Unicode character U+00BD (½) as an example. It is commonly used in spare-part descriptions and is categorized as “No”… except that some Unicode libraries do not consider it a number and return false from their isNumber(unicodeChar) method (e.g., in Objective-C).
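As a small illustration of how much room for interpretation there is, here is a sketch using Python's standard `unicodedata` module (Python is chosen only for illustration; it is not the library discussed in this post). Even within a single language, the three built-in "is this a number?" predicates disagree about ½:

```python
import unicodedata

half = "\u00bd"  # VULGAR FRACTION ONE HALF

# General Category as recorded in the Unicode Character Database
category = unicodedata.category(half)
print(category)                    # No

# Three built-in predicates, three different definitions of "number"
print(half.isdecimal())            # False
print(half.isdigit())              # False
print(half.isnumeric())            # True
print(unicodedata.numeric(half))   # 0.5
```

Which of these counts as the "right" answer depends on what the caller means by a number, and that is exactly the part the standard leaves open.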

In fact the two most used methods, isAlpha(unicodeChar) and isNumber(unicodeChar), are not directly defined by the Unicode standard and are subject to interpretation.

The consequence is that results are not the same across devices/languages! In our case this is a problem because our compiled index is portable, and we want to have exactly the same results on different devices/languages.
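One way to remove the ambiguity is to pin the definition to the General Category explicitly. The sketch below is only an assumption about what such a definition could look like: the function names `is_alpha` and `is_number` and the choice of categories are hypothetical, not the ones used in our library.

```python
import unicodedata

# Assumed definitions: every "Letter" category is alphabetic,
# every "Number" category is numeric.
ALPHA_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo"})
NUMBER_CATEGORIES = frozenset({"Nd", "Nl", "No"})

def is_alpha(ch: str) -> bool:
    """Return True if the character's General Category is a Letter category."""
    return unicodedata.category(ch) in ALPHA_CATEGORIES

def is_number(ch: str) -> bool:
    """Return True if the character's General Category is a Number category."""
    return unicodedata.category(ch) in NUMBER_CATEGORIES

print(is_number("\u00bd"))  # True  (category No)
print(is_alpha("\u00e9"))   # True  (category Ll)
print(is_alpha("1"))        # False (category Nd)
```

Note that `unicodedata` is tied to the Unicode version bundled with the interpreter, so even this definition is only portable if the category table itself ships with the code.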

However, this is not the only problem! Unicode normalization is also a tricky topic. The Unicode standard defines a way to decompose characters (the Character Decomposition Mapping); for example, U+00E0 (à) is decomposed as U+0061 (a) + U+0300 ( ̀). But most of the time you do not want a decomposition but a normalization: the most basic form of a string (lowercase, without accents, marks, …). This is key to being able to search and compare words. For example, the French word “Hétérogénéité” is normalized as “heterogeneite”.
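The decomposition of à can be checked with Python's standard `unicodedata` module (used here purely for illustration):

```python
import unicodedata

a_grave = "\u00e0"  # LATIN SMALL LETTER A WITH GRAVE

# Raw decomposition mapping from the Unicode Character Database
mapping = unicodedata.decomposition(a_grave)
print(mapping)      # 0061 0300

# Canonical decomposition (NFD) applies that mapping to a whole string
decomposed = unicodedata.normalize("NFD", a_grave)
code_points = [f"U+{ord(c):04X}" for c in decomposed]
print(code_points)  # ['U+0061', 'U+0300']
```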

To compute this normalized form, most people compute the lowercase form of a word (well defined by the Unicode standard), then compute the decomposed form, and finally remove all the diacritics. However, this is not enough. Normalization cannot always be reduced to removing marks. For example, the standard German letter ß is widely replaced by, and understood as, “ss” (enter ß in your favorite web search engine and you will discover that it also searches for “ss”). The problem is that there is no decomposition for “ß” in the Unicode standard, because it is not a letter with marks.
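The three-step approach described above can be sketched as follows in Python. The function name `three_step_normalize` is hypothetical, and treating every Mark category (Mn, Mc, Me) as a diacritic is an assumption made for this example. The second call shows the limitation: ß passes through unchanged.

```python
import unicodedata

def three_step_normalize(text: str) -> str:
    """Lowercase, decompose (NFD), then drop all combining marks."""
    lowered = text.lower()                               # step 1: lowercase
    decomposed = unicodedata.normalize("NFD", lowered)   # step 2: decomposition
    # step 3: remove characters whose General Category is a Mark (Mn, Mc, Me)
    return "".join(
        c for c in decomposed
        if not unicodedata.category(c).startswith("M")
    )

print(three_step_normalize("Hétérogénéité"))  # heterogeneite
print(three_step_normalize("Straße"))         # straße
```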

To solve that problem, we need to look at the Character Fallback Substitution table, which is not part of most Unicode library implementations. This substitution table defines that “ß” can be replaced by “ss”. There are plenty of other examples: for instance, U+0153 (œ) and U+00E6 (æ), letters used in French, can be replaced by “oe” and “ae”.
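A minimal sketch of how such a table can complement mark removal is shown below. The `FALLBACKS` dictionary holds only the three substitutions mentioned in this post; it is a hypothetical stand-in for the full table, and `normalize_with_fallback` is a name invented for the example.

```python
import unicodedata

# Tiny illustrative subset of a fallback substitution table (assumption:
# a real table would contain many more entries).
FALLBACKS = {"\u00df": "ss", "\u0153": "oe", "\u00e6": "ae"}

def normalize_with_fallback(text: str) -> str:
    """Lowercase, decompose, drop marks, then apply fallback substitutions."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    out = []
    for c in decomposed:
        if unicodedata.category(c).startswith("M"):
            continue                      # drop combining marks
        out.append(FALLBACKS.get(c, c))   # substitute letters without decomposition
    return "".join(out)

print(normalize_with_fallback("Straße"))  # strasse
print(normalize_with_fallback("Œuvre"))   # oeuvre
print(normalize_with_fallback("Ægir"))    # aegir
```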

In the end, this led us to implement our own Unicode library to ensure that our isAlpha(unicodechar) and isNumber(unicodechar) methods behave identically on all devices/languages, and to implement a normalize(unicodestring) method that incorporates the character fallback substitution table. By the way, our implementation of normalization is also far more efficient, because it works in one step instead of three (lowercase + decomposition + diacritics removal).
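To give an idea of what "one step" can mean, here is a hedged sketch in Python; it is not our actual implementation. It assumes the result of the three steps plus the fallback substitutions is precomputed once per code point into a lookup table, so that normalizing a string is a single pass. The table here only covers U+0000 to U+036F, and `FALLBACKS`, `NORMALIZATION_TABLE`, and `normalize` are names chosen for the example.

```python
import unicodedata

# Illustrative subset of the fallback substitutions (assumption)
FALLBACKS = {"\u00df": "ss", "\u0153": "oe", "\u00e6": "ae"}

def _slow_normalize_char(ch: str) -> str:
    """Reference three-step normalization plus fallbacks, for one character."""
    out = []
    for c in unicodedata.normalize("NFD", ch.lower()):
        if unicodedata.category(c).startswith("M"):
            continue
        out.append(FALLBACKS.get(c, c))
    return "".join(out)

# Built once: code point -> normalized replacement, only where it differs.
# Combining marks in the range map to "" and are therefore deleted.
NORMALIZATION_TABLE = {}
for cp in range(0x0370):
    ch = chr(cp)
    normalized = _slow_normalize_char(ch)
    if normalized != ch:
        NORMALIZATION_TABLE[cp] = normalized

def normalize(text: str) -> str:
    """Single pass over the string using the precomputed table."""
    return text.translate(NORMALIZATION_TABLE)

print(normalize("Hétérogénéité"))  # heterogeneite
print(normalize("Straße"))         # strasse
```

Characters outside the precomputed range pass through unchanged in this sketch; a complete table would need to cover every code point of interest.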

I hope you found this post useful and gained a better understanding of the Unicode standard and the limits of standard Unicode libraries. Feel free to leave comments or ask for clarification.

About the author
Julien Lemoine

Co-founder & former CTO at Algolia
