As you may already know, we love having great search on the documentation we use daily. That’s the main reason we built DocSearch.
DocSearch is live on hundreds of documentation sites, including our own. We believe DocSearch is the easiest and fastest way to navigate technical documentation. Because of that, we invest a lot of time in improving it on a daily basis.
Once you set up DocSearch, there are three main areas for improvement: the structure of the documentation, its content, and the search itself.
The list is ordered by importance, meaning that if you find a relevance issue on a DocSearch implementation it’s usually due to either the structure or the content. Then, in very few cases, it’s due to the search itself.
We just came across one of those search issues: camelCased words.
Camel case is the practice of writing compound words or phrases such that each word or abbreviation in the middle of the phrase begins with a capital letter.
If you are an Algolia user, you know that all our API parameters are camel-cased: see our list of parameters.
Searching for parameters works, but we found it far from perfect.
Let me explain why.
Let’s take, for example, one of the parameters from our docs: snippetEllipsisText.
It’s one word, but you read it as three different words: “snippet Ellipsis Text”.
Looking at it split up, it makes sense to expect the search engine to be able to return results for the following queries:
But also:
There are a few queries for which you do not expect results:
In plain words we want to match:
One of the great features of Algolia is the highlighting. We describe in detail how it works in a previous blog post.
So we also expect highlighting to work correctly when searching for camel case, meaning that if I search for “ellip” I expect to see “snippetEllipsisText” in the result.
Until now, we were handling only:
There will be a few search inputs like the one just below throughout this post for you to try and understand the process. Those inputs search across all Algolia parameters (at the time of writing).
Working queries: “snippetEllipsisText”, “snippet Ellipsis Text”
Not working queries: “Ellipsis”, “EllipsisText”, “EllipsisTex”, “Ellipsis Text”, “Ellipsis snippet”
As you can see from the examples above, that’s 2 out of 7 working, which we can agree is bad.
Understanding why we handle so few queries out of the box is the key to fixing it properly, so let’s dive in. Algolia does prefix matching only (more details in this article). It’s one of the reasons Algolia is able to search so fast, but for our camel case use case it prevents us from searching in the middle of a word. So we had to find a way around that.
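To make the limitation concrete, here is a toy sketch of prefix-only matching. It is not how the Algolia engine is implemented; `prefix_match` is a hypothetical helper that only illustrates why a query starting in the middle of an indexed word finds nothing.

```python
def prefix_match(query, indexed_words):
    """Return the indexed words that start with the query (case-insensitive).

    Toy model of prefix-only matching, for illustration only.
    """
    q = query.lower()
    return [word for word in indexed_words if word.lower().startswith(q)]


# The parameter is indexed as a single word.
indexed = ["snippetEllipsisText"]

print(prefix_match("snippetEll", indexed))  # ['snippetEllipsisText']
print(prefix_match("Ellipsis", indexed))    # []
```

With the word indexed as is, “Ellipsis” can never be a prefix of anything in the index, which is why the content itself has to be split.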
Since we want to be able to search in the middle of our camelCaseWords, we knew we had to index them as “camel Case Words”, basically “uncamelizing” the content.
So we started to look for existing libraries doing that (in Python, because the DocSearch scraper is built with Python).
We found the stringcase library, which has a sentencecase function which does the job of “uncamelizing”, but there are two issues with such a library:
So we had to write our own:
    def _uncamelize_word(word):
        # Insert a space before every capital letter that follows
        # a non-capital alphanumeric character.
        s = ""
        for i in range(len(word)):
            if i > 0 and word[i].isupper() and \
                    word[i - 1].isalnum() and \
                    not word[i - 1].isupper():
                s += " "
            s += word[i]
        if s != word:
            pass  # the word was uncamelized
        return s


    def uncamelize_string(string):
        # Uncamelize each whitespace-separated word independently.
        return ' '.join([_uncamelize_word(word) for word in string.split()])
If a capital letter is preceded by a non-capital alphanumeric character, we add a space. Fairly simple.
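The same rule can also be sketched with a regular expression. This is an alternative formulation rather than the scraper's code, and it assumes ASCII identifiers only: `uncamelize_regex` is a hypothetical name, and the character classes below do not cover non-ASCII letters.

```python
import re

# Zero-width position between a lowercase letter or digit and a capital letter.
# Assumption: ASCII-only identifiers.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def uncamelize_regex(text, separator=" "):
    """Insert `separator` at every camel case boundary in `text`."""
    return _CAMEL_BOUNDARY.sub(separator, text)


print(uncamelize_regex("snippetEllipsisText"))  # snippet Ellipsis Text
print(uncamelize_regex("getHTTPResponse"))      # get HTTPResponse
print(uncamelize_regex("API"))                  # API
```

Runs of capitals such as “HTTP” are left untouched, because a capital is only split off when the character before it is not a capital.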
With this in place:
But:
Working queries: “Ellipsis”, “EllipsisText”, “Ellipsis Text”, “Ellipsis snippet”, “snippet Ellipsis Text”
Not working queries: “snippetEllipsisText”, “EllipsisTex”
That’s 5 out of 7 working: better, but still not perfect.
The display issue
As mentioned, we now have a display issue. The search result for the query “snippet Ellipsis Text” shows the split form, not “snippetEllipsisText”, which is what appears in the documentation and what you expect to see in the result.
We came up with a nice trick. We looked for an invisible Unicode character, \u2063 (there are others, but this one does the job), to use as a replacement for the space. This makes the engine still consider snippetEllipsisText as several words while displaying snippetEllipsisText, because the separator is not visible in a browser.
The _uncamelize_word function code now looks like:
    def _uncamelize_word(word):
        # Same rule as before, but the separator is the invisible
        # Unicode character \u2063 instead of a space.
        s = ""
        for i in range(len(word)):
            if i > 0 and word[i].isupper() \
                    and not word[i - 1].isupper() \
                    and word[i - 1].isalnum():
                s += u"\u2063"
            s += word[i]
        if s != word:
            pass  # the word was uncamelized
        return s
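A short sketch of what the indexed value then contains. It only inspects the string itself; how the engine tokenizes on this character is taken from the description above, and the variable names are illustrative.

```python
import unicodedata

INVISIBLE_SEPARATOR = "\u2063"

# What gets indexed: the three parts joined by the invisible character.
indexed_value = INVISIBLE_SEPARATOR.join(["snippet", "Ellipsis", "Text"])

print(unicodedata.name(INVISIBLE_SEPARATOR))  # INVISIBLE SEPARATOR
print(len("snippetEllipsisText"))             # 19
print(len(indexed_value))                     # 21

# Splitting on the separator recovers the individual words...
print(indexed_value.split(INVISIBLE_SEPARATOR))  # ['snippet', 'Ellipsis', 'Text']

# ...and removing it gives back the original camel case word.
print(indexed_value.replace(INVISIBLE_SEPARATOR, "") == "snippetEllipsisText")  # True
```

The two extra characters are present in the data but take up no space when rendered, so the result reads like the original parameter name.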
Last but not least: the no result issue for “snippetEllipsisText” and “EllipsisT”
Searching for “snippetEllipsisText” no longer returns any result, since the index no longer contains the word snippetEllipsisText.
Searching for “EllipsisTex” does not work because the word “EllipsisText” is not indexed: we indexed “Ellipsis” and “Text”, but not “EllipsisText”.
Note that “EllipsisText” returns the expected result because it’s one typo away from “Ellipsis Text”; the same goes for “EllipsisT”. That’s better, but we would rather have the engine treat it as a match with zero typos.
Fortunately the Algolia engine has a handy synonym feature.
First things first: “snippetEllipsisText”
We can just add a one-way synonym:
snippetEllipsisText => snippet Ellipsis Text
Then for “EllipsisT”, what we want in the end is another one-way synonym:
EllipsisText => Ellipsis Text
But we need this to be generic. If we summarize we want to:
The following schema should help you understand:
Let’s consider “snippetEllipsisText” as “A B C”. We are going to create the following one-way synonyms:
ABC => A B C
BC => B C
C => C (we actually don’t need this one, as it’s already handled by the initial splitting)
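The generic rule can be sketched as follows. This is a minimal illustration of the “A B C” scheme, not the scraper's final code: `one_way_synonyms` is a hypothetical helper that returns plain (input, replacement) pairs and leaves out how they are sent to the engine.

```python
def one_way_synonyms(parts):
    """Build (input, replacement) pairs for every suffix of two or more parts.

    For parts A, B, C this gives ABC => A B C and BC => B C.
    The single-part suffix (C => C) is skipped on purpose.
    """
    pairs = []
    for start in range(len(parts) - 1):
        suffix = parts[start:]
        pairs.append(("".join(suffix), " ".join(suffix)))
    return pairs


for source, target in one_way_synonyms(["snippet", "Ellipsis", "Text"]):
    print(source, "=>", target)
# snippetEllipsisText => snippet Ellipsis Text
# EllipsisText => Ellipsis Text
```

A word with n parts produces n - 1 synonyms, so a word that was never split produces none.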
You can have a look at the final code here.
Final result:
Handling camel case seemed like an easy thing, but after having to handle it, I can fairly say it’s not that simple after all, because it involves a lot of edge cases. The work we did here greatly improves the search for parameters in our docs, and the search for all DocSearch implementations that are already live.
One area where DocSearch doesn’t shine yet is searching in API documentation generated from code, like JavaDoc, where camel case is omnipresent. This work is a big step forward in making that possible.
Maxime Locqueville
DX Engineering Manager