As you may already know, we love having great search on the documentation we use daily. That’s the main reason we built DocSearch.
DocSearch is live on hundreds of documentation sites, including our own. We believe DocSearch is the easiest and fastest way to navigate technical documentation, so we invest a lot of time in improving it on a daily basis.
Improving your DocSearch
Once you set up DocSearch, there are three main areas for improvement:
- The structure
- The content
- The search itself (indexing and/or querying)
The list is ordered by importance: if you find a relevance issue on a DocSearch implementation, it’s usually due to either the structure or the content. Only in very few cases is it due to the search itself.
The camelCase issue
We just came across one of those search issues: camelCased words.
Camel case is the practice of writing compound words or phrases such that each word or abbreviation in the middle of the phrase begins with a capital letter.
If you are an Algolia user, you know that all our API parameters are camelCased: see our list of parameters.
Searching for parameters works, but we found it far from perfect.
Let me explain why.
Let’s take, for example, one of the parameters from our doc: snippetEllipsisText.
It’s one word, but you read it as three different words: “snippet Ellipsis Text”.
Looking at it split up, it makes sense to expect the search engine to be able to return results for the following queries:
- “snippetEllipsisText” (original name)
- “snippet Ellipsis Text” (split name)
- “Ellipsis” (middle word only)
- “EllipsisText” (two last words, not split)
- “EllipsisTex” (prefix query of “EllipsisText”)
- “Ellipsis Text” (two last words split up)
- “Ellipsis snippet” (split up, inverted first and second word)
There are a few queries for which you do not expect results:
- “EllipsisSnippet” (not split inverted first and second word)
- “TextEllipsis” (not split inverted second and third word)
In plain words we want to match:
- The exact parameter name (because people might copy/paste it from their code to learn more)
- Any combination of the parameter name’s sub-words, split up
- The exact parameter name with one or more leading sub-words omitted
One of the great features of Algolia is the highlighting. We describe in detail how it works in a previous blog post.
So we also expect highlighting to work correctly when searching for camel case: if I search “ellip”, I expect to see “snippetEllipsisText” in the result.
Until now, we were handling only:
- “snippetEllipsisText” (the basic one)
- “snippet Ellipsis Text” because the engine tries to concatenate the query.
There will be a few search inputs like the one just below throughout the post for you to try and understand the process. Those inputs search all Algolia parameters (at the time of writing).
Working queries: “snippetEllipsisText”, “snippet Ellipsis Text”
Not working queries: “Ellipsis”, “EllipsisText”, “EllipsisTex”, “Ellipsis Text”, “Ellipsis snippet”
As you can see from the examples above, that’s 2 out of 7 working, which we can agree is bad.
Why we get those results
Understanding why we are handling so few queries out of the box is the key to fixing it properly – let’s dive in. Algolia is doing prefix matches only (more details in this article). It’s one of the reasons Algolia is able to search so fast, but for our camel case use case it’s preventing us from searching in the middle of the word. So we had to find a way around that.
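To make the limitation concrete, here is a deliberately simplified toy model of prefix matching. It is not how the Algolia engine is implemented (there is no typo tolerance, ranking, or query concatenation here), and the name prefix_match is hypothetical; it only shows why a query that starts in the middle of an indexed word finds nothing.

```python
def prefix_match(query, indexed_words):
    """Return True if every query word is a prefix of some indexed word.

    Toy model only: case-insensitive, no typo tolerance, no ranking.
    """
    words = [w.lower() for w in indexed_words]
    return all(
        any(w.startswith(q.lower()) for w in words)
        for q in query.split()
    )


index = ["snippetEllipsisText"]  # the parameter indexed as a single word

print(prefix_match("snippetEll", index))  # True: prefix of the indexed word
print(prefix_match("Ellipsis", index))    # False: starts mid-word
```

In this model, indexing the three sub-words separately is enough for “Ellipsis” to match, which is the idea explored next.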
The iterative process to fix it
Indexing the split content
Since we want to be able to search the middle of our camelCaseWords, we knew we had to index them as “camel Case Words”, basically “uncamelizing” the content.
So we started to look for existing libraries doing that (in Python, because the DocSearch scraper is built with Python).
We found the stringcase library, which has a sentencecase function that does the job of “uncamelizing”, but there are two issues with such a library:
- It works too well :). What I mean is that it uncamelizes everything: “API client” becomes “A P I client”. We don’t want that to happen, as the brain reads and understands it as “API client”, not “A P I client”.
- A camelCasedWord in documentation is usually surrounded by text, and the library does not let us know which words got uncamelized in the process (more on why we need that information below).
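The first issue is easy to reproduce with a naive splitter. The sketch below is not the stringcase implementation; naive_uncamelize is a hypothetical stand-in that simply puts a space before every uppercase letter, which is enough to show the “A P I client” effect.

```python
import re


def naive_uncamelize(text):
    """Insert a space before every uppercase letter that follows a character."""
    return re.sub(r"(?<=.)([A-Z])", r" \1", text)


print(naive_uncamelize("snippetEllipsisText"))  # snippet Ellipsis Text
print(naive_uncamelize("API client"))           # A P I client
```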
So we had to write our own:
    def _uncamelize_word(word):
        s = ""
        for i in range(len(word)):
            # Add a space before an uppercase letter that follows a
            # non-uppercase alphanumeric character.
            if i > 0 and word[i].isupper() and \
                    word[i - 1].isalnum() and \
                    not word[i - 1].isupper():
                s += " "
            s += word[i]
        if s != word:
            pass  # the word was uncamelized
        return s

    def uncamelize_string(string):
        return ' '.join(_uncamelize_word(word) for word in string.split())

If an uppercase letter is preceded by a non-capital alphanumeric character, we add a space. Fairly simple.
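As a usage sketch, here is a self-contained Python 3 variant of the same rule. Returning the list of words that changed is an assumption about how the “which word was uncamelized” information could be surfaced; the scraper may track it differently.

```python
def uncamelize_word(word):
    """Split a camelCased word on lower-to-upper boundaries."""
    out = ""
    for i, char in enumerate(word):
        if i > 0 and char.isupper() and word[i - 1].isalnum() \
                and not word[i - 1].isupper():
            out += " "
        out += char
    return out


def uncamelize_string(string):
    """Return the uncamelized string and the list of words that changed."""
    pieces, changed = [], []
    for word in string.split():
        split_word = uncamelize_word(word)
        if split_word != word:
            changed.append(word)  # remember the original camelCased word
        pieces.append(split_word)
    return " ".join(pieces), changed


text, changed = uncamelize_string("Set snippetEllipsisText in the API client")
print(text)     # Set snippet Ellipsis Text in the API client
print(changed)  # ['snippetEllipsisText']
```

Note that “API” is left alone, because each of its capitals is preceded by another capital.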
With this in place:
- “snippet Ellipsis Text” gives the expected results
- we can now search in the middle of the camelCasedWord
- we now have a display issue when looking for “snippet Ellipsis Text”
- “snippetEllipsisText” is not returning results anymore
- we are still not able to get results for “EllipsisTex”
- we can know exactly which word in a sentence was camel cased
Working queries: “Ellipsis”, “EllipsisText”, “Ellipsis Text”, “Ellipsis snippet”, “snippet Ellipsis Text”
Not working queries: “snippetEllipsisText”, “EllipsisTex”
That’s 5 out of 7 working: better, but still not perfect.
Fixing the remaining issues
The display issue
As mentioned, we now have a display issue. The content shown in the search result for the query “snippet Ellipsis Text” is the split form, not the “snippetEllipsisText” that appears in the documentation and that you expect to see in the result.
We came up with a nice trick: we looked for an invisible Unicode character, \u2063 (there are others, but this one does the job), to use in place of the space. The engine still considers snippetEllipsisText as several words, while the result displays snippetEllipsisText because the separator is not visible in a browser.
The _uncamelize_word function code now looks like:
    def _uncamelize_word(word):
        s = ""
        for i in range(len(word)):
            # Same rule as before, but with an invisible separator
            # instead of a visible space.
            if i > 0 and word[i].isupper() \
                    and not word[i - 1].isupper() \
                    and word[i - 1].isalnum():
                s += u"\u2063"
            s += word[i]
        if s != word:
            pass  # the word was uncamelized
        return s
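A quick way to see what the separator does to the string itself is the following self-contained sketch (Python 3; the SEPARATOR constant and the separator argument are illustrative additions). It only inspects the string; that the engine treats U+2063 as a word boundary is what the trick above relies on.

```python
SEPARATOR = "\u2063"  # Unicode INVISIBLE SEPARATOR


def uncamelize_word(word, separator=SEPARATOR):
    """Insert the separator on lower-to-upper boundaries."""
    out = ""
    for i, char in enumerate(word):
        if i > 0 and char.isupper() and word[i - 1].isalnum() \
                and not word[i - 1].isupper():
            out += separator
        out += char
    return out


indexed = uncamelize_word("snippetEllipsisText")

# The sub-words are individually recoverable...
print(indexed.split(SEPARATOR))  # ['snippet', 'Ellipsis', 'Text']
# ...the string is two characters longer than the original...
print(len("snippetEllipsisText"), len(indexed))  # 19 21
# ...and stripping the separator gives back the original name.
print(indexed.replace(SEPARATOR, "") == "snippetEllipsisText")  # True
```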
Last but not least: the no result issue for “snippetEllipsisText” and “EllipsisT”
Searching for “snippetEllipsisText” no longer brings any result, since the index no longer contains the word snippetEllipsisText.
Searching for “EllipsisTex” does not work because the word “EllipsisText” is not indexed: we indexed “Ellipsis” and “Text”, but not “EllipsisText”.
Note that “EllipsisText” returns the expected result because it’s one typo away from “Ellipsis Text”; the same goes for “EllipsisT”. That’s better, but we would rather have the engine consider it a match with 0 typos.
Fortunately the Algolia engine has a handy synonym feature.
First things first: “snippetEllipsisText”
We can just add a one-way synonym:
snippetEllipsisText => snippet Ellipsis Text
Then, for “EllipsisT”, what we ultimately want is another one-way synonym:
EllipsisText => Ellipsis Text
But we need this to be generic. To summarize, we want to:
- create a synonym for the complete name,
- remove the first sub-word and create a new synonym,
- iterate until only one sub-word remains.
The following schema should help you understand:
Let’s consider “snippetEllipsisText” as “A B C”; we are going to create the following one-way synonyms:
ABC => A B C
BC => B C
C => C (we actually don’t need this one, as it’s already handled by the initial splitting)
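The schema above can be sketched in a few lines. This is an illustration rather than the final code: one_way_synonyms is a hypothetical helper, and it returns plain (input, replacement) pairs instead of the objects that would actually be sent to the engine.

```python
def one_way_synonyms(sub_words):
    """Build (input, replacement) pairs for every suffix of two or more sub-words."""
    pairs = []
    # Drop one leading sub-word per iteration; stop before a single word remains.
    for start in range(len(sub_words) - 1):
        suffix = sub_words[start:]
        pairs.append(("".join(suffix), " ".join(suffix)))
    return pairs


for source, target in one_way_synonyms(["snippet", "Ellipsis", "Text"]):
    print(source, "=>", target)
# snippetEllipsisText => snippet Ellipsis Text
# EllipsisText => Ellipsis Text
```

The single-word case (“C => C”) is skipped on purpose, since the split content already covers it.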
You can have a look at the final code here.
Handling camel case seemed like an easy thing, but after having to handle it I can fairly say it’s not that simple after all, because it involves a lot of edge cases. The work we did here greatly improves the search for parameters in our doc, and the search for all DocSearch implementations that are already live.
One area where DocSearch doesn’t shine yet is searching generated API documentation from code, like JavaDoc, where camel case is omnipresent. This work is a big step toward making that available.