Oct 21st 2016 engineering
As you may already know, we love having great search on the documentation we use daily. That’s the main reason we built DocSearch.
DocSearch is live on hundreds of documentation sites, including our own. We believe DocSearch is the easiest and fastest way to navigate technical documentation. Because of that, we invest a lot of time in improving it on a daily basis.
Once you set up DocSearch, there are three main areas for improvement:
The list is ordered by importance, meaning that if you find a relevance issue on a DocSearch implementation it’s usually due to either the structure or the content. Then, in very few cases, it’s due to the search itself.
We just came across one of those search issues: camelCased words.
Camel case is the practice of writing compound words or phrases such that each word or abbreviation in the middle of the phrase begins with a capital letter.
If you are an Algolia user, you know that all our API parameters are camel cased: see our list of parameters.
Searching for parameters works, but we found it far from perfect.
Let me explain why.
Let’s take, for example, one of the parameters from our doc: snippetEllipsisText
It’s one word, but you read it as three different words: “snippet Ellipsis Text”
Looking at it split up, it makes sense to expect the search engine to be able to return results for the following queries:
But also:
There are a few queries where you are not expecting results:
In plain words we want to match:
One of the great features of Algolia is the highlighting. We describe in detail how it works in a previous blog post.
So we also expect highlighting to work correctly when searching for camel case: if I search “ellip”, I expect to see “snippetEllipsisText” in the result, with “Ellip” highlighted.
Until now, we were handling only:
There are a few search inputs like the one just below throughout this post for you to try and understand the process. Those inputs search inside all Algolia parameters (at the time of writing).
Working queries: “snippetEllipsisText”, “snippet Ellipsis Text”
Not working queries: “Ellipsis”, “EllipsisText”, “EllipsisTex”, “Ellipsis Text”, “Ellipsis snippet”
As you can see from the examples above, that’s 2 out of 7 working, which we can agree is bad.
Understanding why we handle so few queries out of the box is the key to fixing it properly, so let’s dive in. Algolia does prefix matching only (more details in this article). It’s one of the reasons Algolia is able to search so fast, but for our camel case use case it prevents us from searching in the middle of a word. So we had to find a way around that.
Since we want to be able to search the middle of our camelCaseWords, we knew we had to index them as “camel Case Words”, basically “uncamelizing” the content.
So we started to look for existing libraries doing that (in Python, because the DocSearch scraper is built with Python).
We found the stringcase library, which has a sentencecase function that does the job of “uncamelizing”, but there are two issues with such a library:
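To make the limitation concrete, here is a minimal sketch of word-prefix matching. It is a toy model, not Algolia's engine: `prefix_match` is a hypothetical helper that assumes records are split on whitespace and that a query word matches only if some record word starts with it.

```python
def prefix_match(query, record):
    """Return True if every query word is a prefix of some word in the record.

    Toy model of prefix-only matching; the real engine is far more elaborate.
    """
    record_words = record.lower().split()
    return all(
        any(word.startswith(q) for word in record_words)
        for q in query.lower().split()
    )


record = "snippetEllipsisText"
print(prefix_match("snippet", record))                    # True
print(prefix_match("Ellipsis", record))                   # False: mid-word
print(prefix_match("Ellipsis", "snippet Ellipsis Text"))  # True once split
```

Under this model, "Ellipsis" can only match when it starts a word of its own, which is why the content has to be split before indexing.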
So we had to write our own:
    def _uncamelize_word(word):
        # Insert a space before every capital letter that follows
        # a non-capital alphanumeric character.
        s = ""
        for i in range(len(word)):
            if i > 0 and word[i].isupper() and \
                    word[i - 1].isalnum() and \
                    not word[i - 1].isupper():
                s += " "
            s += word[i]
        if s != word:
            pass  # the word was uncamelized
        return s


    def uncamelize_string(string):
        # Uncamelize each whitespace-separated word independently.
        return ' '.join([_uncamelize_word(word) for word in string.split()])
If a capital letter is preceded by a non-capital alphanumeric character, we add a space. Fairly simple.
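A few sample inputs show where this rule does and does not split. The sketch below restates the same rule as a self-contained Python 3 function; `split_camel_case` is a name chosen for illustration and the sample words other than snippetEllipsisText are made up.

```python
def split_camel_case(word):
    """Add a space before each capital that follows a non-capital alphanumeric."""
    out = []
    for i, char in enumerate(word):
        prev = word[i - 1] if i > 0 else ""
        if char.isupper() and prev.isalnum() and not prev.isupper():
            out.append(" ")
        out.append(char)
    return "".join(out)


for sample in ["snippetEllipsisText", "getHTTPResponse", "v2Beta", "snake_Case"]:
    print(sample, "->", split_camel_case(sample))
# snippetEllipsisText -> snippet Ellipsis Text
# getHTTPResponse -> get HTTPResponse
# v2Beta -> v2 Beta
# snake_Case -> snake_Case
```

Runs of capitals such as "HTTP" stay together, and a capital after a non-alphanumeric character such as "_" is left alone, because in both cases the preceding character fails the "non-capital alphanumeric" test.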
With this in place:
But:
Working queries: “Ellipsis”, “EllipsisText”, “Ellipsis Text”, “Ellipsis snippet”, “snippet Ellipsis Text”
Not working queries: “snippetEllipsisText”, “EllipsisTex”
That’s 5 out of 7 working: better, but still not perfect.
The display issue
As mentioned, we now have a display issue. The content we show in the search result for the query “snippet Ellipsis Text” is not what you see in the documentation and expect in the search result: “snippetEllipsisText”.
We came up with a nice trick. We looked for an invisible Unicode character, \u2063 (there are others, but this one does the job), to use as a replacement for the space. This makes the engine still treat snippetEllipsisText as several words while displaying it as a single word, because the separator is not visible in a browser.
The _uncamelize_word function code now looks like:
    def _uncamelize_word(word):
        # Same rule as before, but the separator is the invisible
        # character U+2063 instead of a regular space.
        s = ""
        for i in range(len(word)):
            if i > 0 and word[i].isupper() \
                    and not word[i - 1].isupper() \
                    and word[i - 1].isalnum():
                s += u"\u2063"
            s += word[i]
        if s != word:
            pass  # the word was uncamelized
        return s
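The effect of the separator can be checked with a short sketch. It only inspects the string itself; that the engine treats \u2063 as a word boundary is the behavior described above, not something this snippet verifies. The names `SEPARATOR` and `indexed` are illustrative.

```python
import unicodedata

SEPARATOR = "\u2063"

# What the index would hold for snippetEllipsisText after uncamelizing.
indexed = SEPARATOR.join(["snippet", "Ellipsis", "Text"])

print(unicodedata.name(SEPARATOR))               # INVISIBLE SEPARATOR
print(len("snippetEllipsisText"), len(indexed))  # 19 21
print(indexed.split(SEPARATOR))                  # ['snippet', 'Ellipsis', 'Text']
print(indexed.replace(SEPARATOR, "") == "snippetEllipsisText")  # True
```

The indexed string is two characters longer than the original, yet it renders the same in a browser and can be turned back into the original by removing the separator.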
Last but not least: the no result issue for “snippetEllipsisText” and “EllipsisT”
Searching for “snippetEllipsisText” no longer returns any result, since the index no longer contains the word snippetEllipsisText.
Searching for “EllipsisTex” does not work because the word “EllipsisText” is not indexed: we indexed “Ellipsis” and “Text” but not “EllipsisText”.
Note that “EllipsisText” returns the expected result because it’s one typo away from “Ellipsis Text”; the same goes for “EllipsisT”. That’s better, but we would rather have the engine consider it a match with 0 typos.
Fortunately the Algolia engine has a handy synonym feature.
First things first: “snippetEllipsisText”
We can just add a one-way synonym:
snippetEllipsisText => snippet Ellipsis Text
Then for “EllipsisT”, what we want in the end is another one-way synonym:
EllipsisText => Ellipsis Text
But we need this to be generic. If we summarize we want to:
The following schema should help you understand:
Let’s consider “snippetEllipsisText” as “A B C”. We are going to create the following one-way synonyms:
ABC => A B C
BC => B C
C => C (we actually don’t need this one, as it’s already handled by the initial splitting)
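The generic rule can be sketched in a few lines. This is an illustration under stated assumptions, not the final DocSearch code: `one_way_synonyms` is a hypothetical helper that takes the already-split parts and returns a plain dict, not the format the synonyms API expects.

```python
def one_way_synonyms(parts):
    """Map each joined suffix of two or more parts to its space-separated form.

    Single-part suffixes (the "C => C" case) are skipped on purpose.
    """
    synonyms = {}
    for start in range(len(parts) - 1):
        suffix = parts[start:]
        synonyms["".join(suffix)] = " ".join(suffix)
    return synonyms


for source, target in one_way_synonyms(["snippet", "Ellipsis", "Text"]).items():
    print(source, "=>", target)
# snippetEllipsisText => snippet Ellipsis Text
# EllipsisText => Ellipsis Text
```

A word with n parts yields n - 1 synonyms, so the number of synonyms grows linearly with the number of words inside each camel-cased identifier.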
You can have a look at the final code here.
Final result:
Handling camel case seemed like an easy thing, but after having to handle it I can fairly say it’s not that simple after all, because it involves a lot of edge cases. The work we did here greatly improves the search for parameters in our doc, and the search for all DocSearch implementations that are already live.
One area where DocSearch doesn’t shine yet is searching in API documentation generated from code, like JavaDoc, where camel case is omnipresent. This work is a big step forward in making that available.