Search relevance is always top of mind at Algolia. It is one of our differentiating factors, and we are always pushing ourselves to make sure our search engine is as relevant as possible.
One of the ways we ensure relevance is with our custom ranking feature, which is a very powerful tool if you know how to use it. One issue you may run into, however, is that custom ranking attributes that span a wide range of values may have too fine a granularity. Think for example of the number of views a photo might have. If you want to take multiple custom ranking attributes into consideration in order to get a good mix of results, you need to reduce the precision of this attribute or the other attributes may never be used.
To understand why, it’s important to revisit Algolia’s tie-breaking algorithm. Every time multiple records have the same score, we bucket them out based on the currently examined ranking factor and then create smaller and smaller buckets until we have exhausted our ranking factors.
If records have gotten through all of the textual relevance factors and are still tied, we take a look at custom ranking factors. Let’s say that our custom ranking is set up like this:
1. Photo views (descending), with our most popular photos having millions of views and new photos having 0
2. Number of likes (descending), with values ranging from 0 to thousands
3. Date added (descending)
Since we want the most popular photos to be displayed first, we will achieve this with our first factor. But this will, in most cases, be the only factor considered because the values for this attribute are so precise. Consider six photos tied in textual relevance, with the following custom ranking attributes:
In this case, the photos would be ranked in descending order by number of views (9768, 5341, 1000, 1000, 25, 10). Since only two of them are tied in the same bucket (views equal to 1000), we only examine the second custom ranking criterion for those two photos. And, because the number of likes for those two photos is different, we never actually look at the date added at all.
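The effect can be reproduced with an ordinary lexicographic sort. The sketch below is only a model of the tie-breaking described here, not Algolia’s actual implementation: the view counts come from the example, while the likes, the dates and the helper names (`rank`, `deciding_attribute`) are hypothetical values invented for illustration.

```python
from datetime import date

# View counts are taken from the example above; the likes and dates are
# hypothetical placeholders chosen only to make the sketch runnable.
photos = [
    {"views": 1000, "likes": 40, "added": date(2016, 1, 10)},
    {"views": 25, "likes": 9, "added": date(2016, 1, 18)},
    {"views": 9768, "likes": 310, "added": date(2015, 11, 2)},
    {"views": 10, "likes": 7, "added": date(2016, 1, 19)},
    {"views": 1000, "likes": 85, "added": date(2015, 12, 24)},
    {"views": 5341, "likes": 120, "added": date(2015, 12, 1)},
]

ATTRIBUTES = ("views", "likes", "added")  # custom ranking order, all descending


def rank(records):
    """Sort descending by each attribute in turn (a lexicographic sort)."""
    return sorted(
        records,
        key=lambda r: tuple(r[a] for a in ATTRIBUTES),
        reverse=True,
    )


def deciding_attribute(a, b):
    """Return the first attribute on which two records differ, or None."""
    for attr in ATTRIBUTES:
        if a[attr] != b[attr]:
            return attr
    return None


ranked = rank(photos)
# Which attributes actually separated neighbouring results?
used = {deciding_attribute(a, b) for a, b in zip(ranked, ranked[1:])}
print([p["views"] for p in ranked])
print(sorted(used))
```

With these values, `used` contains only `views` and `likes`: the date added never decides the order of any two neighbouring photos.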
Why does this matter?
If you only want the number of likes and the date added to act as tie-breakers, it doesn’t. But it matters a lot if you want your results to display a good mix of well-viewed photos, well-liked photos and new photos.
Because of the precision of the number of views attribute, you’re not much better off in this case than if you had only used this one attribute for your custom ranking.
What do we do about it?
Quite simply, we need to decrease the range of values by converting continuous values into discrete ones. We can do this in a few different ways, each with its own benefits and drawbacks.
The first way to do this is to create tiers of these values. This means that you take your values and separate them into deciles, quartiles, centiles or any other equal-sized tiers that you desire. From there, you send Algolia the tier each record belongs to. So our record would then look like this (with 10 being the highest tier):
This can be done in-memory or in common databases and is best done with values that don’t change often.
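As a rough in-memory sketch of the tiering idea, the function below assigns each value a tier based on its rank among all values. The name `assign_tiers` and the rule that tied values share the lower tier are assumptions made for this illustration; three tiers are used instead of ten only because the sample has six values.

```python
from bisect import bisect_left


def assign_tiers(values, n_tiers=10):
    """Map each value to a tier from 1 (lowest) to n_tiers (highest).

    Tiers are rank-based: each covers roughly the same number of records.
    Equal values always land in the same tier.
    """
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left gives the rank of the first occurrence of v (0 .. n-1),
    # which is then scaled into the range 1 .. n_tiers.
    return [bisect_left(ordered, v) * n_tiers // n + 1 for v in values]


views = [9768, 5341, 1000, 1000, 25, 10]
tiers = assign_tiers(views, n_tiers=3)
print(tiers)  # [3, 3, 2, 2, 1, 1]
```

Each tier now holds two photos, so the number of likes and the date added get a chance to order the photos inside every tier. Because a tier depends on all the other values, it has to be recomputed whenever the distribution shifts, which is why this approach suits values that don’t change often.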
Another easy way of creating tiers is to reduce the precision of the data itself, independently of other values. For example, a date could be sent with values by day (20160119) or by hour (2016011922).
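A minimal sketch of this kind of truncation for dates, using only the Python standard library. The function name `date_bucket` and its `granularity` parameter are hypothetical, chosen for this example.

```python
from datetime import datetime

# Format strings for each supported precision (assumed set, extend as needed).
FORMATS = {"day": "%Y%m%d", "hour": "%Y%m%d%H"}


def date_bucket(dt, granularity="day"):
    """Truncate a datetime to an integer such as 20160119 or 2016011922."""
    return int(dt.strftime(FORMATS[granularity]))


added = datetime(2016, 1, 19, 22, 47, 5)
print(date_bucket(added, "day"))   # 20160119
print(date_bucket(added, "hour"))  # 2016011922
```

Every photo added on the same day (or in the same hour) ends up with the same value, so the next custom ranking attribute decides the order among them.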
Calculating the logarithm
Another option is to take the logarithm of each value and round down to the nearest integer. The choice of base (natural log, log10 or anything else) doesn’t matter much: it only changes how wide each bucket is, so you can use whichever is simplest to calculate.
This also creates larger buckets at the high end, which is valuable because there’s a much larger difference between 10 views and 1,000 views than there is between 1,000,010 views and 1,001,010 views.
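A short sketch of the logarithmic bucketing, here with base 10. The function name `log_bucket` is hypothetical, and sending values below 1 to bucket 0 is an assumption made for this example, since the logarithm of 0 is undefined.

```python
import math


def log_bucket(value):
    """Return floor(log10(value)), treating anything below 1 as bucket 0."""
    if value < 1:
        return 0  # assumption: photos with 0 views share the lowest bucket
    return math.floor(math.log10(value))


views = [9768, 5341, 1000, 1000, 25, 10]
print([log_bucket(v) for v in views])                # [3, 3, 3, 3, 1, 1]
print(log_bucket(1_000_010), log_bucket(1_001_010))  # 6 6
```

The four photos with thousands of views now share one bucket, so likes and date added decide their relative order, while 10 views and 1,000 views still land in different buckets. Unlike tiers, each bucket is computed from the record’s own value, so nothing needs to be recalculated when other records change.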
A final option is to create a custom score at indexing time. This isn’t really a great option because you lose a lot of what makes Algolia so powerful. We will go into the pros and cons of this approach in an upcoming blog post.
What’s right for you?
So what’s the right approach for your situation? It really depends on how often your data changes and how many pieces of data there are. With data that changes very often or with a large set of records, a logarithm might make more sense. For records where values are clumped closely together, perhaps a tiering system would work best. In general, we go with the logarithmic system, but give both a try and see what works best for you!