Search by Algolia
Haystack EU 2023: Learnings and reflections from our team
ai

Haystack EU 2023: Learnings and reflections from our team

If you have built search experiences, you know creating a great search experience is a never ending process: the data ...

Paul-Louis Nech

Senior ML Engineer

What is k-means clustering? An introduction
product

What is k-means clustering? An introduction

Just as with a school kid who’s left unsupervised when their teacher steps outside to deal with a distraction ...

Catherine Dee

Search and Discovery writer

Feature Spotlight: Synonyms
product

Feature Spotlight: Synonyms

Back in May 2014, we added support for synonyms inside Algolia. We took our time to really nail the details ...

Jaden Baptista

Technical Writer

Feature Spotlight: Query Rules
product

Feature Spotlight: Query Rules

You’re running an ecommerce site for an electronics retailer, and you’re seeing in your analytics that users keep ...

Jaden Baptista

Technical Writer

An introduction to transformer models in neural networks and machine learning
ai

An introduction to transformer models in neural networks and machine learning

What do OpenAI and DeepMind have in common? Give up? These innovative organizations both utilize technology known as transformer models ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

What’s the secret of online merchandise management? Giving store merchandisers the right tools
e-commerce

What’s the secret of online merchandise management? Giving store merchandisers the right tools

As a successful in-store boutique manager in 1994, you might have had your merchandisers adorn your street-facing storefront ...

Catherine Dee

Search and Discovery writer

New features and capabilities in Algolia InstantSearch
engineering

New features and capabilities in Algolia InstantSearch

At Algolia, our business is more than search and discovery, it’s the continuous improvement of site search. If you ...

Haroen Viaene

JavaScript Library Developer

Feature Spotlight: Analytics
product

Feature Spotlight: Analytics

Analytics brings math and data into the otherwise very subjective world of ecommerce. It helps companies quantify how well their ...

Jaden Baptista

Technical Writer

What is clustering?
ai

What is clustering?

Amid all the momentous developments in the generative AI data space, are you a data scientist struggling to make sense ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

What is a vector database?
product

What is a vector database?

Fashion ideas for guest aunt informal summer wedding Funny movie to get my bored high-schoolers off their addictive gaming ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Unlock the power of image-based recommendation with Algolia’s LookingSimilar
engineering

Unlock the power of image-based recommendation with Algolia’s LookingSimilar

Imagine you're visiting an online art gallery and a specific painting catches your eye. You'd like to find ...

Raed Chammam

Senior Software Engineer

Empowering Change: Algolia's Global Giving Days Impact Report
algolia

Empowering Change: Algolia's Global Giving Days Impact Report

At Algolia, our commitment to making a positive impact extends far beyond the digital landscape. We believe in the power ...

Amy Ciba

Senior Manager, People Success

Retail personalization: Give your ecommerce customers the tailored shopping experiences they expect and deserve
e-commerce

Retail personalization: Give your ecommerce customers the tailored shopping experiences they expect and deserve

In today’s post-pandemic-yet-still-super-competitive retail landscape, gaining, keeping, and converting ecommerce customers is no easy ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Algolia x eTail | A busy few days in Boston
algolia

Algolia x eTail | A busy few days in Boston

There are few atmospheres as unique as that of a conference exhibit hall: the air always filled with an indescribable ...

Marissa Wharton

Marketing Content Manager

What are vectors and how do they apply to machine learning?
ai

What are vectors and how do they apply to machine learning?

To consider the question of what vectors are, it helps to be a mathematician, or at least someone who’s ...

Catherine Dee

Search and Discovery writer

Why imports are important in JS
engineering

Why imports are important in JS

My first foray into programming was writing Python on a Raspberry Pi to flicker some LED lights — it wasn’t ...

Jaden Baptista

Technical Writer

What is ecommerce? The complete guide
e-commerce

What is ecommerce? The complete guide

How well do you know the world of modern ecommerce?  With retail ecommerce sales having exceeded $5.7 trillion worldwide ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Data is king: The role of data capture and integrity in embracing AI
ai

Data is king: The role of data capture and integrity in embracing AI

In a world of artificial intelligence (AI), data serves as the foundation for machine learning (ML) models to identify trends ...

Alexandra Anghel

Director of AI Engineering

Looking for something?

facebookfacebooklinkedinlinkedintwittertwittermailmail

One of the most unique and most-used features of Algolia is the Distinct feature: it enables developers to deduplicate records on-the-fly at query time based on one specific attribute. We introduced this feature three years ago, opening up a broad range of new use cases like the deduplication of product variants – a must-have for any eCommerce search. It has also considerably changed the way we recommend handling big documents like PDFs or web pages (we recommend splitting each document into several records).

This post highlights the advantages of this feature and the different challenges it represents in term of implementation.

Document search: state of the art and limitations

Searching inside large documents is probably the oldest challenge of information retrieval. It was inspired from the index at the end of books and gave the birth to an entire sub-genre of statistics. Among those methods, tf-idf and BM25 are the two most popular. They are very effective for ranking documents, but they don’t handle false positives well – they push them to the bottom of the results.

To illustrate this problem, you can perform the “states programming” query on Google and Wikipedia. According to Google Trends, this query is as popular as the Rust programming language and seems to correspond to developers that search how to develop an algorithm with a state. The same query on Wikipedia is unfortunately not very relevant! The first issue probably comes from a stemming algorithm or a synonym that considers “program” as the same than “programming.” In Algolia, you can have an expansion of singular/plural without polluting results with stemming by using the ignorePlurals feature, which is based on a linguistic lemmatizer.

That said, even if you scroll through the first 1000 hits of Wikipedia search, you won’t find the article which appears first in Google. There are hundreds of articles that contain both the word “states” and “programming.” Even the “United States” page contains both terms and one is in the title! In this case, tf-idf and BM25 are not the most useful ranking criteria. The position of the two query terms in the document is more important, in addition to their relative distance (finding them close together is always better).

States programming query on Google and Wikipedia -Distinct feature
States programming query on Google and Wikipedia

Why do we split large documents

One of the best ways to avoid such relevance problems is to split big pages into several records – for example, you could create one record per section of the Wikipedia page. If a Wikipedia article were to have 4 main sections, we could create 4 different records, with the same article title, but different body content.

For example, instead of the following big record with the four sections embedded:

We could have 5 different records, one for the main one and one per section. All those records will share the same source attribute that we will use to indicate the engine that they are all component of the same article:

We can use this same approach for technical documentation, as there are often just a few long pages in order to improve the developer experience: scrolling is better than loading new pages! This is exactly why we have decided to split pages into several records in our DocSearch project; you can learn more about our approach in this blog post.

It might sound counter-intuitive, but the problem is even more visible when your data set is smaller! While the problem is only visible on well-selected queries on the Wikipedia use case, it becomes very apparent when you search inside a website with less content. You have a high probability of having false positives in your search that will frustrate for your users.

The need for deduplication

Splitting a page into several records is usually easy as you can use the formatting as a way to split. The problem with several records per document is that you will introduce duplicates in your search results. For example, all paragraphs of the United States Wikipedia article will match for the “United States” query. In this case, you would like to keep only the best match and avoid duplicates in the search results.

This is where the distinct feature comes to play! You just have to introduce one attribute with the same value for all records of the same source (typically an ID of the document) and declare it as the attributeForDistinct in your index setting. At query time, only the best match for each value of the attribute for distinct will be kept, all the other records will be removed to avoid any duplicate.

If we split the United States Wikipedia article by section and subsections, it would generate 39 records. You can find the first four records on this gist:

This split by section reduces a lot the probability of noise on large articles. It avoids the query “Etymology Indigenous” to match this article (which would typically match because there is a section called “Etymology” and another one called “Indigenous and European contact”). To improve the relevance, we have also divided the records into three different types that we order from the most important to the less important via the recordTypeScore attribute:

  • “main”, this is the first section of the article, including the title and synonyms (score of 3)
  • “section”: those are the main sections of the article (score of 2)
  • “subsection”: subdivision inside the sections of the article (score of 1)

The recordTypeScore attribute will be used in the ranking to give more importance to the “main” records than the “section” and “subsection”.

Another good property of this split is that it allows linking to the right section of the page when you click on the search result. For example, if you search for “united states Settlements”, you will be able to open the United States page with the anchor text stored in the record.

Here is the list of the four Algolia settings we applied on this index:

You can set those settings via the setSettings API call or directly on the dashboard:

  • searchableAttributes and customRanking can be configured in the Ranking tab
Algolia Dashboard: Searchable & Ranking Attributes - Distinct feature
Algolia Dashboard: Searchable & Ranking Attributes
  • ignorePlurals can also be configured in the Ranking tab. For the moment you can only enable it for all languages in the dashboard (setting it to true), we will improve this setting to let you configure also the language.
  • attributesForDistinct can be configured in the Display tab of the dashboard:
Algolia Dashboard: Group by - Distinct feature
Algolia Dashboard: Group by

Each of those settings is important for the relevance:

  • The customRanking uses the recordTypeScore to give importance to the main chapter in the case of equality on the textual relevance. In case of equality, we use the popularity attribute which represents the number of backlinks inside Wikipedia to this article (all records of the article keep the same popularity)
  • The order of attributes in searchableAttributes gives the following decreasing importance for the matching attributes, title > text > synonyms > sourceArticle
  • The attributeForDistinct is applied on the name of the article, only the best elements will be displayed for each matched articles
  • We set ignorePlurals to “en” to consider all singular/plurals form of English as identical without introducing any noise (you can also set this setting at query time if you have a search in a different language using the same index)

With those settings, you can easily request a set of results without duplicates with the query parameter distinct=true. You can go even further by requesting several hits per deduplication key. You can, for example, pass distinct=3 in your query to retrieve up to three results per distinct key – this feature is usually known as grouping.

Keeping several records per distinct key

Search engines are also used for navigation or criteria-based search due to the fact that those queries are cheaper than on a regular database. Keeping several records per distinct key makes it easy to bring a different display to search.

For example, if you want to build a list of job offers, you will probably let the user search via different criteria like the role, location, job type, etc – but you might also want to aggregate the search results per company instead of having each job display company information. This is what AngelList is doing when you search for a job on their platform. They display companies along with the three best job offers, this type of display can be built very quickly by:

  • Having one record per job offer with an attribute containing the company ID
  • Configuring the company ID attribute as attributeForDistinct in the index settings
  • Passing distinct=3 in your query

With those three simple actions, you can build an interface like AngelList below (Note that SafeGraph has only two job offers that match the criteria, the value three is the upper-bound limit on the number of records we want per distinct key). Some customers also implement a “Show more” button on each company that allows to display all offers for this company. To implement it, you have just to filter on the company name and specify distinct=false on the query to retrieve all offers of this company.

AngelList job search displaying several jobs per company - Distinct feature
AngelList job search displaying several jobs per company

The Impact of Distinct on faceting

Faceting can very quickly become a headache when distinct is enabled. Let’s illustrate the problem with a very simple example containing two records sharing the same distinct key.

The first record contains the facet values “A” and “B” and the second record contains the facet values “B” and “C”.

Let’s imagine that you have a query that retrieves both records and that faceting and distinct is enabled in your query. You can imagine three different behaviors in this case:

  • Computing the faceting before the deduplication (using distinct). In that case, the possible facet values will be “A=1”, “B=2” and “C=1”. All categories are retrieved but the count won’t fit what you have on screen.
  • Computing the faceting after deduplication (distinct). In this case, the result would be “A=1”, “B=1” OR “B=1”, “C=1” depending on the best matching record. In both cases, the result is not what you expect.
  • Applying deduplication independently for each facet value. In this case, we would have A=1, B=1 and C=1 (for each facet value, we need to look at all distinct key and perform a deduplication at this level).

The third behavior is the holy grail as the result of each facet count would be perfect. Unfortunately, the computation cost is insane and doesn’t scale as it directly depends on the number of facet values which match. In practice, it’s impossible to compute it in real time.

Let’s look at the advantages and drawbacks of the two approaches which can be computed for each query in a reasonable time:

  • Computing the faceting before the deduplication: all retrieved facets are valid but the problem is obviously the counts are misleading because they include the duplicates. That said, this approach works very well if you do not want to display the count, it handle any case including the fact you can have different facets in the records that share the same distinct key.
  • Computing the faceting after the deduplication: this approach works well when you use distinct=1 and all records with the same distinct key share the same facet values, which correspond exactly to the use case of splitting big documents. In other use cases, the results will be disturbing for users as some facet values can appear/disappear depending on the record which is kept after deduplication.

In other words, there is no perfect solution solving all problems. This is why we have decided to implement both approaches but as the deduplication can be misleading and cause weird effect, we have decided to use the faceting before the deduplication as the standard behavior. We have a query parameter called facetingAfterDistinct=true that can be added to your query when you are splitting some big records. In this case, you have a perfect result because the facets are identical in all records sharing the same distinct key.

Great features take time

We introduced the distinct feature in December 2013 and have improved it significantly in July 2015 by adding the support of distinct=N in the engine. The different faceting strategies have been a long-standing challenge and it took us a lot of time to find the best approach. The query parameter facetingAfterDistinct=true has been beta tested with real users for a few months before becoming public in December 2016.

This feature opened a lot of possibilities for our users and allowed a better indexing of big records that later gave the birth to our DocSearch project.

If you want to know more about the internal of the Algolia engine, we recommend to read the other posts in this series:

About the author
Julien Lemoine

Co-founder & former CTO at Algolia

githublinkedintwitter

14-day free trial

Create a full-featured search experience in no time.

Get started
14-day free trial

Recommended Articles

Powered byAlgolia Algolia Recommend

Inside the Engine Part 8: Handling Advanced Search Use Cases
engineering

Julien Lemoine

Co-founder & former CTO at Algolia

Algolia's top 10 tips to achieve highly relevant search results
product

Julien Lemoine

Co-founder & former CTO at Algolia

Best webpage structure & indexing for document search: the Laravel Example
engineering

Julien Lemoine

Co-founder & former CTO at Algolia