
Distinct

Last updated 01 August 2017

Distinct Overview

Algolia is not meant to be used as a traditional database, or a database at all — it’s a search engine, so the approaches and data structures involved are fundamentally different. Still, it’s sometimes useful to borrow some concepts from the database world. One of those concepts is the parent-child, or one-to-many relationship. These relationships, among others, can be successfully modeled using Algolia’s distinct feature.

Distinct Basics

The attribute used for distinct is configured with the attributeForDistinct parameter, either through the API or through Algolia’s dashboard. This parameter is set per index, and it must be set before the distinct feature will work.

Note that distinct can be set at either indexing time or query time. Setting it at indexing time makes sense when the experience should be the same for every user. Setting it at query time suits cases where a user may want to see more or fewer results in each distinct group.

Setting Distinct at Index Time

index.setSettings({
  distinct: true
});
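
Since attributeForDistinct must be set before distinct has any effect, the two settings are often configured together. The fragment below is a minimal sketch of such a settings object. The attribute 'name' is an assumption chosen for illustration, and the object would be passed to index.setSettings as in the snippet above.

```javascript
// Settings object only; 'name' is an assumed attribute, used for illustration.
const settings = {
  attributeForDistinct: 'name', // attribute whose value defines a group
  distinct: true                // keep one record per distinct value
};

console.log(settings);
```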

Setting Distinct at Query Time

index.search({
  query: 'query',
  distinct: 2
});

Keep in mind

  • Distinct is a computationally expensive operation on large data sets, especially if distinct > 1
  • distinct(true) is the same as distinct(1)
  • distinct(false) is the same as distinct(0)
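
The equivalence between the boolean and integer forms can be sketched as a small mapping. The helper normalizeDistinct is hypothetical and not part of the Algolia API client. It only illustrates the two rules listed above.

```javascript
// Hypothetical helper, not part of the Algolia API client: maps the
// boolean forms of `distinct` onto their integer equivalents.
function normalizeDistinct(value) {
  if (value === true) return 1;   // distinct(true) is the same as distinct(1)
  if (value === false) return 0;  // distinct(false) is the same as distinct(0)
  if (Number.isInteger(value) && value >= 0) return value;
  throw new TypeError('distinct must be a boolean or a non-negative integer');
}

console.log(normalizeDistinct(true));  // 1
console.log(normalizeDistinct(false)); // 0
console.log(normalizeDistinct(3));     // 3
```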

Distinct for De-duplication

Let’s look at an example of a restaurant that has several addresses. Instead of storing all of a restaurant’s addresses in a single record like this:

{
  "name": "Burger King",
  "addresses" : [
    {
      "street": "1298 Howard St",
      "city": "San Francisco"
    },
    {
      "street": "1200 Market St",
      "city": "San Francisco"
    },
    {
      "street": "819 Van Ness Ave",
      "city": "San Francisco"
    }
  ]
}

we might choose to store each location in its own record:

{
  "name": "Burger King",
  "street": "1298 Howard St",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.775551,
    "lng": -122.413132
  }
},
{
  "name": "Burger King",
  "street": "1200 Market St",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.778643,
    "lng": -122.415278
  }
},
{
  "name": "Burger King",
  "street": "819 Van Ness Ave",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.783091,
    "lng": -122.421310
  }
}

This, however, presents an interesting problem — if a user searches for restaurants near their current location, they may see multiple Burger Kings, when really only the nearest one is likely relevant. It would be better to show a variety of restaurants, and that’s where distinct comes in.

The distinct feature allows us to remove duplicates by specifying the attribute to use for de-duplication. In this case, the name attribute for the restaurant chain would be used to de-duplicate records. In some sense, name can be thought of as indicating the parent, while each location is a child.
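
To make the effect concrete, here is a plain JavaScript sketch of what de-duplication on name amounts to. The engine does this work itself once attributeForDistinct is set to name and distinct is enabled; the function dedupeByAttribute and the second restaurant are made up for the example.

```javascript
// Illustrative only: the engine performs this de-duplication itself.
// `dedupeByAttribute` is a hypothetical name used for this sketch.
function dedupeByAttribute(rankedHits, attribute) {
  const seen = new Set();
  const result = [];
  for (const hit of rankedHits) {
    const key = hit[attribute];
    if (seen.has(key)) continue; // a better-ranked record with this value was already kept
    seen.add(key);
    result.push(hit);
  }
  return result;
}

// Hits assumed to be sorted by distance from the user, nearest first.
// 'Example Diner' is a made-up restaurant.
const rankedHits = [
  { name: 'Burger King', street: '1298 Howard St' },
  { name: 'Example Diner', street: '1 Example St' },
  { name: 'Burger King', street: '1200 Market St' },
  { name: 'Burger King', street: '819 Van Ness Ave' }
];

// Keeps the nearest Burger King and the Example Diner record.
console.log(dedupeByAttribute(rankedHits, 'name'));
```

Only the best-ranked record survives for each name, so the remaining slots on the page go to other restaurants.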

Distinct for Grouping

Distinct is a handy tool anytime one-to-many relationships are required.

Job Postings

Let’s look at another example. Imagine we are developing a job search with companies and job openings. A company can have several job openings, and we may want to display companies as search results with their associated job openings below. For example, consider a query for “software engineer”. Results might look something like:

[
  {
    "company": "Twilio",
    "jobs": [
      "Principal Software Engineer - Cloud Platform",
      "Software Engineer - Financial Tools",
      "Software Engineer in Test - API"
    ]
  },
  {
    "company": "Algolia",
    "jobs": [
      "Senior Ruby on Rails Engineer",
      "Full-Stack Engineer",
      "Frontend Engineer"
    ]
  }
]

Distinct can be a boolean (true / false), but it can also be set to an integer — distinct=3, for example — to specify the number of results to return for each grouping.

The hitsPerPage parameter controls the number of returned records per page. In the case of jobs and associated companies, if hitsPerPage=10 and distinct=3, up to 30 records will be returned — 10 companies and at most 3 jobs per company. This behavior makes it easy to implement pagination with grouping.
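
The interaction between hitsPerPage and distinct can be sketched in plain JavaScript. This models the first page only and is not the engine's implementation. The function pageWithGrouping is a hypothetical name, and the fourth Twilio opening is made up to show the per-group cap.

```javascript
// Hypothetical sketch of the first page of grouped results.
// hitsPerPage counts groups (companies); distinct caps the records per group.
function pageWithGrouping(rankedHits, attribute, distinct, hitsPerPage) {
  const groups = new Map(); // a Map keeps insertion order, so group ranking is preserved
  for (const hit of rankedHits) {
    const key = hit[attribute];
    if (!groups.has(key)) {
      if (groups.size === hitsPerPage) continue; // the page already holds enough groups
      groups.set(key, []);
    }
    const group = groups.get(key);
    if (group.length < distinct) group.push(hit);
  }
  return [...groups.values()].flat();
}

const jobs = [
  { company: 'Twilio', title: 'Principal Software Engineer - Cloud Platform' },
  { company: 'Algolia', title: 'Senior Ruby on Rails Engineer' },
  { company: 'Twilio', title: 'Software Engineer - Financial Tools' },
  { company: 'Twilio', title: 'Software Engineer in Test - API' },
  { company: 'Twilio', title: 'Example Fourth Opening' }, // made up, to show the cap
  { company: 'Algolia', title: 'Full-Stack Engineer' }
];

const page = pageWithGrouping(jobs, 'company', 3, 10);
console.log(page.length); // 5: three Twilio jobs (capped) and two Algolia jobs
```

The upper bound on records per page is hitsPerPage multiplied by distinct, which is where the figure of 30 comes from.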

Distinct to Index Large Records

For performance reasons, objects in Algolia should be 10 KB or less. What does that mean for records containing lots of data, such as books?

Large records can be split into smaller documents along logical boundaries such as paragraphs or sentences. In the book example, instead of having a single concept of Book with a large content attribute, you’d have a Paragraph model with a content attribute. The paragraphs can then be sorted by their order attribute to reconstruct the original content:

[{
  book_id: 42,
  order: 1,
  paragraph: "Left Munich at 8:35 P. M, on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late."
},
{
  book_id: 42,
  order: 2,
  paragraph: "Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets."
},
{
  book_id: 42,
  order: 3,
  paragraph: "I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible."
}]
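
The split and the reconstruction can be sketched as follows. Both helpers are hypothetical, and the sketch assumes paragraphs are separated by blank lines; a real book would likely need a more careful splitter. The field names match the records shown above.

```javascript
// Hypothetical helper: split a book's text into one record per paragraph.
// Paragraphs are assumed to be separated by blank lines.
function splitIntoParagraphRecords(bookId, text) {
  return text
    .split(/\n\s*\n/)
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0)
    .map((paragraph, index) => ({
      book_id: bookId,
      order: index + 1,
      paragraph
    }));
}

// Rebuild the original content by sorting a book's records on `order`.
function reconstructBook(records) {
  return [...records]
    .sort((a, b) => a.order - b.order)
    .map((record) => record.paragraph)
    .join('\n\n');
}

const text = 'First paragraph.\n\nSecond paragraph.\n\nThird paragraph.';
const records = splitIntoParagraphRecords(42, text);
console.log(records.length);                    // 3
console.log(reconstructBook(records) === text); // true
```

Assuming book_id serves as the attribute for distinct, a query that matches several paragraphs of the same book would return that book once instead of once per paragraph.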

Splitting large records into smaller ones not only improves search performance — it also improves search relevance. Imagine if an entire book’s contents were stored in a single content attribute. If a search query’s first word matched a paragraph on page 10, and the second word matched a paragraph on page 90, the whole record would be returned as a match. However, this result would not be very relevant.

On the other hand, if those two query words were both present in a single paragraph of a book, that paragraph is a good, relevant result to return.

Faceting with Distinct

When using the distinct setting in combination with faceting, facet counts may be higher than expected. This is because the engine computes faceting before applying de-duplication (distinct).

Setting the facetingAfterDistinct query-time parameter to true forces faceting to be computed after de-duplication has been applied.

index.search({
  query: 'query',
  distinct: true,
  facetingAfterDistinct: true
});
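
The difference between the two orderings can be shown with a small simulation. This is an illustration in plain JavaScript, not the engine's implementation, and countFacet and dedupe are hypothetical helpers.

```javascript
// Illustrative only: counts facet values over a list of records.
function countFacet(records, facet) {
  const counts = {};
  for (const record of records) {
    counts[record[facet]] = (counts[record[facet]] || 0) + 1;
  }
  return counts;
}

// Keep the first (best-ranked) record for each value of `attribute`.
function dedupe(records, attribute) {
  const seen = new Set();
  return records.filter((record) => {
    if (seen.has(record[attribute])) return false;
    seen.add(record[attribute]);
    return true;
  });
}

const restaurants = [
  { name: 'Burger King', city: 'San Francisco' },
  { name: 'Burger King', city: 'San Francisco' },
  { name: 'Burger King', city: 'San Francisco' }
];

// Faceting before de-duplication counts every location.
console.log(countFacet(restaurants, 'city'));                 // { 'San Francisco': 3 }
// Faceting after de-duplication counts the single remaining record.
console.log(countFacet(dedupe(restaurants, 'name'), 'city')); // { 'San Francisco': 1 }
```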

