Distinct

Why use Distinct?

Sometimes, we may want to index several variations of a record to improve relevance granularity. For instance:

  • In a “restaurants” index, we may want to create a record for every restaurant location, even though a restaurant chain can have many locations.
  • In a “jobs” index, we may want to create a record for every job opening, even though a company can have many job openings.

The Distinct feature lets us keep such variations of a record indexed for high-quality search while retaining control over how the records are displayed: we may want to remove records that share the same attribute value in favor of the most relevant match. We may also want to keep the N best items, with respect to a specified attribute, for a given set of search results.

Distinct for de-duplication

Let’s look at an example of a restaurant that has several addresses. Instead of adding all the addresses in our record like this:

{
  "name": "Burger King",
  "addresses" : [
    {
      "street": "1298 Howard St",
      "city": "San Francisco"
    },
    {
      "street": "1200 Market St",
      "city": "San Francisco"
    },
    {
      "street": "819 Van Ness Ave",
      "city": "San Francisco"
    }
  ]
}

we may choose to have them in separate records like so:

{
  "name": "Burger King",
  "street": "1298 Howard St",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.775551,
    "lng": -122.413132
  }
},
{
  "name": "Burger King",
  "street": "1200 Market St",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.778643,
    "lng": -122.415278
  }
},
{
  "name": "Burger King",
  "street": "819 Van Ness Ave",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.783091,
    "lng": -122.421310
  }
}

With respect to the example above, say a user lives closest to the Burger King on Market Street. With this new data structure, a query of burger king (with geo-search turned on) does not require any post-processing to identify the best match: the user directly gets the most meaningful result, the one on Market St., ranked highest by the search engine.

As you can see, having one record per restaurant location can help improve relevancy and is easier to leverage (both at search and display time).

Let’s now say we want to maximize the diversity of our search results with respect to restaurant chains. With this new constraint, we only want to see one Burger King restaurant in the results, the closest one to a user’s location for instance. This is where Distinct comes into play.

The Distinct feature allows us to remove duplicates by specifying the attribute to use for de-duplication. In our case, the name attribute for the restaurant chain would be used to de-duplicate records.

  • We can configure the attributeForDistinct to be the restaurant name in the dashboard at the bottom of the Display tab, in the Group by section.
  • To dynamically enable de-duplication at search time, set distinct=true as a query parameter.
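The effect of de-duplication can be imitated locally. The sketch below is plain Python, not Algolia's implementation: it ranks the three records above by great-circle distance to a user, then keeps only the first record per name, which is roughly what distinct=true combined with geo-search would return. The user coordinates and the helper names haversine_km and nearest_per_chain are assumptions made for illustration.

```python
import math

records = [
    {"name": "Burger King", "street": "1298 Howard St",
     "_geoloc": {"lat": 37.775551, "lng": -122.413132}},
    {"name": "Burger King", "street": "1200 Market St",
     "_geoloc": {"lat": 37.778643, "lng": -122.415278}},
    {"name": "Burger King", "street": "819 Van Ness Ave",
     "_geoloc": {"lat": 37.783091, "lng": -122.421310}},
]

def haversine_km(a, b):
    """Great-circle distance in kilometres between two {"lat", "lng"} points."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a["lat"], a["lng"], b["lat"], b["lng"]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_per_chain(records, user, attribute="name"):
    """Rank by distance to the user, then keep the first record per attribute value."""
    ranked = sorted(records, key=lambda r: haversine_km(r["_geoloc"], user))
    seen, kept = set(), []
    for record in ranked:
        if record[attribute] not in seen:
            seen.add(record[attribute])
            kept.append(record)
    return kept

# Hypothetical user position, very close to the Market St restaurant.
user = {"lat": 37.7787, "lng": -122.4152}
result = nearest_per_chain(records, user)
print(result[0]["street"])
```

Only one Burger King record survives, and it is the one nearest to the user, because de-duplication runs on results that are already ranked.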

Distinct for grouping

Let’s look at another example. Imagine again we are developing a job search with companies and job openings. A company can have several job openings and we may want to display companies as search results with their associated (relevant) job openings below them. Take for example a query for software engineer. A desired results structure may look something like this:

  • Twilio
    • Principal Software Engineer - Cloud Platform
    • Software Engineer - Financial Tools
    • Software Engineer in Test - API
  • Algolia
    • Senior Ruby on Rails Engineer
    • Full-stack engineer
    • Frontend engineer

In this scenario we will want to group the results by companies. In order to do that, use the Distinct feature to keep the N best results per company.

  • We can set the attributeForDistinct to be the company name in the dashboard at the bottom of the Display tab, in the Group by section. We can set the Distinct value in the dropdown from [false, true, 2, 3, 4, 5] (true represents 1) to choose the max group size.
  • To enable grouping at query time, we set the maximum number of openings we want to keep per company with distinct=N as a query parameter, where N represents an integer for group size.

For example, for distinct=3, we will have the following results (given that there are many openings for many companies):

  • the first hit contains the best opening (let’s say it is an opening for company X)
  • the second hit will contain the next best opening from company X matching the query
  • the third hit will contain the next best opening after that from company X matching the query
  • the fourth hit contains the best opening for the next company (let’s say it is an opening for company Y)
  • the fifth hit will contain the next best opening from company Y matching the query
  • … and so on

The hitsPerPage query parameter controls the number of companies per page. For example if we specify hitsPerPage=10 with distinct=3, we will have up to 30 hits in our search results (10 companies and a maximum of three openings per company). This behavior makes it easy to implement pagination with grouping.
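The interaction between distinct=N and hitsPerPage can be sketched in a few lines of Python. This is a local illustration of the behavior described above, not the engine's code; the function name grouped_page and the sample openings are hypothetical. Groups are ordered by their best hit, each group is capped at N hits, and a page holds hits_per_page groups rather than hits_per_page hits.

```python
def grouped_page(ranked_hits, attribute, distinct, hits_per_page, page=0):
    """Return the hits of one page when grouping with distinct=N.

    ranked_hits must already be sorted from best to worst match.
    """
    groups = {}  # dicts keep insertion order, so groups stay ordered by their best hit
    for hit in ranked_hits:
        group = groups.setdefault(hit[attribute], [])
        if len(group) < distinct:  # keep at most N hits per group
            group.append(hit)
    ordered = list(groups.values())
    start = page * hits_per_page  # pages count groups, not individual hits
    page_groups = ordered[start:start + hits_per_page]
    return [hit for group in page_groups for hit in group]

ranked = [
    {"company": "X", "title": "opening 1"},
    {"company": "Y", "title": "opening 2"},
    {"company": "X", "title": "opening 3"},
    {"company": "X", "title": "opening 4"},
    {"company": "X", "title": "opening 5"},
    {"company": "Z", "title": "opening 6"},
    {"company": "Y", "title": "opening 7"},
]
first_page = grouped_page(ranked, "company", distinct=3, hits_per_page=2)
print([(hit["company"], hit["title"]) for hit in first_page])
```

With distinct=3 and hits_per_page=2, the first page holds two companies: three openings for X (the fourth one is dropped) followed by both openings for Y. Company Z moves to the next page.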

Distinct to index big records

As mentioned on our pricing page, we limit records to a maximum weight of 10KB of JSON. This limit can be increased for specific use-cases but we don’t recommend doing that; there is a better way to index big documents containing a big chunk of text, whether it be a large PDF or this web page for that matter.

As a rule of thumb, splitting a very large record into multiple smaller records is a good idea for relevancy. For example, say a two-word query matches a single 100-page document record (a very large record): the first word matches a paragraph on page 10, and the second word matches a paragraph on page 90. Such a match is not very relevant to the query, so ideally we want the engine to avoid returning this record.

To improve the overall search relevancy for our example, we recommend splitting big chunks of text into several pieces and creating one record for each piece.

On a web page, there could be one record per paragraph, associated with the title of the page and all h1, h2, h3 tags that precede the paragraph. The record for this paragraph, for example, could be:

{
  "url": "https://www.algolia.com/doc",
  "title": "Algolia Documentation for Ruby",
  "h1": "Distinct",
  "anchor": "#distinct",
  "paragraph": "....",
  "objectID": "https://www.algolia.com/doc_42"
}

Don’t forget to generate a different objectID for each paragraph/record. This can be done by suffixing the URL with the paragraph number.
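As a minimal sketch of this splitting step, the Python function below builds one record per paragraph and derives each objectID from the URL and the paragraph number. The function name paragraph_records, the sample paragraph texts, and the choice to start numbering at 1 are assumptions made for illustration.

```python
def paragraph_records(url, title, h1, anchor, paragraphs):
    """Build one record per paragraph, each with a unique objectID."""
    return [
        {
            "url": url,
            "title": title,
            "h1": h1,
            "anchor": anchor,
            "paragraph": text,
            # URL suffixed with the paragraph number keeps objectIDs unique
            "objectID": f"{url}_{number}",
        }
        for number, text in enumerate(paragraphs, start=1)
    ]

records = paragraph_records(
    url="https://www.algolia.com/doc",
    title="Algolia Documentation for Ruby",
    h1="Distinct",
    anchor="#distinct",
    paragraphs=["first paragraph", "second paragraph"],  # placeholder texts
)
print([record["objectID"] for record in records])
```

Every record shares the same url value, which is what makes url usable as the attribute for distinct later on.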

A potential drawback of this approach is that you might have a high count of “duplicates”. For example, if you search for the title of the page, all records (paragraphs) will match.

Using distinct will give us the ability to remove these duplicates.

  • In our example, we would use url as the attribute for distinct, which can be configured in your dashboard at the bottom of the Display tab, in the Group by section.
  • To enable de-duplication at query time, pass distinct=true as a query parameter.

If several hits have the same value for the url attribute, then only the best one is kept and others are removed.

EXAMPLE:

Let’s go back to our example of indexing web pages at the paragraph level. In addition to paragraphs, it would also be beneficial to index page titles and section headings, and to rank each kind of record accordingly. This behavior is easy to achieve by having several types of records:

  • The title of the page with the URL and some meta, type attribute set to 1.
  • All H1 of the page (contains title and the content of the h1), type attribute set to 2.
  • All H2 of the page (contains the title, the content of the h1 parent and the h2), type attribute set to 3.
  • All H3 of the page (contains the title, the content of the h1-h2 parents and the h3), type attribute set to 4.
  • All H4 of the page (contains the title, the content of the h1-h3 parents and the h4), type attribute set to 5.
  • All H5 of the page (contains the title, the content of the h1-h4 parents and the h5), type attribute set to 6.
  • All paragraphs of the page (contains the title, the content of the h1-h5 parents and the content of the paragraph), type attribute set to 7.

Finally, you can set the customRanking attribute to asc(type) in your index settings to have the type of record taken into account in the result.

With this customRanking configuration and use of distinct, you are given the ability to search large sets of data while maintaining a high bar for relevancy.
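The record types listed above could be produced by walking a page's elements in document order, as in the following Python sketch. It is only one possible way to build such records: the function name typed_records, the (tag, text) input format, and the sample content are hypothetical, and objectID generation is left out for brevity.

```python
TYPE_BY_TAG = {"title": 1, "h1": 2, "h2": 3, "h3": 4, "h4": 5, "h5": 6, "p": 7}
HEADINGS = ["h1", "h2", "h3", "h4", "h5"]

def typed_records(url, title, elements):
    """Build typed records from a list of (tag, text) pairs in document order."""
    records = [{"url": url, "title": title, "type": TYPE_BY_TAG["title"]}]
    parents = {}  # current heading text for each open heading level
    for tag, text in elements:
        record = {"url": url, "title": title, "type": TYPE_BY_TAG[tag]}
        if tag in HEADINGS:
            # a new heading closes its own level and every deeper one
            for closed in HEADINGS[HEADINGS.index(tag):]:
                parents.pop(closed, None)
            record.update(parents)  # headings above this one
            record[tag] = text
            parents[tag] = text
        else:
            record.update(parents)  # all headings the paragraph sits under
            record["paragraph"] = text
        records.append(record)
    return records

records = typed_records(
    "https://www.algolia.com/doc",
    "Algolia Documentation for Ruby",
    [
        ("h1", "Distinct"),
        ("p", "placeholder paragraph"),
        ("h2", "Distinct for grouping"),
        ("p", "another placeholder paragraph"),
        ("h1", "Another section"),
    ],
)
print([record["type"] for record in records])
```

Because lower type values mark more significant records, sorting by asc(type) favors a title or heading match over a paragraph match from the same page.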

A Full Example: De-duplicating variants of products

Check out the live demo or the source code

To further illustrate the distinct feature, we’ll take the example of an e-commerce website selling multiple models of t-shirts and sweatshirts. Each model is available in a range of colors.

The simplest solution would be to have one record per model of clothing, and not to care about the colors.

The issue with this approach is that when someone types red t-shirt, results will yield all the models of t-shirts that have at least one variant in red and display them; however, the thumbnail of the objects would not necessarily be red. These results - which fail to cue a user that red shirts have been found - can lead to a confusing search experience.

Let’s instead index our data so that when someone types red t-shirt, the search page contains only products with a red thumbnail.

Formatting our data

We’re using a sample dataset with two models of t-shirts (A & B) and two models of sweatshirts (C & D). The four models are available in multiple colors (Red, Green, Blue, Orange…).

We create one record for each color variant of each object. And in each record, we have an attribute specifying its model (A, B, C or D).

The records also have an attribute number_of_likes that will be used in the Custom Ranking setting. Here’s what our records look like:

[
  {
    "objectID": "sweatshirt-C-blue",
    "name": "sweatshirt",
    "model": "C",
    "color": "blue",
    "image_slug": "sweatshirt-C-blue.png",
    "number_of_likes": 910
  },
  {
    "objectID": "sweatshirt-C-red",
    "name": "sweatshirt",
    "model": "C",
    "color": "red",
    "image_slug": "sweatshirt-C-red.png",
    "number_of_likes": 515
  },
  // [...]
]
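Records of this shape could be generated from a per-model description, as in the Python sketch below. The function name variant_records and its likes_by_color argument are hypothetical; the objectID and image_slug patterns simply follow the two records shown above.

```python
def variant_records(name, model, likes_by_color):
    """Create one record per color variant of a given model."""
    records = []
    for color, likes in likes_by_color.items():
        object_id = f"{name}-{model}-{color}"
        records.append({
            "objectID": object_id,
            "name": name,
            "model": model,  # shared by all variants, used later for distinct
            "color": color,
            "image_slug": f"{object_id}.png",
            "number_of_likes": likes,
        })
    return records

records = variant_records("sweatshirt", "C", {"blue": 910, "red": 515})
print([record["objectID"] for record in records])
```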

Index configuration

To get the best relevance, we will tune the index settings. The most important ones are the searchable attributes (searchableAttributes) and the attributes reflecting record popularity (customRanking). We will also enable the distinct feature to de-duplicate the color variants.

Searchable Attributes

We want to be able to search on the model, name and color attributes of our records.

Ruby:
Algolia.init_index('distinct_tutorial').set_settings({"searchableAttributes"=>["model", "name", "color"]})

PHP:
<?php
$client->initIndex("distinct_tutorial")->setSettings(array("searchableAttributes" => array("model", "name", "color")));

JavaScript:
client.initIndex('distinct_tutorial').setSettings({"searchableAttributes":["model", "name", "color"]});

Python:
client.init_index('distinct_tutorial').set_settings({"searchableAttributes":["model", "name", "color"]})

Java:
client.initIndex("distinct_tutorial").setSettings(new IndexSettings().setSearchableAttributes(Arrays.asList("model", "name", "color")));

Scala:
val result: Future[Task] = client.execute {
  changeSettings of "distinct_tutorial" `with` IndexSettings(
    searchableAttributes = Some(Seq(SearchableAttributes.attributes("model", "name", "color")))
  )
}

Go:
res, err := client.InitIndex("distinct_tutorial").SetSettings(algoliasearch.Map{
	"searchableAttributes": []string{"model", "name", "color"},
})

Custom Ranking

The number_of_likes attribute in our objects is a good indicator of the popularity of the products. We’ll use it in the Custom Ranking to improve the relevance by using this business metric.

Ruby:
Algolia.init_index('distinct_tutorial').set_settings({"customRanking"=>["desc(number_of_likes)"]})

PHP:
<?php
$client->initIndex("distinct_tutorial")->setSettings(array("customRanking" => array("desc(number_of_likes)")));

JavaScript:
client.initIndex('distinct_tutorial').setSettings({"customRanking":["desc(number_of_likes)"]});

Python:
client.init_index('distinct_tutorial').set_settings({"customRanking":["desc(number_of_likes)"]})

Java:
client.initIndex("distinct_tutorial").setSettings(new IndexSettings().setCustomRanking(Arrays.asList("desc(number_of_likes)")));

Scala:
val result: Future[Task] = client.execute {
  changeSettings of "distinct_tutorial" `with` IndexSettings(
    customRanking = Some(Seq(
      CustomRanking.desc("number_of_likes")
    ))
  )
}

Go:
res, err := client.InitIndex("distinct_tutorial").SetSettings(algoliasearch.Map{
	"customRanking": []string{"desc(number_of_likes)"},
})

Distinct during Configuration

The distinct feature is used to limit the number of results having the same value on a specific attribute. For example, if you apply it to the attribute color, you’ll only retrieve one blue product, one red product, one orange product… in the search results.

In our case, we want to de-duplicate the model of the product (have only 1 result for each model A, B, C, or D). So we need to declare the attribute model in attributeForDistinct:

Ruby:
Algolia.init_index('distinct_tutorial').set_settings({"attributeForDistinct"=>"model"})

PHP:
<?php
$client->initIndex("distinct_tutorial")->setSettings(array("attributeForDistinct" => "model"));

JavaScript:
client.initIndex('distinct_tutorial').setSettings({"attributeForDistinct":"model"});

Python:
client.init_index('distinct_tutorial').set_settings({"attributeForDistinct":"model"})

Java:
client.initIndex("distinct_tutorial").setSettings(new IndexSettings().setAttributeForDistinct(Arrays.asList("model")));

Scala:
val result: Future[Task] = client.execute {
  changeSettings of "distinct_tutorial" `with` IndexSettings(
    attributeForDistinct = Some("model")
  )
}

Go:
res, err := client.InitIndex("distinct_tutorial").SetSettings(algoliasearch.Map{
	"attributeForDistinct": "model",
})

Distinct at Query Time

Warning: If distinct is enabled, the number of hits and the facet counts returned by Algolia no longer correspond to the number of results you see. We recommend not displaying these values on your page.

For Distinct to work, you need to have configured attributeForDistinct.

distinct is a search parameter that accepts either a boolean or an integer. You can set distinct by doing:

JavaScript:
index.search('query_string', { distinct: true }, searchCallback);

JavaScript helper:
helper.setQueryParameter('distinct', true);

Swift:
let query = Query()
query.distinct = 1
index.search(query) { (content, error) in
    // [...]
}

Objective-C:
Query* query = [Query new];
query.distinct = [NSNumber numberWithInt:1];
[index search:query completionHandler:^(NSDictionary* content, NSError* error) {
    // [...]
}];

Android (Java):
Query query = new Query();
query.setDistinct(1);
index.searchAsync(query, new CompletionHandler() {
    @Override
    public void requestCompleted(JSONObject content, AlgoliaException error) {
        // [...]
    }
});

Once again, distinct de-duplicates objects that have the same value for the attribute set in attributeForDistinct (in our case, model).

Note: At either configuration time or query time, distinct can be set to one of the following values:

  • false, or 0 will disable the distinct feature and make a regular search
  • true, or 1 will return only one result per model of product
  • 2 will return two results per model of product
  • 3 will return three results per model of product
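Since distinct accepts either a boolean or an integer, client code that reasons about group sizes may want to normalize the value first. The helper below is a hypothetical Python sketch of that mapping (the name group_size is not part of any Algolia client); note that booleans are tested before integers because bool is a subclass of int in Python.

```python
def group_size(distinct):
    """Translate a distinct value into a maximum number of hits per group.

    Returns None when distinct is disabled (no limit per group).
    """
    if isinstance(distinct, bool):  # must come first: True is also an int
        return 1 if distinct else None
    if isinstance(distinct, int) and distinct >= 0:
        return distinct or None  # 0 disables the feature
    raise ValueError(f"unsupported distinct value: {distinct!r}")

for value in (False, 0, True, 1, 2, 3):
    print(repr(value), "->", group_size(value))
```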

With respect to our e-commerce example:

Distinct = true

Results with distinct = true

When distinct is set to true, we get one color for each model. The variant that has been selected is the one with the best value in its Custom Ranking (here, the number of likes).

Distinct = 2

Results with distinct = 2

With distinct set to 2, we get the best two variants for each model, ranked by Custom Ranking.

Distinct = 3 and query = “tshirt”

Results with distinct = 3 and query="tshirt"

And of course, we can combine distinct and all the other regular search parameters, like a query or a facet.

Distinct for this e-commerce example has allowed us to:

  1. Ensure that one popular model of t-shirt won’t be overrepresented on the results page and hide all other products. We now only display the most popular variants of each model.
  2. Retrieve the right thumbnail for each product when someone types a color in the search input (the query red t-shirt will retrieve only the red variants of each model, with the right thumbnail).

Conclusion

Distinct is a versatile feature for de-duplicating similar results, whether it is used to increase the diversity of record types or to help break down large text documents.
