Distinct

How Distinct works

Sometimes you may want to index several variations of a single record, for example different locations for one restaurant or different job offers for a company. The Distinct feature lets you index such variations as separate records while keeping control over how they are displayed (for example, by removing duplicates or keeping only the N best items).

Warning: If distinct is enabled, the number of hits and the facet counts returned by Algolia no longer correspond to the number of results you see. We recommend NOT displaying these values on your page.

Distinct for de-duplication

Let’s take the example of a restaurant that has several addresses. Instead of adding all the addresses to one record like this:

{
  "name": "Burger King",
  "addresses" : [
    {
      "street": "1298 Howard St",
      "city": "San Francisco"
    },
    {
      "street": "1200 Market St",
      "city": "San Francisco"
    },
    {
      "street": "819 Van Ness Ave",
      "city": "San Francisco"
    }
  ]
}

you can store them in separate records like this:

{
  "name": "Burger King",
  "street": "1298 Howard St",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.775551,
    "lng": -122.413132
  }
},
{
  "name": "Burger King",
  "street": "1200 Market St",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.778643,
    "lng": -122.415278
  }
},
{
  "name": "Burger King",
  "street": "819 Van Ness Ave",
  "city": "San Francisco",
  "_geoloc" : {
    "lat": 37.783091,
    "lng": -122.421310
  }
}

Having one record per restaurant location improves relevance and is easier to use (both at search and display time). For instance, for the query burger king market, you don’t have to write any post-processing to identify the best match: you directly get the right record, the one on Market St.
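As a minimal sketch of this restructuring, the Python snippet below splits one record holding several addresses into one record per location. The helper name split_by_address is hypothetical and is not part of any Algolia client; geolocation data is left out for brevity.

```python
# Hypothetical helper (not part of an Algolia client): split one restaurant
# record holding several addresses into one record per location.
def split_by_address(restaurant):
    records = []
    for address in restaurant["addresses"]:
        record = {"name": restaurant["name"]}
        record.update(address)  # copies street and city into the new record
        records.append(record)
    return records


burger_king = {
    "name": "Burger King",
    "addresses": [
        {"street": "1298 Howard St", "city": "San Francisco"},
        {"street": "1200 Market St", "city": "San Francisco"},
        {"street": "819 Van Ness Ave", "city": "San Francisco"},
    ],
}

location_records = split_by_address(burger_king)
print(len(location_records))          # 3
print(location_records[1]["street"])  # 1200 Market St
```

Each resulting record can then be indexed on its own, so a query matches a specific location rather than the restaurant as a whole.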

Let’s say you now want to maximize the diversity of the search results and only want to see one Burger King restaurant in the results (the closest to the user’s location, for instance). You can use the Distinct feature to remove duplicates by configuring the attribute to use for de-duplication: in our case, the name attribute. You can set attributeForDistinct to the restaurant name in your dashboard, at the bottom of the Display tab, in the Group by section.

To dynamically enable the de-duplication at search time, set distinct=true as a query parameter.
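To make the effect of distinct=true concrete, here is a small local simulation in Python. It is only an illustration of the observable behavior, not the engine's implementation: it assumes the hits are already sorted by relevance (best first) and keeps the first hit for each value of the de-duplication attribute. The function name deduplicate and the second restaurant are hypothetical.

```python
def deduplicate(ranked_hits, attribute):
    """Keep only the first (best-ranked) hit for each value of `attribute`."""
    seen = set()
    kept = []
    for hit in ranked_hits:
        value = hit[attribute]
        if value not in seen:
            seen.add(value)
            kept.append(hit)
    return kept


# Hits assumed to be already ranked, for example closest location first.
ranked_hits = [
    {"name": "Burger King", "street": "1200 Market St"},
    {"name": "Burger King", "street": "1298 Howard St"},
    {"name": "Example Diner", "street": "1 Example St"},  # hypothetical record
    {"name": "Burger King", "street": "819 Van Ness Ave"},
]

unique_hits = deduplicate(ranked_hits, "name")
print([hit["street"] for hit in unique_hits])
# ['1200 Market St', '1 Example St']
```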

Distinct for grouping

Let’s imagine you are developing a job search with companies and job offers. A company can have several job offers, and you want to display companies as search results with their associated (relevant) job offers below them. For example, for the software engineer query, you would like to display the results like this:

  • Twilio
    • Principal Software Engineer - Cloud Platform
    • Software Engineer - Financial Tools
    • Software Engineer in Test - API
  • Algolia
    • Senior Ruby on Rails Engineer
    • Full-stack engineer
    • Frontend engineer

In this case you want to group the results by companies. In order to do that you can use the Distinct feature to keep the X best results per company. Just set the attributeForDistinct to be the company name in your dashboard at the bottom of the Display tab, in the Group by section.

You can then specify at query time the maximum number of offers to keep per company with the distinct=X query parameter.

For example, for distinct=3, you will have the following results:

  • the first hit is the best offer (let’s say it is an offer from company X)
  • then the next-best offer from company X matching the query
  • then the third-best offer from company X matching the query
  • then the best offer from the next company (company Y)
  • then the next-best offer from company Y matching the query

The hitsPerPage query parameter controls the number of companies per page. For example if you specify hitsPerPage=10 with distinct=3, you will have up to 30 hits in your search results (10 companies and a maximum of three offers per company). This behavior makes it easy to implement pagination with grouping.
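The interaction between distinct=N and hitsPerPage can be sketched locally in Python. This is an illustration under stated assumptions, not the engine's implementation: hits are assumed to be already ranked, groups are ordered by their best hit, and the function name group_and_paginate and the offer labels are hypothetical.

```python
def group_and_paginate(ranked_hits, attribute, distinct, hits_per_page, page=0):
    """Keep up to `distinct` hits per group and `hits_per_page` groups per page."""
    groups = {}  # dicts preserve insertion order: groups follow their best hit
    for hit in ranked_hits:
        bucket = groups.setdefault(hit[attribute], [])
        if len(bucket) < distinct:
            bucket.append(hit)
    group_list = list(groups.values())
    start = page * hits_per_page
    page_groups = group_list[start:start + hits_per_page]
    # Flatten: all kept hits of a group appear together, best group first.
    return [hit for bucket in page_groups for hit in bucket]


# Offers assumed to be already ranked by relevance (best first).
ranked_offers = [
    {"company": "X", "offer": "X-1"},
    {"company": "Y", "offer": "Y-1"},
    {"company": "X", "offer": "X-2"},
    {"company": "X", "offer": "X-3"},
    {"company": "Y", "offer": "Y-2"},
    {"company": "X", "offer": "X-4"},
]

first_page = group_and_paginate(ranked_offers, "company", distinct=3, hits_per_page=10)
print([hit["offer"] for hit in first_page])
# ['X-1', 'X-2', 'X-3', 'Y-1', 'Y-2']

second_page = group_and_paginate(ranked_offers, "company", distinct=3, hits_per_page=1, page=1)
print([hit["offer"] for hit in second_page])
# ['Y-1', 'Y-2']
```

Because pages are counted in groups rather than in individual hits, a company's offers are never split across two pages.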

Distinct to index big records

As mentioned on our pricing page, we limit records to a maximum size of 10KB of JSON. This limit can be increased for specific use cases, but we don’t recommend it because there is a better way to index large documents containing a big chunk of text, such as a large PDF file or this web page.

To improve the overall relevance of your search, we recommend splitting big chunks of text into several pieces and creating one record for each of them. For example, a 100-page document that matches the first word of a two-word query in a paragraph on page 10 and the second word in a paragraph on page 90 is not really relevant, so we want the engine to avoid matching it.

On a web page, you can have one record per paragraph, associated with the title of the page and all the h1, h2 and h3 tags that precede the paragraph. For example, this paragraph could be indexed as:

{
  "url": "https://www.algolia.com/doc",
  "title": "Algolia Documentation for Ruby",
  "h1": "Distinct",
  "anchor": "#distinct",
  "paragraph": "....",
  "objectID": "https://www.algolia.com/doc_42"
}

Don’t forget to generate a different objectID for each paragraph/record. This can be done by suffixing the URL with the paragraph number.
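A minimal Python sketch of this splitting step is shown below. The helper name paragraph_records and the sample paragraphs are hypothetical; the only point is that each record repeats the page-level attributes and gets its own objectID built from the URL and the paragraph number.

```python
def paragraph_records(url, title, h1, anchor, paragraphs):
    """Build one record per paragraph, each with a unique objectID."""
    records = []
    for number, text in enumerate(paragraphs):
        records.append({
            "url": url,
            "title": title,
            "h1": h1,
            "anchor": anchor,
            "paragraph": text,
            # URL suffixed with the paragraph number, as described above.
            "objectID": f"{url}_{number}",
        })
    return records


page_records = paragraph_records(
    url="https://www.algolia.com/doc",
    title="Algolia Documentation for Ruby",
    h1="Distinct",
    anchor="#distinct",
    paragraphs=["First paragraph.", "Second paragraph."],  # placeholder text
)
print([record["objectID"] for record in page_records])
# ['https://www.algolia.com/doc_0', 'https://www.algolia.com/doc_1']
```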

The drawback of this approach is that you might get a large number of “duplicates”. For example, if you search for the title of the page, all of its records will match. To avoid this problem, you can use our “distinct” feature. It can be configured in the settings of your index, where you select which attribute is used for distinct. In our example, we would use "url" as the attribute for distinct. Then, at query time, you can pass distinct=true as a query parameter to enable the de-duplication. If several hits have the same value for the "url" attribute, only the best one is kept and the others are removed.

EXAMPLE:

Let’s go back to the web page indexed paragraph by paragraph. It would also be useful to index the titles of pages and sections, and to give sections more importance in the search. You can easily get this behavior by having several types of records:

  • The title of the page with the URL and some meta, type attribute set to 1.
  • All H1 of the page (contains title and the content of the h1), type attribute set to 2.
  • All H2 of the page (contains the title, the content of the h1 parent and the h2), type attribute set to 3.
  • All H3 of the page (contains the title, the content of the h1-h2 parents and the h3), type attribute set to 4.
  • All H4 of the page (contains the title, the content of the h1-h3 parents and the h4), type attribute set to 5.
  • All H5 of the page (contains the title, the content of the h1-h4 parents and the h5), type attribute set to 6.
  • All paragraphs of the page (contains the title, the content of the h1-h5 parents and the content of the paragraph), type attribute set to 7.

Finally, you can set the customRanking attribute to asc(type) in your index settings to have the type of record taken into account in the result.
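The effect of asc(type) can be illustrated locally in Python. This sketch assumes the hits are otherwise tied on the engine's other ranking criteria, so that only the custom ranking decides the order; the mapping name RECORD_TYPES and the helper make_record are hypothetical.

```python
# Type values follow the list above: lower values should rank first.
RECORD_TYPES = {"title": 1, "h1": 2, "h2": 3, "h3": 4, "h4": 5, "h5": 6, "paragraph": 7}


def make_record(kind, content):
    """Hypothetical helper: build a record carrying its numeric type."""
    return {"kind": kind, "content": content, "type": RECORD_TYPES[kind]}


# Hits assumed to be tied on every other ranking criterion.
tied_hits = [
    make_record("paragraph", "..."),
    make_record("h2", "..."),
    make_record("title", "..."),
]

# asc(type): ascending sort on the type attribute.
ranked = sorted(tied_hits, key=lambda record: record["type"])
print([record["kind"] for record in ranked])
# ['title', 'h2', 'paragraph']
```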

Example: De-duplicating variants of products

Check out the live demo or the source code

To illustrate further the distinct feature, we’ll take the example of an e-commerce website selling multiple models of t-shirts and sweatshirts. Each model is available in a range of colors.

The simplest solution would be to have one record per model of clothing and ignore the colors. The issue is that when someone types t-shirt red, you would find all the t-shirt models that have at least one red variant and display them, but the thumbnails shown wouldn’t necessarily be red.

Let’s see how to index your data so that when someone types t-shirt red, the search page contains only products with a red thumbnail.

Formatting our data

We’re using a sample dataset with two models of t-shirts (A & B) and two models of sweatshirts (C & D). The four models are available in multiple colors (Red, Green, Blue, Orange…).

We create one record for each color variant of each model. Each record has an attribute specifying its model (A, B, C or D).

The records also have an attribute number_of_likes that will be used in the Custom Ranking setting. Here’s what our records look like:

[
  {
    "objectID": "sweatshirt-C-blue",
    "name": "sweatshirt",
    "model": "C",
    "color": "blue",
    "image_slug": "sweatshirt-C-blue.png",
    "number_of_likes": 910
  },
  {
    "objectID": "sweatshirt-C-red",
    "name": "sweatshirt",
    "model": "C",
    "color": "red",
    "image_slug": "sweatshirt-C-red.png",
    "number_of_likes": 515
  },
  // [...]
]
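Records of this shape could be generated with a small Python helper like the sketch below. The function name variant_record is hypothetical, and the naming scheme for objectID and image_slug is simply inferred from the two sample records above.

```python
def variant_record(name, model, color, number_of_likes):
    """Build one record for a single color variant of a model."""
    slug = f"{name}-{model}-{color}"
    return {
        "objectID": slug,            # unique per variant
        "name": name,
        "model": model,              # attribute later used for distinct
        "color": color,
        "image_slug": f"{slug}.png",
        "number_of_likes": number_of_likes,
    }


records = [
    variant_record("sweatshirt", "C", "blue", 910),
    variant_record("sweatshirt", "C", "red", 515),
]
print(records[0]["objectID"])    # sweatshirt-C-blue
print(records[1]["image_slug"])  # sweatshirt-C-red.png
```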

Index configuration

To get the best relevance, we will configure the index settings. The most important settings are the searchable attributes (searchableAttributes) and the attributes reflecting record popularity (customRanking). We will also enable the distinct feature to de-duplicate the color variants.

Searchable Attributes

We want to be able to search on the model, name and color attributes of our records.

Ruby:
Algolia.init_index('distinct_tutorial').set_settings({"searchableAttributes"=>["model", "name", "color"]})

PHP:
<?php
$client->initIndex("distinct_tutorial")->setSettings(array("searchableAttributes" => array("model", "name", "color")));

JavaScript:
client.initIndex('distinct_tutorial').setSettings({"searchableAttributes":["model", "name", "color"]});

Python:
client.init_index('distinct_tutorial').set_settings({"searchableAttributes":["model", "name", "color"]})

Java:
client.initIndex("distinct_tutorial").setSettings(new IndexSettings().setSearchableAttributes(Arrays.asList("model", "name", "color")));

Scala:
val result: Future[Task] = client.execute {
  changeSettings of "distinct_tutorial" `with` IndexSettings(
    searchableAttributes = Some(Seq(SearchableAttributes.attributes("model", "name", "color")))
  )
}

Go:
res, err := client.InitIndex("distinct_tutorial").SetSettings(algoliasearch.Map{
	"searchableAttributes": []string{"model", "name", "color"},
})

Custom Ranking

The number_of_likes attribute in our objects is a good indicator of the popularity of the products. We’ll use it in the Custom Ranking to improve the relevance by using this business metric.

Ruby:
Algolia.init_index('distinct_tutorial').set_settings({"customRanking"=>["desc(number_of_likes)"]})

PHP:
<?php
$client->initIndex("distinct_tutorial")->setSettings(array("customRanking" => array("desc(number_of_likes)")));

JavaScript:
client.initIndex('distinct_tutorial').setSettings({"customRanking":["desc(number_of_likes)"]});

Python:
client.init_index('distinct_tutorial').set_settings({"customRanking":["desc(number_of_likes)"]})

Java:
client.initIndex("distinct_tutorial").setSettings(new IndexSettings().setCustomRanking(Arrays.asList("desc(number_of_likes)")));

Scala:
val result: Future[Task] = client.execute {
  changeSettings of "distinct_tutorial" `with` IndexSettings(
    customRanking = Some(Seq(
      CustomRanking.desc("number_of_likes")
    ))
  )
}

Go:
res, err := client.InitIndex("distinct_tutorial").SetSettings(algoliasearch.Map{
	"customRanking": []string{"desc(number_of_likes)"},
})

Distinct

The distinct feature is used to limit the number of results having the same value on a specific attribute. For example, if you apply it to the attribute color, you’ll only retrieve one blue product, one red product, one orange product… in the search results.

In our case, we want to de-duplicate the model of the product (have only 1 result for each model A, B, C, or D). So we need to declare the attribute model in attributeForDistinct:

Ruby:
Algolia.init_index('distinct_tutorial').set_settings({"attributeForDistinct"=>"model"})

PHP:
<?php
$client->initIndex("distinct_tutorial")->setSettings(array("attributeForDistinct" => "model"));

JavaScript:
client.initIndex('distinct_tutorial').setSettings({"attributeForDistinct":"model"});

Python:
client.init_index('distinct_tutorial').set_settings({"attributeForDistinct":"model"})

Java:
client.initIndex("distinct_tutorial").setSettings(new IndexSettings().setAttributeForDistinct("model"));

Scala:
val result: Future[Task] = client.execute {
  changeSettings of "distinct_tutorial" `with` IndexSettings(
    attributeForDistinct = Some("model")
  )
}

Go:
res, err := client.InitIndex("distinct_tutorial").SetSettings(algoliasearch.Map{
	"attributeForDistinct": "model",
})
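The three settings can also be gathered in a single dictionary and sent in one call. The sketch below only builds the dictionary so that it runs without credentials; the final call is left as a comment and assumes an already initialized client object like the one in the Python snippets above.

```python
# All index settings from this tutorial, gathered in one dictionary.
settings = {
    "searchableAttributes": ["model", "name", "color"],
    "customRanking": ["desc(number_of_likes)"],
    "attributeForDistinct": "model",
}

# Assumption: `client` is an initialized Algolia client, as in the Python
# snippets above. Not executed here because it needs real credentials.
# client.init_index('distinct_tutorial').set_settings(settings)

print(sorted(settings))
# ['attributeForDistinct', 'customRanking', 'searchableAttributes']
```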

Using Distinct

To test our search, we built a basic search page powered by Algolia. If you haven’t implemented your search page yet, you can find the source code on GitHub. Let’s now use distinct.

Warning: If distinct is enabled, the number of hits and the facet counts returned by Algolia no longer correspond to the number of results you see. We recommend NOT displaying these values on your page.

How it works

For Distinct to work, you need to have configured attributeForDistinct (see above).

distinct is a search parameter that accepts either a boolean or an integer. You can set distinct by doing:

JavaScript:
index.search('query_string', { distinct: true }, searchCallback);

JavaScript (helper):
helper.setQueryParameter('distinct', true);

Swift:
let query = Query()
query.distinct = 1
index.search(query) { (content, error) in
    // [...]
}

Objective-C:
Query* query = [Query new];
query.distinct = [NSNumber numberWithInt:1];
[index search:query completionHandler:^(NSDictionary* content, NSError* error) {
    // [...]
}];

Android (Java):
Query query = new Query();
query.setDistinct(1);
index.searchAsync(query, new CompletionHandler() {
    @Override
    public void requestCompleted(JSONObject content, AlgoliaException error) {
        // [...]
    }
});

It is used to de-duplicate objects having the same value in the attribute set in attributeForDistinct (in our case, the attribute model):

  • false, or 0 will disable the distinct feature and make a regular search
  • true, or 1 will return only one result per model of product
  • 2 will return two results per model of product
  • 3 will return three results per model of product

Distinct = true

[Figure: Results with distinct = true]

When distinct is set to true, we get one color for each model. The variant that has been selected is the one with the best value in its Custom Ranking (here, the number of likes).

Distinct = 2

[Figure: Results with distinct = 2]

With distinct set to 2, we get the best two variants for each model, ranked by Custom Ranking.

Distinct = 3 and query = “tshirt”

[Figure: Results with distinct = 3 and query = "tshirt"]

And of course, we can combine distinct and all the other regular search parameters, like a query or a facet.
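The combination of a query, Custom Ranking and distinct can be illustrated with a deliberately naive local simulation in Python. This is not how the engine works internally: matching here is exact word matching only (no prefix or typo handling), the like counts for the t-shirts are made-up sample values, and the function name search is hypothetical.

```python
# Sample variants; like counts for the t-shirts are made-up values.
variants = [
    {"name": "t-shirt", "model": "A", "color": "red", "number_of_likes": 120},
    {"name": "t-shirt", "model": "A", "color": "blue", "number_of_likes": 300},
    {"name": "t-shirt", "model": "B", "color": "red", "number_of_likes": 80},
    {"name": "t-shirt", "model": "B", "color": "green", "number_of_likes": 95},
    {"name": "sweatshirt", "model": "C", "color": "red", "number_of_likes": 515},
]


def search(records, query, distinct_attribute):
    """Naive simulation: exact word match, sort by likes, one hit per group."""
    words = query.lower().split()
    matching = [
        r for r in records
        if all(w in (r["name"], r["color"], r["model"].lower()) for w in words)
    ]
    # Custom Ranking: desc(number_of_likes).
    matching.sort(key=lambda r: r["number_of_likes"], reverse=True)
    # distinct = true: keep the best-ranked variant of each group.
    best_per_group = {}
    for r in matching:
        best_per_group.setdefault(r[distinct_attribute], r)
    return list(best_per_group.values())


results = search(variants, "t-shirt red", "model")
print([(r["model"], r["color"]) for r in results])
# [('A', 'red'), ('B', 'red')]
```

Because the color is matched on each variant record before de-duplication, the hit kept for each model is a red one, even though model A's blue variant has more likes.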

Conclusion

Distinct is a useful feature to de-duplicate similar results. In our case, it allows us to:

  1. Make sure that one popular model of t-shirt won’t be overrepresented on the results page and hide all other products. We’ll only display the most popular variants of each model.
  2. Retrieve the right thumbnail for each product when someone types a color in the search input (the query tshirt red will retrieve only the red variants of each model, with the right thumbnail).