Formatting Your Data

Introduction

Algolia works well with semi-structured data, and in most cases you can index it as is. As your data and needs become more complex, though, formatting the data up front helps you get the most out of Algolia. This page covers a few considerations.

Organizing Your Indices

Should you put everything in the same index, or use different indices for different types of data? In nearly all situations, if two items you want to search on would go into different tables in a database, they should go into different indices.

For example, if you are building an autocomplete dropdown that searches into movies and actors, you’ll need two indices: one for movies and one for actors.

Even if a single search experience queries several types of data at once, you are better off keeping them in separate indices. That way each index can have its own settings and ranking strategy.
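As a rough illustration of the advice above, records exported from a database can be grouped by type before they are pushed, so that each group ends up in its own index. The sketch below is plain Python and makes no Algolia client calls; the `type` field and the index names `movies` and `actors` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical mapping from a record's type to the index that should hold it.
INDEX_FOR_TYPE = {"movie": "movies", "actor": "actors"}


def group_by_index(records):
    """Group records by target index name, dropping the helper "type" field."""
    batches = defaultdict(list)
    for record in records:
        index_name = INDEX_FOR_TYPE[record["type"]]
        # Copy everything except "type" so the input records are left untouched.
        batches[index_name].append({k: v for k, v in record.items() if k != "type"})
    return dict(batches)


records = [
    {"type": "movie", "title": "Heat"},
    {"type": "actor", "name": "Al Pacino"},
    {"type": "movie", "title": "Ronin"},
]
batches = group_by_index(records)
```

Each value in `batches` can then be sent to the index named by its key, with settings tuned per index.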

Organizing Your Records

Accepted formats

Algolia is schemaless and can index attributes in any of the following formats:

  • string: "foo"
  • integer / float: 12 or 12.34
  • boolean: true
  • nested objects: { "sub": { "a1": "v1", "a2": "v2" } }
  • array (of strings, integers, floats, booleans, nested objects): ["foo", "bar"]
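A small pre-flight check can catch values that fall outside the formats listed above before you send them. This is a minimal sketch, not part of any Algolia client: `is_indexable` is a hypothetical helper, and it conservatively rejects anything the list does not mention (for example `None` or arrays nested directly inside arrays).

```python
def is_indexable(value):
    """Return True if value uses only the attribute formats listed above."""
    # bool is a subclass of int in Python, so booleans are covered by int.
    if isinstance(value, (str, int, float)):
        return True
    if isinstance(value, dict):
        # Nested object: keys must be strings, values are checked recursively.
        return all(isinstance(k, str) and is_indexable(v) for k, v in value.items())
    if isinstance(value, list):
        # Arrays may hold scalars or nested objects; arrays of arrays are not listed.
        return all(not isinstance(item, list) and is_indexable(item) for item in value)
    return False


record = {
    "title": "Breaking Bad",
    "like_count": 978,
    "avg_rating": 1.23456,
    "featured": True,
    "actors": [{"name": "Walter White", "portrayed_by": "Bryan Cranston"}],
}
ok = is_indexable(record)
```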

objectID

Every object is uniquely identified by an objectID. If you don’t provide one, we’ll generate one automatically, but it’ll be easier for you to remove or update the record if you have stored its unique identifier in the objectID attribute.

Date format

If you want to filter or sort by date, store the date as a Unix timestamp (e.g. 1435735848). For now, this is the only date format we understand.
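If your source data stores dates as ISO 8601 strings, they can be converted before indexing. The following is a minimal sketch using only the Python standard library; the `updated_at` attribute name is hypothetical, and treating dates without an offset as UTC is an assumption you may need to change.

```python
from datetime import datetime, timezone


def to_unix_timestamp(iso_date):
    """Convert an ISO 8601 date string to a Unix timestamp in whole seconds."""
    parsed = datetime.fromisoformat(iso_date)
    if parsed.tzinfo is None:
        # Assumption: dates without an explicit offset are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


record = {
    "title": "Breaking Bad",
    # Hypothetical attribute holding the date as a Unix timestamp.
    "updated_at": to_unix_timestamp("2015-07-01T07:30:48+00:00"),
}
```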

Example record

{
  "objectID": 42,             // record identifier
  "title": "Breaking Bad",    // string attribute
  "episodes": [               // array of strings attribute
    "Crazy Handful of Nothin'",
    "Gray Matter"
  ],
  "like_count": 978,          // integer attribute
  "avg_rating": 1.23456,      // float attribute
  "featured": true,           // boolean attribute
  "actors": [                 // nested objects attribute
    {
      "name": "Walter White",
      "portrayed_by": "Bryan Cranston"
    },
    {
      "name": "Skyler White",
      "portrayed_by": "Anna Gunn"
    }
  ]
}

10 KB size limit

We limit the size of a record to 10 KB. If you send an object bigger than the limit, you’ll get the error message "Record is too big". There are three reasons for this limit:

  • Our pricing is based on it
  • In most cases, bigger objects are a sign that you’re not using Algolia to its full capacity. Very big chunks of text are usually bad for relevance: most objects end up containing many of the same words, so they match a lot of irrelevant queries. It’s often better to split big objects into several smaller ones
  • It increases latency: big objects with a lot of unnecessary information take longer to upload and download
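To spot oversized records before sending them, you can estimate each record's size locally. The sketch below is only an approximation: it assumes the limit is 10,000 bytes of compact UTF-8 JSON, which is an interpretation of the figure given above, and the `synopsis` attribute is hypothetical.

```python
import json

# Assumption: the limit is treated as 10,000 bytes of UTF-8 encoded JSON.
# Adjust this constant if the limit is measured differently.
MAX_RECORD_BYTES = 10_000


def record_size(record):
    """Approximate a record's size as the byte length of its compact JSON form."""
    encoded = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def oversized(records, limit=MAX_RECORD_BYTES):
    """Return the records whose approximate size exceeds the limit."""
    return [r for r in records if record_size(r) > limit]


small = {"show_name": "Sherlock", "popularity": 891}
large = {"show_name": "Sherlock", "synopsis": "x" * 20_000}
too_big = oversized([small, large])
```

Records returned by `oversized` are candidates for splitting into several smaller records.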

Indexing Relations

Because we support arrays in your JSON, you can technically index an object’s relations as an array of values. That said, it is usually better to index each element of the array as a separate record to get the best relevance.

Let’s see why, using a TV show search as an example.

Option 1: Index one object for each TV show that includes all episode names:

{
  "episodes": [
    "A Scandal in Belgravia",
    "The Reichenbach Fall",
    "A Study in Pink",
    "The Great Game",
    "The Hounds of Baskerville",
    "The Blind Banker",
    "Unaired Pilot",
    "The Empty Hearse",
    "The Sign of Three",
    "His Last Vow"
  ],
  "show_name": "Sherlock",
  "popularity": 891
}

Option 2: Index one object for each TV show and one object for each episode:

{
  "show_name": "Sherlock",
  "popularity": 891
},
{
  "episode_name": "A Scandal in Belgravia",
  "episode_of": "Sherlock",
  "popularity": 891
},
{
  "episode_name": "The Reichenbach Fall",
  "episode_of": "Sherlock",
  "popularity": 891
},
{
  "episode_name": "A Study in Pink",
  "episode_of": "Sherlock",
  "popularity": 891
},
...
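Going from the first layout to the second can be done with a short transformation. This is a minimal sketch in plain Python that mirrors the attribute names used in the two options above; `split_show` is a hypothetical helper, not part of any Algolia client.

```python
def split_show(show):
    """Turn one show with an episode list into a show record plus one record per episode."""
    show_record = {k: v for k, v in show.items() if k != "episodes"}
    episode_records = [
        {
            "episode_name": episode,
            "episode_of": show["show_name"],
            # Copy the show's popularity so episodes carry the same score.
            "popularity": show["popularity"],
        }
        for episode in show.get("episodes", [])
    ]
    return [show_record] + episode_records


sherlock = {
    "show_name": "Sherlock",
    "popularity": 891,
    "episodes": ["A Scandal in Belgravia", "The Reichenbach Fall", "A Study in Pink"],
}
records = split_show(sherlock)
```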

The second option is far better in terms of ranking. To see why, let’s run a query against a set of 10,000 TV shows.

Here are the first two results for the Game of Thrones query with the first option (one object per show that includes episode names):

{
  "show_name": "Game of Thrones",
  "popularity": 110000,
  "episodes": [
    "The Rains of Castamere",
    "Blackwater",
    ...
  ],
  ...
},
{
  "show_name": "Stargate SG-1",
  "popularity": 64460,
  "episodes": [
    "Children of the Gods",
    "Fair Game",
    "Romancing the Throne",
    ...
  ],
  ...
},
...

And here are the first two results for the Game of Thrones query with the second option (one object per show and one object per episode):

{
  "show_name": "Game of Thrones",
  "popularity": 110000,
  ...
},
{
  "episode_of": "Game of Thrones",
  "episode_name": "The Rains of Castamere",
  "popularity": 110000,
  ...
},
...

Note that with the first option, the first hit is great because it matches all query words in the show_name attribute, but the second result is poor: Stargate SG-1 was considered a valid hit because it contains three episodes that each match one of the query words. The second option does not have this problem and provides better results.
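The effect can be reproduced with a deliberately simplified model. The sketch below is not how the Algolia engine works: it assumes a toy rule where a record matches when every query word appears somewhere in it, and it crudely strips a trailing "s" so that "Thrones" and "Throne" compare equal. All function names are hypothetical.

```python
import re


def words(text):
    """Lowercase, split on non-alphanumerics, and crudely strip a trailing 's'."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens}


def record_words(record):
    """Collect the normalized words of every string value in a flat record."""
    found = set()
    for value in record.values():
        values = value if isinstance(value, list) else [value]
        for item in values:
            if isinstance(item, str):
                found |= words(item)
    return found


def matches(record, query):
    """Toy rule: a record matches when it contains every query word somewhere."""
    return words(query) <= record_words(record)


episodes = ["Children of the Gods", "Fair Game", "Romancing the Throne"]

# Option 1: one big record per show.
stargate_option1 = {"show_name": "Stargate SG-1", "episodes": episodes}

# Option 2: one record per show and one per episode.
stargate_option2 = [{"show_name": "Stargate SG-1"}] + [
    {"episode_name": name, "episode_of": "Stargate SG-1"} for name in episodes
]

big_record_matches = matches(stargate_option1, "Game of Thrones")
small_records_match = any(matches(r, "Game of Thrones") for r in stargate_option2)
```

Under this toy rule the single large Stargate SG-1 record matches the query, because the three query words are spread across three episode titles, while none of the smaller records does.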

To conclude, here are three important tips for indexing your objects:

  • It is better to have several small objects than one big one. This reduces the likelihood of returning an irrelevant result.

  • When sharing information between several objects, it is better to use a different name for each attribute. This lets you use attributes to order matches by importance. For example, in the second option we set the searchableAttributes index setting (formerly named attributesToIndex), which lists attributes in decreasing order of importance, as follows:

Ruby:
index.set_settings({"searchableAttributes" => ["show_name", "episode_name", "episode_of"]})

Python:
index.set_settings({"searchableAttributes": ["show_name", "episode_name", "episode_of"]})

JavaScript:
index.setSettings({searchableAttributes: ['show_name', 'episode_name', 'episode_of']});

PHP:
<?php
$index->setSettings(array("searchableAttributes" => array("show_name", "episode_name", "episode_of")));

Symfony:
<?php
/**
 *
 * @ORM\Entity
 *
 * @Algolia\Index(
 *     searchableAttributes = {"show_name", "episode_name", "episode_of"}
 * )
 *
 */
class TVShow
{
}

Java:
index.setSettings(new IndexSettings()
      .searchableAttributes(Arrays.asList("show_name", "episode_name", "episode_of")));

Go:
settings := make(map[string]interface{})
settings["searchableAttributes"] = []string{"show_name", "episode_name", "episode_of"}
index.SetSettings(settings)

curl:
curl --header 'X-Algolia-API-Key: YourAPIKey' \
     --header 'X-Algolia-Application-Id: YourApplicationID' \
     --data-binary '{"searchableAttributes": ["show_name", "episode_name", "episode_of"]}' \
     --request PUT https://APP_ID.algolia.net/1/indexes/YourIndexName/settings

C#:
index.SetSettings(JObject.Parse(@"{""searchableAttributes"":[""show_name"", ""episode_name"", ""episode_of""]}"));

Scala:
client.execute {
  changeSettings of "myIndex" `with` IndexSettings(
    searchableAttributes = Some(Seq(SearchableAttributes.attribute("show_name"), SearchableAttributes.attribute("episode_name"), SearchableAttributes.attribute("episode_of")))
  )
}

As a result, the query Game of Thrones the rains of castamere retrieves the hit matching both show name and episode name while still ensuring that shows are always returned before episodes when the query contains only the show name.

  • Finally, to refine the ranking further, you can use the customRanking index setting to take the popularity of hits into account. In this case we used our popularity attribute:

Ruby:
index.set_settings({"customRanking" => ["desc(popularity)"]})

Python:
index.set_settings({"customRanking": ["desc(popularity)"]})

JavaScript:
index.setSettings({customRanking: ['desc(popularity)']});

PHP:
<?php
$index->setSettings(array("customRanking" => array("desc(popularity)")));

Symfony:
<?php
/**
 *
 * @ORM\Entity
 *
 * @Algolia\Index(
 *     customRanking = {"desc(popularity)"}
 * )
 *
 */
class TVShow
{
}

Java:
index.setSettings(new IndexSettings()
      .setCustomRanking(Arrays.asList("desc(popularity)")));

Go:
settings := make(map[string]interface{})
settings["customRanking"] = []string{"desc(popularity)"}
index.SetSettings(settings)

curl:
curl --header 'X-Algolia-API-Key: YourAPIKey' \
     --header 'X-Algolia-Application-Id: YourApplicationID' \
     --data-binary '{"customRanking": ["desc(popularity)"]}' \
     --request PUT https://APP_ID.algolia.net/1/indexes/YourIndexName/settings

C#:
index.SetSettings(JObject.Parse(@"{""customRanking"":[""desc(popularity)""]}"));

Scala:
client.execute {
  changeSettings of "myIndex" `with` IndexSettings(
    customRanking = Some(Seq(CustomRanking.desc("popularity")))
  )
}
