01 Jun 2018

Structure/Format your Data

Structuring your Data into Searchable Indexes

Organizing Your Indices

Even though Algolia is schemaless, we highly recommend that each index contain only one dataset. In nearly all situations, if two items you want to search on go into different tables in a database, they’ll go into different indices.

For example, if you are building an autocomplete dropdown that searches into movies and actors, you’ll need two indices: one for movies and one for actors. This approach allows each index to have its own settings and ranking strategy. For example, movies might be ranked by rating, while actors would be ranked according to a different popularity metric.

However, if you need results to be returned intermingled (that is, ranked against each other), you will need to store all of the records in one index.

Indexing Relations

You can technically index the relations of objects with an array of values, as we support indexing of arrays in your JSON. That said, it is usually better to index each element of the array in a separate record, as this will provide the best relevance.

For example, consider the case of a search where users can query by book title or chapter title.

If you index a book with its chapter information, this record might look like this:

{
  "book_name": "Harry Potter and the Philosopher's Stone",
  "popularity": 1000,
  "chapter_titles": [
    "The Boy Who Lived",
    "The Vanishing Glass",
    ...
  ]
}

However, if you split this record into book records and chapter records, your data set will look like this:

[
  {
    "book_name": "Harry Potter and the Philosopher's Stone",
    "popularity": 1000
  },
  {
    "book_of": "Harry Potter and the Philosopher's Stone",
    "chapter_name": "The Boy Who Lived",
    "popularity": 900
  },
  {
    "book_of": "Harry Potter and the Philosopher's Stone",
    "chapter_name": "The Vanishing Glass",
    "popularity": 800
  }
]

By re-structuring your data this way, you can take advantage of a few Algolia settings:

  • By placing book_name higher in the searchable attributes list than chapter_name or book_of, you can ensure the parent book will rank higher than any of its chapters.
  • By breaking up the chapters into their own records, you can add more granular popularity attributes to each record to ensure the most relevant chapters are surfaced, assuming popularity is added as a custom ranking attribute.
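The restructuring described above can be sketched in a few lines of Python. This is a minimal illustration, not an official helper: the `split_book` function, the `chapters` input shape, and the per-chapter `popularity` values are assumptions made for the example. The settings dictionary uses the `searchableAttributes` and `customRanking` settings; the relative order of `chapter_name` and `book_of` is an arbitrary choice, since the text only requires `book_name` to come first.

```python
def split_book(book: dict) -> list[dict]:
    """Return one book record plus one record per chapter.

    Assumes (hypothetically) that the input looks like:
    {"book_name": str, "popularity": int,
     "chapters": [{"title": str, "popularity": int}, ...]}
    """
    records = [{"book_name": book["book_name"], "popularity": book["popularity"]}]
    for chapter in book["chapters"]:
        records.append({
            "book_of": book["book_name"],          # link back to the parent book
            "chapter_name": chapter["title"],
            "popularity": chapter["popularity"],   # per-chapter ranking signal
        })
    return records


# Settings matching the two points above: book_name is listed first so a
# match on the book title outranks a match on a chapter, and popularity
# is used as a custom ranking attribute.
book_index_settings = {
    "searchableAttributes": ["book_name", "chapter_name", "book_of"],
    "customRanking": ["desc(popularity)"],
}

book = {
    "book_name": "Harry Potter and the Philosopher's Stone",
    "popularity": 1000,
    "chapters": [
        {"title": "The Boy Who Lived", "popularity": 900},
        {"title": "The Vanishing Glass", "popularity": 800},
    ],
}
records = split_book(book)
```

The resulting `records` list has the same shape as the split data set shown earlier and could be pushed to an index together with `book_index_settings` using the API client of your choice.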

Indexing Long Documents

In order to index long documents with Algolia, you will need to split up the document into smaller records. Record size is limited for performance reasons, meaning each new “chunk” should realistically only be a paragraph or two. It’s also important to note that this approach will result in some redundancy of data.

Consider the case of a long document with a title, description, and 5 paragraphs. If you choose to split by paragraph, this document will correspond to 5 records in Algolia. Here’s a quick example of what this might look like:

[
  {
    "title": "Lorem Ipsum",
    "description": "Donec molestie nisl vel sem ultrices laoreet",
    "content": "Suspendisse eget dictum neque, id dapibus ligula. Nullam commodo a nunc sit amet tincidunt.",
    "popularity": 1000
  },
  {
    "title": "Lorem Ipsum",
    "description": "Donec molestie nisl vel sem ultrices laoreet",
    "content": "Morbi mattis malesuada lacus in interdum. Phasellus tempor vel dui eu sodales.",
    "popularity": 1000
  },
  ...
]

In order to ensure that users only see the best matching record when searching, we will need to leverage the distinct parameter. This will allow us to de-duplicate based on a common “key” (in this case, the document’s title). For more information on utilizing distinct, read over the distinct concept.

Additionally, we should set our searchable attributes such that title is most important, followed by description, and then content. This will ensure that a match in the title is ranked higher than a match within the content of a record. However, because the records are split, if a match does occur in the content we can easily display the matching segment highlighted in a snippet.
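As a rough sketch of the approach described in this section, the following Python splits a document into one record per paragraph and pairs it with settings that de-duplicate on the title. The `split_document` function and its arguments are hypothetical names for the example. The settings use `attributeForDistinct`, `distinct`, `searchableAttributes`, and `attributesToSnippet`; the snippet length of 20 words is an arbitrary assumption.

```python
def split_document(title: str, description: str,
                   paragraphs: list[str], popularity: int) -> list[dict]:
    """Return one record per paragraph.

    The title, description, and popularity are repeated on every record,
    which is the data redundancy mentioned above.
    """
    return [
        {
            "title": title,
            "description": description,
            "content": paragraph,
            "popularity": popularity,
        }
        for paragraph in paragraphs
    ]


# Settings for the chunked index: de-duplicate on the shared title and
# rank title matches above description matches above content matches.
document_index_settings = {
    "attributeForDistinct": "title",
    "distinct": True,
    "searchableAttributes": ["title", "description", "content"],
    "attributesToSnippet": ["content:20"],  # 20 words is an arbitrary choice
}

chunks = split_document(
    title="Lorem Ipsum",
    description="Donec molestie nisl vel sem ultrices laoreet",
    paragraphs=[
        "Suspendisse eget dictum neque, id dapibus ligula.",
        "Morbi mattis malesuada lacus in interdum.",
    ],
    popularity=1000,
)
```

Using the title as the distinct key only works if titles are unique across documents; a dedicated document identifier attribute may be a safer key in practice.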

Formatting Considerations

There are a few cases to consider as data objects become more complex, along with requirements from the Algolia engine on record size, non-standard types, and unique identifiers.

Accepted datatypes for your Attributes

Records in Algolia are modeled with JSON, and are easy to configure with semi-structured data. Both requests to the Algolia API and the responses it returns are JSON objects.

Algolia is schemaless, and can index data attributes that have the following format:

  • string "foo"
  • integer / float 12 or 12.34
  • boolean true
  • nested objects { "sub": { "a1": "v1", "a2": "v2" } }
  • array (of string, integers, floats, booleans, nested objects) ["foo", "bar"]

Unique identifier - ObjectID

Every object is uniquely identified by an objectID.

If you don’t provide one, Algolia will generate one automatically. However, it will be easier to remove or update records if you have stored a unique identifier in the objectID attribute.

If your objects have unique IDs and you would like to use them to make future updates easier, you can specify the objectID in the records you push to Algolia. The value you provide for objectIDs can be an integer or a string.
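For instance, if your source rows already carry a primary key, you might copy it into objectID before pushing the records. The sketch below assumes a hypothetical `id` field on each row; the `with_object_id` helper is illustrative and not part of any Algolia client.

```python
def with_object_id(row: dict, id_field: str = "id") -> dict:
    """Return a copy of the row whose unique ID is stored under objectID."""
    record = {key: value for key, value in row.items() if key != id_field}
    record["objectID"] = row[id_field]  # may be an integer or a string
    return record


rows = [
    {"id": 42, "title": "Breaking Bad"},
    {"id": "movie-7", "title": "Gray Matter"},
]
records_with_ids = [with_object_id(row) for row in rows]
```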

Because the objectID attribute is used as a unique identifier for your objects, it is treated specially by Algolia:

  • It can be searched by declaring it in searchableAttributes.
  • It cannot be highlighted nor snippeted. If objectID is declared in attributesToHighlight or attributesToSnippet, it will be ignored.
  • It cannot be excluded from the results. If objectID is declared in unretrievableAttributes or omitted from attributesToRetrieve, it will still be returned.
  • It can be used as a facet filter.
  • But it cannot be faceted. If objectID is declared in attributesForFaceting, it will be ignored. (Faceting on a unique identifier makes little sense anyway, since every facet count would be equal to one.)

Dates

Date attributes should be formatted as Unix timestamps (e.g. 1435735848) if you want to filter or sort by date. By default, the Algolia engine doesn’t interpret strings following the ISO date format, so you must convert your dates into numeric values.
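One way to do this conversion before indexing is shown below, using only the Python standard library. The `to_unix_timestamp` helper is a hypothetical name, and treating dates without a UTC offset as UTC is an assumption you may need to change for your data.

```python
from datetime import datetime, timezone


def to_unix_timestamp(iso_date: str) -> int:
    """Convert an ISO 8601 date string into a Unix timestamp in seconds."""
    parsed = datetime.fromisoformat(iso_date)
    if parsed.tzinfo is None:
        # Assumption: dates without an explicit offset are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


timestamp = to_unix_timestamp("2015-07-01T07:30:48+00:00")  # 1435735848
```

The resulting integer can be stored in a numeric attribute and used for filtering or sorting.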

Size Limit

Algolia limits the size of a record for performance reasons. The limits depend on your plan - see our pricing page. If a record is larger than the limit, Algolia will return the error message Record is too big. Algolia has techniques that can help you reformat and also break up your larger records into smaller ones, or remove duplicate data from your results using distinct. For more information, read over the distinct concept.
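To catch oversized records before sending them, you could estimate each record's size locally. The sketch below measures the UTF-8 length of the compact JSON serialization, which is only an approximation: how Algolia computes record size is not specified here, and the limit passed to `oversized` is a placeholder that you would replace with the value for your plan.

```python
import json


def record_size_bytes(record: dict) -> int:
    """Approximate a record's size as the byte length of its compact JSON."""
    serialized = json.dumps(record, ensure_ascii=False, separators=(",", ":"))
    return len(serialized.encode("utf-8"))


def oversized(records: list[dict], limit_bytes: int) -> list[dict]:
    """Return the records whose approximate size exceeds limit_bytes."""
    return [record for record in records if record_size_bytes(record) > limit_bytes]


sample = [
    {"title": "short"},
    {"title": "long", "content": "x" * 500},
]
too_big = oversized(sample, limit_bytes=100)  # 100 is a placeholder limit
```

Records flagged this way are candidates for splitting into smaller chunks, as described in the section on long documents.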

Example Record

{
  "objectID": 42,             // record identifier
  "title": "Breaking Bad",    // string attribute
  "episodes": [               // array of strings attribute
    "Crazy Handful of Nothin'",
    "Gray Matter"
  ],
  "like_count": 978,          // integer attribute
  "avg_rating": 1.23456,      // float attribute
  "air_date": 1356846157,     // date as a timestamp attribute
  "featured": true,           // boolean attribute
  "actors": [                 // nested objects attribute
    {
      "name": "Walter White",
      "portrayed_by": "Bryan Cranston"
    },
    {
      "name": "Skyler White",
      "portrayed_by": "Anna Gunn"
    }
  ]
}

Indexing of HTML tags

In order to keep an optimal relevancy, we made the choice to exclude HTML/XML tags, and their attributes, from data being indexed and searchable.

In the following example, users will be able to search for any word in the description except the link tag itself. This means that the following record will not be returned for a query of “href”, “target” or “_blank”.

{
  "name": "Myth #9",
  "description": "In-house <a href=\"http://…\" target=\"_blank\" rel=\"noopener\">experts</a> are essential to get search right"
}

Sanitizing your data

Sanitizing HTML in your attributes to remove security risks is a best practice, both for displaying them on the front end and for indexing them with Algolia.
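One simple form of sanitizing is to drop all markup and keep only the text before indexing. The sketch below does this with Python's standard `html.parser` module; the `strip_tags` helper is a hypothetical name, and removing every tag is an assumption about what your application needs. If you must keep some markup, an allow-list based sanitizer is more appropriate.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and discard tags and their attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


clean = strip_tags(
    'In-house <a href="http://example.com" target="_blank" rel="noopener">'
    "experts</a> are essential to get search right"
)
```

Note that this keeps the text inside elements such as script or style as plain text, so it is a starting point rather than a complete defense against malicious input.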
