Preparing Your Data for Indexing

Before sending anything to Algolia, you need to think about where your data lives and what information you want to make searchable. For a retail outlet, it’s products. For a music store, it’s records and artists. For a real estate company, it’s houses and locations.

The next question is what information you need to build a search experience. You don’t need everything from your data source, only what helps your users find, filter, rank, and display results.

Fetching data

To get started with Algolia, you need to extract your data and send it to our servers. The data can come from a relational database, a collection of XML files, a set of Excel spreadsheets, etc. The original data format doesn’t matter; what you need is a way to fetch and transform your data into a format that Algolia understands.

Usually, you want to write a script to fetch, transform, and send your data to Algolia. This script runs on your computer or server. You can access Algolia’s servers via our Search API. We provide you with API clients in many different programming languages, framework integrations and platform extensions to help you send and manage your data depending on your technical stack.

For example, let’s say you have a custom PHP blog with a MySQL database, and you want to make your blog posts searchable. You can create a script that fetches the posts from your database (e.g., with PDO or an ORM), picks and transforms the data, and rearranges it into records. Later on, you can use our PHP API client to send the objects to Algolia, and keep the data up to date when you add, update, or delete a post.
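The fetch-and-transform step described above can be sketched as follows. This is a minimal illustration in Python rather than PHP, and it rests on assumptions: an in-memory SQLite database stands in for the blog’s MySQL database, and the posts table, its columns, and the helper names are hypothetical. Sending the records with an API client is left out on purpose.

```python
import sqlite3

def fetch_posts(conn):
    """Fetch raw rows from the (hypothetical) posts table as dictionaries."""
    cursor = conn.execute(
        "SELECT id, title, body, published_at, draft_notes FROM posts"
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

def to_record(post):
    """Keep only what search needs and give each record a stable objectID."""
    return {
        "objectID": str(post["id"]),
        "title": post["title"],
        "content": post["body"],
        "published_at": post["published_at"],
    }

# In-memory database standing in for the blog's real database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, body TEXT, "
    "published_at TEXT, draft_notes TEXT)"
)
conn.execute(
    "INSERT INTO posts VALUES "
    "(1, 'Hello world', 'First post.', '2020-01-15', 'internal only')"
)

records = [to_record(post) for post in fetch_posts(conn)]
print(records)
```

Deriving objectID from the database primary key is one way to make later updates and deletions target the right record.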

Structuring records

When adding data to your records, you need to be selective. For example, if you’re working with a product line, you don’t need to send every piece of information about your products. Algolia only needs what serves the purposes of search: the information required to find products, rank them, and display them on your website or application.

Building records involves extracting the right data, then reworking it: removing what’s unnecessary, and adding or computing extra information that improves the chances of finding the most relevant results.

Reworking data

Imagine we want to create a search experience around movies, which means we may want to search and display movie titles, synopses, and cast. We also want to display (but not search) images and country of release, filter on genre or a range of dates, and rank based on review scores. However, we don’t care about technical information, such as how long the movie is.

Let’s say your data comes from a relational database, with the information you need in different tables. You need to query the data from these tables. After fetching it, a record for one movie may look like the following:

[
  {
    "title": "Spirited Away",
    "synopsis": "During her family's move to the suburbs, a sullen 10-year-old girl wanders into a world ruled by gods, witches, and spirits, and where humans are changed into beasts.",
    "director": "Hayao Miyazaki",
    "cast": [
      {
        "name": "Rumi Hiiragi",
        "birth_date": "August 1, 1987",
        "birth_place": "Tokyo, Japan"
      },
      {
        "name": "Miyu Irino",
        "birth_date": "February 19, 1988",
        "birth_place": "Tokyo, Japan"
      },
      {
        "name": "Mari Natsuki",
        "birth_date": "May 2, 1952",
        "birth_place": "Tokyo, Japan"
      }
    ],
    "release_year": 2001,
    "country": "Japan",
    "genres": [
      "Animation",
      "Adventure",
      "Family",
      "Fantasy",
      "Mystery"
    ],
    "runtime": 125,
    "aspect_ratio": "1.85:1",
    "content_rating": "PG",
    "review_scores": {
      "imdb": 8.6,
      "rotten_tomatoes": {
        "critics": 97,
        "audience": 96
      }
    },
    "images": [
      "https://yourdomain.com/spirited-away/image1.jpg",
      "https://yourdomain.com/spirited-away/image2.jpg"
    ]
  }
]

Here, we have all kinds of content: some of it is useful for a search experience as is, some needs to be cleaned up or reworked, and some can be eliminated. For example, we don’t need to keep runtime or aspect_ratio. While these attributes are useful in other contexts, they have little to no value when searching, filtering, ranking, or displaying search results.

The same goes for some information in the cast attribute: while the names of the voice actors are useful, we don’t need their birth date and place. Therefore, we can safely remove them and only keep the names. This allows us to remove noise and save room in our records for more useful data.

{
  "cast": [
    "Rumi Hiiragi",
    "Miyu Irino",
    "Mari Natsuki"
  ]
}
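The clean-up described above can be sketched in a few lines of Python. The function name and the UNUSED_ATTRIBUTES constant are hypothetical, and the input is a shortened version of the raw movie shown earlier.

```python
# Attributes with little to no value for searching, filtering, ranking, or display.
UNUSED_ATTRIBUTES = ("runtime", "aspect_ratio")

def rework_movie(raw):
    """Drop unused attributes and reduce each cast member to a name."""
    record = {key: value for key, value in raw.items() if key not in UNUSED_ATTRIBUTES}
    record["cast"] = [member["name"] for member in raw.get("cast", [])]
    return record

raw_movie = {
    "title": "Spirited Away",
    "director": "Hayao Miyazaki",
    "cast": [
        {"name": "Rumi Hiiragi", "birth_date": "August 1, 1987", "birth_place": "Tokyo, Japan"},
        {"name": "Miyu Irino", "birth_date": "February 19, 1988", "birth_place": "Tokyo, Japan"},
        {"name": "Mari Natsuki", "birth_date": "May 2, 1952", "birth_place": "Tokyo, Japan"},
    ],
    "release_year": 2001,
    "runtime": 125,
    "aspect_ratio": "1.85:1",
}

movie = rework_movie(raw_movie)
print(movie["cast"])  # ['Rumi Hiiragi', 'Miyu Irino', 'Mari Natsuki']
```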

Data for searching

Attributes for searching are the ones that contain the terms that your users look for. If you want to search for a movie by title, plot, genre, or cast, you need attributes that contain these terms. In our example, such attributes are title, synopsis, director, cast, and genres.

Algolia lets you define which attributes to search in with the searchableAttributes parameter. By default, the engine searches within the entire record, but you want to adjust this: it’s better for performance, and it allows you to remove noise. You don’t want to search in attributes like images, release_year, or review_scores, which aren’t textually relevant, or country, which may result in false positives. For example, when searching for “japan”, users most likely want to find movies that either have the term in the title or take place in Japan, rather than Japanese movies.

We can therefore set title, synopsis, director, cast, and genres as searchableAttributes, and leave out the rest for displaying, filtering, and custom ranking.
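As a rough illustration, the setting can be written as a plain dictionary. The searchableAttributes key comes from the text above; the helper function is hypothetical and only shows, locally, which parts of a record the setting exposes to search. It doesn’t reproduce how the engine works, and applying the settings to an index with an API client isn’t shown.

```python
import json

index_settings = {
    "searchableAttributes": ["title", "synopsis", "director", "cast", "genres"],
}

def searchable_part(record, settings):
    """Return the subset of a record listed in searchableAttributes (local illustration)."""
    return {
        attribute: record[attribute]
        for attribute in settings["searchableAttributes"]
        if attribute in record
    }

movie = {
    "title": "Spirited Away",
    "director": "Hayao Miyazaki",
    "cast": ["Rumi Hiiragi", "Miyu Irino", "Mari Natsuki"],
    "genres": ["Animation", "Adventure"],
    "country": "Japan",
    "release_year": 2001,
}

searchable = searchable_part(movie, index_settings)
print(json.dumps(searchable, indent=2))
```

With this configuration, country and release_year stay in the record for display and filtering, but a query for “japan” no longer matches on country.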

Additionally, you can add some extra data to improve the discoverability of your records. For example, some users may try to look for a movie by its original title, or by the translation in their own language. Unless the translations are in the record, searching for these terms would return no results, so it’s a good idea to retrieve them and add them to your objects. You can fetch them from your database if you have them, or from a third-party source such as an API or a website.

{
  "display_title": "Spirited Away",
  "original_title": "千と千尋の神隠し",
  "alternative_title": [
    "Le voyage de Chihiro",
    "El viaje de Chihiro",
    "Chihiros Reise ins Zauberland"
  ]
}

Data for filtering and faceting

When you have a significant amount of data, you can improve your search experience and let users fine-tune their query by filtering it down. For example, they may want to find all movies by director Hayao Miyazaki, find new adventure movies to watch or look for the best motion pictures of the past year.

Algolia lets you filter down results based on attributes. In our case, we could use director, cast, country, content_rating, and genres and display them as refinement lists in our search experience, as well as release_year to display a range slider. You can declare them with the attributesForFaceting parameter.

Filterable attributes can be anything, but you want to make sure to normalize their content. For example, if the genres attribute contains the term “Animation” in one record and “Animated picture” in another, these result in two different facet values for what is really the same genre.
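One possible way to normalize facet values before indexing is to map known variants to a canonical label. The sketch below assumes a hand-maintained alias table (GENRE_ALIASES, hypothetical) and lists the facet attributes named above under attributesForFaceting.

```python
# Hypothetical alias table: lowercase variant -> canonical facet value.
GENRE_ALIASES = {
    "animated picture": "Animation",
    "animated": "Animation",
}

def normalize_genres(genres):
    """Trim whitespace, map known variants to a canonical label, drop duplicates."""
    normalized = []
    for genre in genres:
        cleaned = genre.strip()
        canonical = GENRE_ALIASES.get(cleaned.lower(), cleaned)
        if canonical not in normalized:
            normalized.append(canonical)
    return normalized

records = [
    {"title": "Spirited Away", "genres": ["Animation", "Adventure"]},
    {"title": "My Neighbor Totoro", "genres": ["Animated picture ", "Family"]},
]
for record in records:
    record["genres"] = normalize_genres(record["genres"])

# Facet attributes named in the text, declared as index settings.
faceting_settings = {
    "attributesForFaceting": [
        "director", "cast", "country", "content_rating", "genres", "release_year",
    ],
}

facet_values = sorted({genre for record in records for genre in record["genres"]})
print(facet_values)  # ['Adventure', 'Animation', 'Family']
```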

Custom ranking

Algolia ensures that the most textually relevant results come first, but there might be ties in terms of relevance. For example, if a user looks for “james bond”, all James Bond movies would match equally. Without anything else to break the tie, Algolia falls back on the objectID in alphanumeric order, which isn’t relevant. A much more meaningful way to break the tie is to rely on a piece of information that’s meaningful for our use case. For movies, it can be review scores, or likes on a given platform. For a retail store, it could be number of sales.

Algolia lets you inject business metrics to influence the ranking formula via the customRanking parameter. Attributes for custom ranking can be either numeric or boolean.

In our case, we can leverage the review_scores attribute. However, since we have several scores, we may want to combine them into a single global score and use it in customRanking. The computed attribute would look like this:

{
  "computed_score": 201.6
}

Depending on whether we plan to display the individual scores in the search results, we may decide to keep the original review_scores or get rid of it altogether to further simplify our record.
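The text doesn’t say how the global score is derived. As an assumption, the sketch below simply adds the three review scores, which happens to reproduce the 201.6 value used in the example. The function name is hypothetical, and the customRanking value uses the desc(attribute) form.

```python
def compute_score(review_scores):
    """Combine individual review scores into one number (assumed here: a plain sum)."""
    rotten_tomatoes = review_scores["rotten_tomatoes"]
    total = (
        review_scores["imdb"]
        + rotten_tomatoes["critics"]
        + rotten_tomatoes["audience"]
    )
    # Round to one decimal to avoid floating-point noise in the stored value.
    return round(total, 1)

review_scores = {
    "imdb": 8.6,
    "rotten_tomatoes": {"critics": 97, "audience": 96},
}

record = {
    "title": "Spirited Away",
    "computed_score": compute_score(review_scores),
}

# Highest computed score first when textual relevance is tied.
ranking_settings = {"customRanking": ["desc(computed_score)"]}

print(record["computed_score"])  # 201.6
```

A plain sum gives the IMDb score (out of 10) far less weight than the two percentage scores. If that matters for your use case, you could rescale each score to a common range before combining them.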

Handling record hierarchy

Simplifying and restructuring your records doesn’t mean you have to lose hierarchy or relationships. For example, if you want your users to search for movies and see them organized by director, you need to store this relationship in your index. Algolia doesn’t impose a data schema. You can organize your data in any way you want, and keep records simple without losing the relationships between them.

Let’s look at an example that simplifies hierarchical records. A director and all the movies they directed could either be one or many objects. A single record can represent a director, but it’s not useful for searching individual movies. While in a traditional database, you wouldn’t want to repeat data, this is perfectly okay in your Algolia index.

Take the following record, with nested movies:

{
  "director": "Hayao Miyazaki",
  "movies": [
    {
      "title": "Spirited Away",
      "score": 201.6
    },
    {
      "title": "My Neighbor Totoro",
      "score": 196
    },
    {
      "title": "Princess Mononoke",
      "score": 195.4
    }
  ]
}

This is only useful if your search experience focuses on finding directors, not on movies. Now take a look at that same information in three flatter, less hierarchical records:

[
  {
    "title": "Spirited Away",
    "score": 201.6,
    "director": "Hayao Miyazaki"
  },
  {
    "title": "My Neighbor Totoro",
    "score": 196,
    "director": "Hayao Miyazaki"
  },
  {
    "title": "Princess Mononoke",
    "score": 195.4,
    "director": "Hayao Miyazaki"
  }
]

Here, you can find movies much more easily. Additionally, if you wanted only to show a limited number of movies per director (e.g., the most popular one), you could leverage Algolia’s distinct feature on the director along with custom ranking on score. This way, searching for “miyazaki” would only return Miyazaki’s most popular movie.
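Flattening the nested director record into one record per movie can be sketched as follows. The function names are hypothetical. The settings dictionary shows one way to combine distinct with custom ranking, and best_per_director is only a local approximation of the effect, not Algolia’s implementation.

```python
def flatten_director(director_record):
    """Turn one director record with nested movies into one record per movie."""
    return [
        {
            "title": movie["title"],
            "score": movie["score"],
            "director": director_record["director"],
        }
        for movie in director_record["movies"]
    ]

nested = {
    "director": "Hayao Miyazaki",
    "movies": [
        {"title": "Spirited Away", "score": 201.6},
        {"title": "My Neighbor Totoro", "score": 196},
        {"title": "Princess Mononoke", "score": 195.4},
    ],
}
movie_records = flatten_director(nested)

# Deduplicate on director and break ties with the highest score first.
distinct_settings = {
    "attributeForDistinct": "director",
    "distinct": True,
    "customRanking": ["desc(score)"],
}

def best_per_director(records):
    """Local approximation: keep only the best-scored movie for each director."""
    best = {}
    for record in records:
        current = best.get(record["director"])
        if current is None or record["score"] > current["score"]:
            best[record["director"]] = record
    return list(best.values())

print([record["title"] for record in best_per_director(movie_records)])  # ['Spirited Away']
```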

Note that flattening data adds more records to your index. If you have 10,000 directors with an average of ten movies each, this results in an index with 100,000 records. It may sound like a lot, but it’s not. Beyond what your plan allows, Algolia doesn’t prescribe a limit on the number of records; the only constraint is the size of the index on disk.

Structuring indices

An index is a collection of records. When you perform a search, you look into the records of an Algolia index.

An important principle that comes from relational databases is to distribute data across different tables. You break information into small, meaningful units to avoid redundancy. With Algolia, however, these principles don’t matter. As seen above, with record structure and data hierarchy, flattened data is best for searching. This applies at the index level too. It might seem reasonable to create several indices and map them to your tables, where each index represents a different kind of entity. For example, you may want to separate movies from actors and create an index for each. However, this might not serve the purposes of your search. What if you want your users to search for both movies and actors at the same time, and for them to appear in the same results? In that case, a single index works better.

The main reason for this is relevance. Algolia searches one index at a time; it doesn’t perform cross-index searches. Searching two indices at once produces two sets of results, each with its own internal relevance configuration. Algolia doesn’t merge these results, and trying to do this yourself would only break the relevance. That’s because combining Algolia’s results after a search requires understanding and re-implementing Algolia’s ranking algorithm. You would invariably be undoing the work that Algolia does for you.

This doesn’t mean there are no reasons to have multiple indices, but splitting data per entity isn’t one of them.

When to use multiple indices

Keeping separate indices is usually a user interface question. If you want your search experience to display movies and actors separately, it’s better to use different indices.

Another case is when you want to showcase popular queries with an autocomplete menu, which you can do with Algolia’s Query Suggestions feature. For that, you need two indices: one for your content, and one for the common queries.

You also need to use separate indices when you want to let your users switch between different rankings, such as ascending or descending popularity. While you can’t dynamically change the ranking of an index, you can use replica indices with the same data and different ranking strategies.
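A possible layout for such replicas is sketched below. The replicas and customRanking keys are index settings; the index names, the popularity attribute, and the helper function are hypothetical. The replica here only reverses the custom ranking order, so treat it as a starting point rather than a complete sorting configuration.

```python
# Primary index: most popular first when textual relevance is tied.
primary_index = "movies"
primary_settings = {
    "customRanking": ["desc(popularity)"],
    "replicas": ["movies_popularity_asc"],
}

# Replica index: same data, opposite ordering.
replica_settings = {
    "movies_popularity_asc": {"customRanking": ["asc(popularity)"]},
}

# Map a sort choice in the user interface to the index to query.
SORT_TO_INDEX = {
    "popularity_desc": primary_index,
    "popularity_asc": "movies_popularity_asc",
}

def index_for_sort(choice):
    """Return the index name for a sort choice, defaulting to the primary index."""
    return SORT_TO_INDEX.get(choice, primary_index)

print(index_for_sort("popularity_asc"))  # movies_popularity_asc
```

The front end never changes the ranking of an index: it switches which index it queries.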

You shouldn’t hesitate to use several indices when necessary. The guiding principle to keep in mind is that a search is performed on a single index at a time. This means that every index must contain all the necessary records to retrieve exhaustive results. If you want two objects to appear in the same result set and have them weighed against each other in the same relevance computation, they need to be in the same index.

Best formatting practices

Custom ranking

High on the list of relevance tuning is custom ranking. When two records rank the same based on textual relevance, you want to tie break based on meaningful metrics, such as popularity, number of views, or number of sales. Including business metrics in your records is crucial to create a relevant search experience.

We go more in-depth on the topic of custom ranking later on in the documentation, but we encourage you to start thinking about such metrics when fetching your data and structuring your records.

Filters and facets

Including attributes for filtering and faceting helps with relevance, because it helps your users refine their search and narrow down results. While you can rely on existing categories from your data, you can also reuse your searchable or ranking data to create new filters and facets. Going back to our movies example, we used actors as facets, but we can also leverage numeric attributes like year to display a range filter on the front end, or string attributes like country to refine movies by country.

A good rule of thumb is to add attributes based on what your end users want to fine-tune their search with. If you’re selling movies online, useful categories include distribution system (e.g., DVD, Blu-ray, VOD) or content rating (e.g., PG, PG-13). If you have a movie reviews website, users likely want to refine on review score or popularity.

Advanced formatting

While Algolia provides a vast collection of settings to help with relevance, many of these work in combination with how you format your content. Examples of this are whether to use one or many attributes for a single piece of information, include long or short descriptions (or both), repeat the same words in the title and description, round custom ranking attributes, etc.
