Scaling to Larger Datasets

You may occasionally need to import massive numbers of records (in the tens or hundreds of millions) to Algolia in a short period. While Algolia can handle a high load of indexing requests, there are caveats to keep in mind.

The engine always prioritizes search operations over indexing, so indexing doesn’t degrade search performance.

Preparation

If possible, it’s best to keep Algolia in the loop when you plan on indexing a massive quantity of data. With advance notice, we can help by monitoring the infrastructure and engine, and by optimizing the configuration of the machines and indices. For example, we might need to manually fine-tune the internal sharding of the indices for your specific data.

Contact either your dedicated Solutions Engineer or support@algolia.com to prepare for massive indexing operations.

Configure your indices before pushing data

It’s best to configure your index before pushing the records. You only need to do this once per index.

Setting searchableAttributes beforehand is particularly important for indexing performance. By default, Algolia indexes all attributes, which requires more processing power than indexing only the ones you need.
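As a rough illustration, a settings payload that restricts indexing to a few attributes might look like the sketch below. The attribute names (title, brand, description) are hypothetical placeholders; only the searchableAttributes key comes from this guide.

```python
import json

# Hypothetical attribute names: replace them with the fields your users
# actually search on. Only the "searchableAttributes" key comes from this guide.
index_settings = {
    "searchableAttributes": ["title", "brand", "description"],
}

# Serialize the settings once, before the first record is pushed.
settings_payload = json.dumps(index_settings)
print(settings_payload)
```

Apply these settings through your API client before sending the first batch, so that records are processed with the final configuration rather than with the default of indexing every attribute.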

Ensure the data fits on your machine

Plans with a dedicated cluster come with servers with 128 GB of RAM. For optimal performance, you should keep the total size of indices below 80% of the total allocated RAM, since Algolia stores all indices in memory. The remaining 20% is for other tasks, such as indexing. When the data size exceeds the RAM capacity, the indices swap back and forth between the SSD and the RAM as operations are performed, which severely degrades performance.

Since Algolia processes your data, the actual size of an index is often larger than the size of your raw data. The exact factor depends heavily on the structure of your data and the configuration of your index. Usually, an index is two to three times as large as the raw data.
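The arithmetic behind these figures can be sketched as follows. The function name and its parameters are made up for this example; the 128 GB, 80%, and two-to-three-times values are the ones quoted above, and the real expansion factor for your data may differ.

```python
def max_raw_data_gb(ram_gb: float = 128.0,
                    index_share: float = 0.80,
                    expansion_factor: float = 3.0) -> float:
    """Largest raw dataset (in GB) whose indices still fit in the RAM budget."""
    index_budget_gb = ram_gb * index_share      # RAM reserved for indices
    return index_budget_gb / expansion_factor   # raw data that expands to fit


# Indices are usually two to three times the raw data size.
best_case = max_raw_data_gb(expansion_factor=2.0)
worst_case = max_raw_data_gb(expansion_factor=3.0)
print(f"Index budget: {128.0 * 0.80:.1f} GB")
print(f"Raw data ceiling: {worst_case:.1f} to {best_case:.1f} GB")
```

Under these assumptions, a 128 GB server leaves about 102 GB for indices, which corresponds to roughly 34 to 51 GB of raw data. Measure the actual index size on a sample of your records before relying on this estimate.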

Pushing data

Use the API clients

It’s best to use the official API clients for pushing data, as opposed to using the REST API directly, a custom wrapper, or an unofficial client that Algolia doesn’t maintain internally. The official API clients follow strict specifications that contain optimizations for both performance and reliability. These optimizations are required when performing bulk imports.

Batch indexing jobs

All official API clients have a Save objects method that lets you push records in batches. Pushing records one by one is a lot harder to process for the engine because it needs to keep track of the progress of each job. Batching decreases the overhead.

Batches of up to 100K records tend to be optimal, depending on the average record size. Keep each batch below ~10 MB of data for optimal performance. The API can technically handle batches up to 1 GB, but much smaller batches yield better indexing performance.
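One way to respect both limits is to cut batches by record count and by approximate payload size at the same time. The sketch below is a minimal illustration, not part of any API client: iter_batches and its parameters are hypothetical names, the sample records are made up, and the size estimate assumes records are sent as UTF-8 JSON.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def iter_batches(records: Iterable[Record],
                 max_records: int = 100_000,
                 max_bytes: int = 10_000_000) -> Iterator[List[Record]]:
    """Yield batches that respect both a record-count cap and a byte cap."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        # Approximate the wire size by the record's UTF-8 JSON encoding.
        size = len(json.dumps(record).encode("utf-8"))
        if batch and (len(batch) >= max_records or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        # A single record larger than max_bytes still goes out alone.
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


# Tiny limits so the behavior is visible on a toy dataset.
sample = [{"objectID": str(i), "name": f"item {i}"} for i in range(10)]
batches = list(iter_batches(sample, max_records=4, max_bytes=10_000))
print([len(b) for b in batches])
```

Each yielded batch can then be handed to the client's batch method. Because the generator never holds more than one batch in memory, it also works when the records are streamed from a file or a database cursor.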

Multi-thread your indexing

You can push data from multiple servers or from multiple parallel workers.
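On a single machine, parallel workers can be approximated with a thread pool. This is a minimal sketch under stated assumptions: push_in_parallel is a hypothetical helper, and push_batch stands for whatever call your API client uses to send one batch. The demo uses a fake function in its place so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List

Batch = List[Dict[str, Any]]


def push_in_parallel(batches: Iterable[Batch],
                     push_batch: Callable[[Batch], None],
                     workers: int = 4) -> int:
    """Push batches concurrently; return how many batches were sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any worker exception is re-raised here.
        results = list(pool.map(push_batch, batches))
    return len(results)


# Stand-in for a real client call: it only records what it was given.
sent: List[int] = []


def fake_push(batch: Batch) -> None:
    sent.append(len(batch))  # list.append is thread-safe in CPython


demo_batches = [[{"objectID": str(i)}] * 3 for i in range(5)]
print(push_in_parallel(demo_batches, fake_push, workers=2))
```

Note that pool.map submits every batch up front, so for very large imports it is safer to feed it a bounded slice of batches at a time or to give each worker its own partition of the data.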

Handling large datasets

API clients are built with ease of use in mind. They come with a configuration and defaults that work for most cases.

The saveObjects method performs well out of the box. For large indexing jobs, however, you might hit limits and need to tweak the settings.

To index large datasets with an API client, you need to:

  1. Define the ideal batch size
  2. Compress the records

API clients let you change the batch size when using saveObjects. The ideal batch size depends on the size of your records, and finding it takes some iteration on your end. When a batch fails, inspect the HTTP error that Algolia returns to determine whether the batch is too big for the engine.
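The iteration can be partly automated by splitting a batch whenever it is rejected as too large. The sketch below assumes you translate your client's "payload too big" failure into the hypothetical BatchTooLargeError; neither that exception nor push_with_split exists in the API clients, and the fake endpoint only simulates a size limit.

```python
from typing import Any, Callable, Dict, List

Batch = List[Dict[str, Any]]


class BatchTooLargeError(Exception):
    """Hypothetical error: map your client's 'payload too big' failure to it."""


def push_with_split(batch: Batch, push_batch: Callable[[Batch], None]) -> int:
    """Push a batch, halving it whenever it is rejected as too large.

    Returns the number of requests that succeeded.
    """
    try:
        push_batch(batch)
        return 1
    except BatchTooLargeError:
        if len(batch) <= 1:
            raise  # a single record is too big; splitting cannot help
        middle = len(batch) // 2
        return (push_with_split(batch[:middle], push_batch)
                + push_with_split(batch[middle:], push_batch))


# Stand-in endpoint that rejects anything above 3 records.
accepted: List[int] = []


def fake_push(batch: Batch) -> None:
    if len(batch) > 3:
        raise BatchTooLargeError(f"{len(batch)} records")
    accepted.append(len(batch))


records = [{"objectID": str(i)} for i in range(10)]
print(push_with_split(records, fake_push), accepted)
```

The sizes that end up being accepted give you a workable batch size for later runs, so you don't pay for a failed request on every batch.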
