Scaling to Larger Datasets

You may occasionally need to import a massive number of records (tens or hundreds of millions) into Algolia in a short period. While Algolia can handle a high load of indexing requests, there are a few caveats to keep in mind.

Our engine always prioritizes search operations, so indexing doesn't impact search performance.

Preparation

Let us know

If possible, keep us in the loop when you plan to index a massive quantity of data. With advance notice, we can help by monitoring the infrastructure and engine, and by optimizing the configuration of the machines and indices. For example, we might need to manually fine-tune the internal index sharding for your specific type of data.

Get in touch with either your dedicated Solutions Engineer or support@algolia.com to prepare for massive indexing operations.

Configure your indices before pushing data

We recommend configuring your index before pushing the records. You only need to do this once per index.

Setting searchableAttributes beforehand is particularly important for indexing performance. By default, we index all attributes, which requires significantly more processing power than indexing only the attributes you need.
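For illustration, here's a minimal sketch using the Python API client (assuming a v3-style algoliasearch package; the credentials, index name, and attribute names are placeholders):

```python
from algoliasearch.search_client import SearchClient

# Placeholder credentials and index name: replace with your own.
client = SearchClient.create("YourApplicationID", "YourAdminAPIKey")
index = client.init_index("products")

# Declare only the attributes you actually search on.
# By default, every attribute is indexed, which costs far more processing power.
index.set_settings({
    "searchableAttributes": [
        "name",
        "description",
        "brand",
    ]
})
```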

Ensure the data fits on your machine

When you’re on an Enterprise plan, our servers come with 128 GB of RAM. For optimal performance, we recommend keeping the size of indices below 80% of the total allocated RAM, since we store all indices in memory. The remaining 20% is for other tasks, such as indexing. When the data size exceeds the RAM capacity, the indices swap back and forth between the SSD and the RAM as operations are performed, which severely degrades performance.

Since we process your data, the actual size of an index is often larger than the size of your raw data. The exact factor depends heavily on the structure of your data and the configuration of your index, but it's usually two to three times as big.
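As a rough, back-of-the-envelope check (the 30 GB raw data size below is a placeholder; the multiplier and RAM figures come from the guidelines above):

```python
# Rough capacity check based on the sizing guidelines above.
raw_data_gb = 30              # placeholder: size of your raw records
expansion_factor = 3          # indices are usually 2-3x the raw data size
ram_gb = 128                  # Enterprise servers come with 128 GB of RAM
ram_budget_gb = 0.8 * ram_gb  # keep indices below 80% of total RAM

estimated_index_gb = raw_data_gb * expansion_factor
print(f"Estimated index size: {estimated_index_gb} GB, RAM budget: {ram_budget_gb} GB")

if estimated_index_gb > ram_budget_gb:
    print("The data may not fit comfortably in memory on a single machine.")
```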

Pushing the data

Use our API clients

We highly recommend using our API clients for pushing data, as opposed to using the REST API directly, a custom wrapper, or an unofficial client that we don’t maintain internally. Our API clients follow strict specifications that contain many optimizations for both performance and reliability. These optimizations are required when performing bulk imports.

Batch the jobs

All our API clients have a save objects method that lets you push records in batches. Pushing records one by one is much harder for the engine to process, because it has to keep track of the progress of each job. Batching significantly reduces this overhead.

We recommend sending batches of 1K to 100K records, depending on the average record size. Each batch should remain below ~10 MB for optimal performance. The API can technically handle batches up to 1 GB, but we highly recommend sending much smaller ones.
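Here's a hedged sketch of batched pushes with the Python client (again assuming a v3-style algoliasearch package; fetch_records and the batch size are placeholders to adapt to your data):

```python
from algoliasearch.search_client import SearchClient

# Placeholder credentials and index name: replace with your own.
client = SearchClient.create("YourApplicationID", "YourAdminAPIKey")
index = client.init_index("products")

BATCH_SIZE = 10_000  # tune to your average record size; keep each batch under ~10 MB


def fetch_records():
    """Placeholder generator that yields your records as dicts with an objectID."""
    for i in range(1_000_000):
        yield {"objectID": str(i), "name": f"item {i}"}


batch = []
for record in fetch_records():
    batch.append(record)
    if len(batch) >= BATCH_SIZE:
        index.save_objects(batch)  # one indexing job for the whole batch
        batch = []

if batch:  # flush the last, partially filled batch
    index.save_objects(batch)
```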

Multi-thread

To speed up indexing further, you can push data from multiple servers or from multiple parallel workers, as in the sketch below.
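For example, a sketch of parallel pushes with Python threads (assuming a v3-style algoliasearch client that can be shared across threads; the batches themselves are placeholders, prepared as in the previous example):

```python
from concurrent.futures import ThreadPoolExecutor

from algoliasearch.search_client import SearchClient

# Placeholder credentials and index name: replace with your own.
client = SearchClient.create("YourApplicationID", "YourAdminAPIKey")
index = client.init_index("products")

# Placeholder batches: lists of records, each kept under ~10 MB.
batches = [
    [{"objectID": str(i), "name": f"item {i}"} for i in range(start, start + 10_000)]
    for start in range(0, 100_000, 10_000)
]


def push_batch(batch):
    # Each worker pushes one batch; the official clients retry on transient errors.
    index.save_objects(batch)


with ThreadPoolExecutor(max_workers=4) as executor:
    # map() blocks until every batch has been sent to the API.
    list(executor.map(push_batch, batches))
```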
