Mar. 22, 2019

Indexing Large Data

Introduction

You may occasionally need to import a massive number of records (tens or hundreds of millions) into Algolia in a short period of time. While Algolia can handle a high load of indexing requests, there are caveats to keep in mind. This guide reviews some tips to help you import your data safely.

Our engine always prioritizes search operations, which means indexing won’t negatively impact the search performance.

Preparation

Let us know

If possible, keep us in the loop when you plan to index a massive quantity of data. With some advance notice, we can help by monitoring the infrastructure and engine and by optimizing the configuration of the machines and indices. For example, we might need to manually fine-tune the internal sharding of the indices for your specific type of data.

Either your dedicated Solutions Engineer or support@algolia.com will be able to help you out.

Configure your indices before pushing

It’s always better to configure your index before pushing the records (you only need to do this once per index, not with every batch). Setting searchableAttributes beforehand is particularly important to ensure the best indexing performance. By default, we index all attributes, which requires significantly more processing than indexing only the necessary ones.
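As a rough illustration of the ordering described above, the sketch below applies settings once and only then pushes the batches. The method names set_settings and save_objects are assumptions modeled on the "Save objects" method mentioned later in this guide; the RecordingIndex class is a hypothetical stand-in for a real index object, and the attribute names are made up. Check your API client's reference for the exact calls.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


class RecordingIndex:
    """Hypothetical stand-in for a real index object; it only records calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def set_settings(self, settings: Dict[str, Any]) -> None:
        self.calls.append("set_settings")

    def save_objects(self, objects: List[Record]) -> None:
        self.calls.append(f"save_objects({len(objects)})")


def configure_then_push(index: Any, settings: Dict[str, Any],
                        batches: Iterable[List[Record]]) -> None:
    """Apply settings once per index, then push every batch."""
    index.set_settings(settings)  # once, before any record is sent
    for batch in batches:
        index.save_objects(batch)  # settings are NOT re-sent per batch


index = RecordingIndex()
configure_then_push(
    index,
    {"searchableAttributes": ["title", "description"]},  # illustrative attributes
    [[{"objectID": "1"}], [{"objectID": "2"}, {"objectID": "3"}]],
)
```

With a real client, the same function would accept the client's index object in place of RecordingIndex, provided it exposes equivalent methods.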

Ensure the data fits on your machine

If you’re on an enterprise plan, our servers come with 128 GB of RAM. For optimal performance, we recommend keeping the total size of the indices below 80% of the RAM, because the indices are stored in RAM. The remaining 20% is used for other tasks (such as indexing). If the data size exceeds the RAM capacity, the indices will be swapped back and forth between the SSD and RAM as operations are performed, which will severely degrade performance.

It’s worth noting that since we process the data, the actual size of an index is often larger than the size of the raw data. The exact multiplication factor depends heavily on the structure of your data and the configuration of your index, but an index is generally 2 to 3 times the size of the raw data.
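The sizing rule above can be turned into a quick back-of-the-envelope check. This is only a sketch: the function names are hypothetical, the 128 GB and 80% figures come from this guide, and the default factor of 3 is the pessimistic end of the 2 to 3 times range. The real factor depends on your data and index configuration.

```python
def estimated_index_gb(raw_gb: float, factor: float = 3.0) -> float:
    """Estimate index size from raw data size (factor is typically 2 to 3)."""
    return raw_gb * factor


def fits_in_ram(raw_gb: float, factor: float = 3.0,
                ram_gb: float = 128.0, budget: float = 0.8) -> bool:
    """True if the estimated index stays within the recommended RAM budget."""
    return estimated_index_gb(raw_gb, factor) <= ram_gb * budget


# 30 GB of raw data at a factor of 3 is about 90 GB, under the ~102 GB budget.
small_ok = fits_in_ram(30)
# 40 GB of raw data at a factor of 3 is about 120 GB, over the budget.
large_ok = fits_in_ram(40)
# The same 40 GB at a factor of 2 is about 80 GB, which fits.
large_ok_optimistic = fits_in_ram(40, factor=2.0)
```

Because the factor is uncertain, it is safer to plan with the upper bound and treat a borderline result as a reason to get in touch before importing.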

Pushing the data

Use our API clients

We highly recommend using our API clients for pushing data, as opposed to the REST API or a custom wrapper built on top of the REST API. Our API clients contain many optimizations for both performance and reliability. These optimizations are required when performing bulk imports. If necessary, you can build a wrapper on top of our API clients.

Batch the jobs

All of our API clients have a Save objects method that allows you to push objects in batches. Pushing the objects one by one is much harder for the engine to process, because it needs to keep track of the progress of each individual job. Batching greatly decreases this overhead.

We recommend sending batches of 1,000 to 100,000 records, depending on the average record size. Each batch should be no more than roughly 10 MB for optimal performance. Technically, the API can handle batches of up to 1 GB, although we highly recommend sending much smaller batches.
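One way to respect both limits at once is to cut a batch whenever either the record count or the approximate payload size would be exceeded. The helper below is a minimal sketch, not part of any API client: batch_records is a hypothetical name, the defaults are taken from the recommendations above, and the size is approximated by the length of each record's JSON encoding, which may differ slightly from what is actually sent over the wire.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def batch_records(records: Iterable[Record],
                  max_records: int = 10_000,
                  max_bytes: int = 10 * 1000 * 1000) -> Iterator[List[Record]]:
    """Yield batches capped by record count and approximate JSON size."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        size = len(json.dumps(record).encode("utf-8"))
        # Close the current batch if adding this record would break a limit.
        if batch and (len(batch) >= max_records
                      or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)  # an oversized record ends up alone in a batch
        batch_bytes += size
    if batch:
        yield batch


records = [{"objectID": str(i)} for i in range(25)]
by_count = [len(b) for b in batch_records(records, max_records=10)]
# Each of these small records encodes to 17 bytes, so 40 bytes holds two.
by_size = [len(b) for b in batch_records(records[:5], max_bytes=40)]
```

Each yielded list can then be passed to the Save objects method of your API client. Since the helper is a generator, the full data set never needs to be held in memory at once.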

Multi-thread

You can push the data from multiple servers, or from multiple workers in parallel.
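For the multiple-workers case, a thread pool is one simple option, since pushing a batch is mostly network-bound. The sketch below is an assumption-laden example: push_in_parallel is a hypothetical helper, and push_batch stands for whatever function sends one batch through your API client (here replaced by a fake that just counts records).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


def push_in_parallel(batches: Iterable[List[Record]],
                     push_batch: Callable[[List[Record]], int],
                     max_workers: int = 4) -> List[int]:
    """Send batches concurrently; results come back in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() re-raises the first exception from a worker when iterated.
        return list(pool.map(push_batch, batches))


def fake_push(batch: List[Record]) -> int:
    """Placeholder for a real push; returns the number of records 'sent'."""
    return len(batch)


demo_batches = [
    [{"objectID": "1"}, {"objectID": "2"}],
    [{"objectID": "3"}],
    [{"objectID": "4"}, {"objectID": "5"}, {"objectID": "6"}],
]
pushed = push_in_parallel(demo_batches, fake_push, max_workers=2)
```

Note that pool.map submits every batch up front, so for very large imports you may prefer to feed the pool from a bounded queue to limit memory use.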

Questions?

If you plan to push a large amount of data to your servers, please reach out to us directly either at support@algolia.com or through your dedicated Solutions Engineer.
