Algolia DevCon
Oct. 2–3 2024, virtual.
Guides / Scaling

Multi-cluster management (MCM)

Managing multiple clusters

Multiclusters

Migrating to the multi-cluster management setup isn’t available anymore. The lack of features associated with this architecture is limiting. Instead, if you wish to partition your data per user, prefer to use facets and secured API keys. If your issue is with the amount of data, ask the support team about upgrading to a bigger cluster to suit your needs.

What is multi-cluster management?

Normally, all of your data can fit onto one cluster. However, when your data becomes too large for one cluster, Multi-Cluster Management (MCM) offers a way to logically break up your data so that:

Consider an email system, in which users search their own (massive) email history. This is a case where one cluster will quickly become insufficient to store all those emails. Before MCM, you could have added more clusters, but, as discussed below, managing multiple clusters without MCM would be difficult. MCM helps you manage the redistribution of the emails, making it easy for you to put some users on cluster 1, others on cluster 2, and still others on as many clusters as need be. And this is easily scalable as the number of email users grow.

It is also transparent: without MCM, you would need to manage the multiple clusters yourself. This would include keeping track of the clusters and their different application ids, and which users are on which clusters. And you would also be moving data from one machine to another, to balance the load or when adding new clusters. Using Algolia’s MCM all you need to do is tell the MCM API on which unique attribute you wish to split your data (here, the user’s email), and then make sure that all indexing and search operations contain the appropriate user attribute (essentially, every index record needs to include the email). The rest is managed with the MultiCluster API: cluster management, index mapping, search mapping, and load balancing.

Other use-cases

Managing user emails is only one use case.

  • SaaS providers like Salesforce or Dropbox can use MCM to manage their data. This is similar to the email system, but with a unique customer id - each cluster gets a different set of customers.
  • Music streaming services like Deezer and Spotify. Initially, a single cluster may be sufficient to house the full collection of music: all private and public playlists. However, when the number of private playlists becomes too large to fit on one cluster, you would move to multiple clusters. Now, every cluster will still contain the full collection of music and public playlists, but MCM will enable [distributing the private playlists over different clusters]. MCM is easily scalable. As a result, any company with only 1 cluster can scale up to as many clusters as they need, to match the growth of their own customer base. The only requirement is that their data can be split into discreet chunks (by UserID or customerID).

Essentially, MCM comes into play whenever a set of data can be logically split into smaller chunks, such that a full search can be performed on only one chunk at a time.

One search, one cluster

MCM does not enable searching across clusters, nor does it merge results from different clusters. The distribution is based on a split where a user’s entire dataset can be placed on a single cluster, so that only one cluster is actually needed to perform the complete search. MCM does not, in other words, aggregate results from different clusters.

This works perfectly well for an email system, because users have exclusive access to their own data. In contrast, consider an online library that enables full-text book searching. You couldn’t spread the books on different servers and expect complete results from a single-cluster search. You would need to search all of the clusters and then aggregate the results.

Multiple clusters in more detail

Load balancing

One of the natural consequences of splitting up your data is that the split will not always stay balanced - some users will have significantly more data than others. MCM simplifies load balancing. You will base your initial cluster distribution on current usage but, over time, as usage grows, the initial balance will get undone. With MCM, it will be easy for you to move your user data around to sustain a balanced state.

More about your data

Just to reiterate how MCM works:

  • You’ll first want to slice up your full index into smaller index-subsets, where each subset is tagged with a user-id (that is, user-partitioned)
  • The indexing operations will then load these index-subsets to different clusters
  • Once done, every search and indexing operation must include the UserID

Some important points:

  • Clusters can (and will) contain more than one user
  • No user can be on more than one cluster
  • No single user data can be larger than the size of a single cluster

Global vs private data

As mentioned in the music-streaming use case, clusters can have the same data as well as different data. This is seen when you allow users to search public data (music collection, public playlists) as well as private data (private playlists).

Private data is managed by assigning the UserID to a particular user.

With MCM, adding, and managing global data across every index is as easy as setting the UserID to “*”.

Mapping users to their cluster

Prior to implementing MCM, every Algolia cluster is assigned a unique application id. If you use multiple clusters, you must keep track of which application id maps to which cluster. Additionally, you must keep track of which user is on which cluster. Therefore, to manage multiple clusters, you need a mapping table of three items: user, cluster, and application id. Every indexing and search operation needs to go through this mapping.

With MCM, the clusters in a multi-clusters configuration use the same application id. For the mapping, every cluster contains the mapping. You only need to send the UserID to Algolia: Algolia will then do the necessary redirection to find the correct cluster.

The way this works concretely is that there is at most two clusters involved in every search. The first cluster - normally the one closest to the user - will either be the correct cluster for that user, or it will redirect the request to the correct cluster using its local map.

Moving data and scalability

MCM is easily scalable because every cluster is independent from a data point of view: your data is broken up into as many clusters as you need, and every search is performed on its own cluster.

Moving data from one cluster to another (for load balancing or adding new clusters) is a simple process. Without MCM, moving data requires a number of very precise steps in the right order to avoid any downtime. Additionally, bringing in new clusters requires a number of move operations that is both time-consuming and resource-intensive. Error handling and rollbacks can be difficult.

One of the primary goals for MCM was to simplify this process of moving data and adding new clusters, by minimizing the number of API calls and parameters you would need to use. This is key aspect of the scalability achieved through MCM: given its simplicity, MCM makes it easy to scale to the growth of your own user base.

The MultiClusters API

The MultiClusters API hides the difficulties associated with multiple clusters by adding a layer on top of the classical API.

As with all of Algolia’s REST APIs, anything that can be done using the REST API can be done using an Algolia API client.

To work with the API, you’ll need to ensure that each record is tagged with a UserID. Once that tagging is complete, the API takes over managing user-to-cluster mapping, redirecting every update and query to its proper cluster.

Further reading

Did you find this page helpful?