Today we’re announcing the release of Multi Cluster Management, a feature dedicated to improving our fellow SaaS customers’ lives when it comes to search.
While our search engine works for every industry, and our features — from fast indexing to typo-tolerance — work out of the box for every vertical, there are some needs that are particular to certain verticals.
Being a SaaS company ourselves, we share some of the pain points of our SaaS customers, and developed the Multi Cluster Management feature to help address some of these challenges:
- Rapidly scaling to large data sizes — several TBs and growing for our current biggest SaaS users
- Managing infrastructure costs and gross margin contributions to ensure business success of SaaS applications.
- Managing and tracking multiple TBs of data across multiple Algolia clusters without additional work required for load balancing and splitting at the user level
- Matching and migrating individual users to the right cluster based on capacity needs
- Enhanced features and service offerings for users depending on their plan or tier level
How the problem of scale is addressed today
Regardless of your search provider, when the total size of your indices reaches a certain size — typically a few hundred GBs — you’ll need to scale your infrastructure beyond the first cluster or server with which you initially started. For the sake of argument, let’s assume that you reach this limit when you reach 10k users.
So while the first 10k users were handled smoothly, scaling beyond that will add a great deal of pain and complexity into your search infrastructure. At this point, you’ll need to redesign your search solution to spread your users into more servers and manage all the additional complexity that comes along with it.
On Algolia (without Multi Cluster Management)
At Algolia, a single search instance consists of a triple-redundant cluster of servers that are operating in sync to provide the fastest and most reliable search experience possible for your users. Without Multi Cluster Management, if your 10k customers no longer fit on a single cluster, you will need to create a new Algolia application on a second cluster, and manually split your users to balance the load: 5k users on the first cluster, 5k on the other one.
This “solution” doubles the complexity of your initial search application as you will now need to manage two Algolia applications — with two APP_IDs, two sets of API keys — in addition to tracking which user is on which application.
This is where the pain merely begins. When you reach 20k users, both clusters will be full. You’ll need to set up a third cluster, and re-partition your users manually, and again at 30k users, and at 40k users…a painful process indeed!
On other engines (like ElasticSearch)
On other engines, you would rely on the cross-server sharding mechanism. Your users would be automatically balanced across the different servers, but this balancing comes at a cost.
First, each search query requires a full search across each server (or shard), which increases search latency and lowers users’ perception of your search engine’s performance.
Furthermore, since the users are automatically sharded across servers, you don’t have the ability to control which user is assigned to which servers, so you can’t make sure that a given user will be allocated the right amount of resources.
Add in the fact that as you scale (either up or down) and the number of shards changes, you will need to reindex all of your documents, potentially taking search offline until the indexing is completed or doubling your infrastructure and associated costs in order to avoid this downtime. So once again, a painful process!
Solving for scale with Multi Cluster Management
Multi Cluster Management is designed to solve the pain of scaling for SaaS customers on Algolia by making it as easy to manage two, three, four or 100 clusters as it is to manage one cluster. For starters, this means that all clusters will share one application; with one APP_ID, and set of API keys.
Let’s return to our example above, but this time with 100k users and cluster sizes of 10k users per cluster. To manage this setup with Multi Cluster Management, we would always use the same application ID to access our set of 10 clusters. Assigning users is as simple as making an API call to the application and telling it which user belongs on which cluster. No longer is it necessary to track each user’s location or specific application ID.
Once users are assigned, the cluster management becomes abstracted: if you add, update, or delete records for a specific user, there is no need to remember which cluster or application that user is on as the request will be automatically routed to the appropriate cluster. For users, every query entered is always routed to the right cluster with no increase to latency, preserving the instantaneous search experience that they have come to expect.
And finally, let’s say you need to migrate a user to a new cluster for any number of reasons—rapid growth, upgrading to a new tier of service or moving data to a different geographic region. Doing this is now a straightforward, two-step process, with no user facing downtime introduced by this migration:
- 1. Order a new cluster
- 2. Make an API call to assign this user to the new cluster
That’s it! The user will be automatically migrated with no downtime, no manual operation, and most importantly, no complex migration process.
The inverse is also true: adding a cluster or removing a cluster is as straightforward as sending the necessary API calls to migrate users.
How this helps SaaS companies
With Multi Cluster Management, we have started addressing many key considerations that our SaaS customers experience as their businesses scale:
- 1. Search performance
You know how obsessed we are with performance. For that reason, we wanted to avoid having to search across multiple servers when performing a search. With this design, each user is hosted on a single cluster, so a single server can answer queries for a given user. Not having to merge results from different servers allows us to keep the same performance Algolia is known for.
- 2. Adding and removing clusters
Adding or removing a cluster doesn’t require a complex migration strategy. A few API calls and your new setup is live, with no downtime in the process.
- 3. Flexibility
You have a complete control of mapping users to specific clusters, so you can assign users as your needs dictate; for example, you may want to host a specific user on a dedicated cluster, on a shared cluster in a specific region, or even on a dedicated cluster in a specific region.
- 4. Resource control
By controlling to which clusters users are allocated, you are able to manage your infrastructure to achieve the most effective and efficient outcome based on your needs.
Technical choices we made
In this section, we’ll dive deeper into the feature by investigating some of the technical choices we made when creating Multi Cluster Management (MCM). In particular, we want to look at four different topics:
- 1. Why search on only one server at a time
- 2. Why we made user-to-cluster mapping configurable
- 3. Why not use clusters with more servers?
- 4. What use cases are not covered by Multi Cluster Management?
Through these four topics, we hope to explain our reasoning in developing this feature, give our users an in-depth explanation of the feature’s capabilities, and finally explain how to optimize its use.
Why search on only one server at a time
This is probably the most defining design choice of MCM: every search happens on a single server.
When facing a situation where the size of the data is bigger than one server, most search engines decide to shard the data across a cluster with multiple servers using a hash-based mapping: this means that, when searching, you need to query each server, retrieve the matching results, merge the results, and finally provide the answer to the user. As you can probably guess, this introduces a performance bottleneck at the network level between the servers themselves. Requiring multiple servers to communicate lists of results to be able to merge them is slow.
One of our main objectives at Algolia is to make search fast. To achieve this with Multi Cluster Management, we avoid communication between different servers of a cluster by storing 100% of the required data on a single server. Fortunately, this is easily achievable for SaaS use cases where the data is easily partitioned on user-by-user basis.
One good property of data partitioned by user is the fact you can create small, autonomous slices of data, where each slice of data fits within the storage limitations of a single server. This ensures that each user will fit within a single, Algolia cluster. And because an Algolia cluster is defined as three fully-redundant servers, we need to query only one of the servers within the cluster to get the full list of results – eliminating the network bottleneck discussed above.
Why we made user-to-cluster mapping configurable
Our initial intuition was to automate the assignment of users to clusters, so that the first time you send an object related to a new user, this new user would be automatically assigned to a cluster. However, we found a few use cases which required additional flexibility:
Let’s say that ⅔ of your users are in the US, and ⅓ in Europe. Ideally, you would like to have your servers located as close as possible to your users. By making user to cluster mapping fully configurable, it is now within your control to assign users to whichever cluster makes sense to ensure optimal performance for their given regions.
Different tiers of service
If you are a SaaS service, chances are you provide different levels of plans. In that case, you may want to provide a different quality of search to different users. For example, you may want to provide advanced search features such as Query Rules to your premium plans.
Or you may even want to make sure that your most important prospects don’t face an issue during the trial or testing phase that could jeopardize their perception of your product, by placing them on clusters with greater search or indexing resources.
For these reasons, we decided to keep the mapping of users to clusters configurable: the first time you send us a userID, you need to decide in which cluster this userID will be hosted.
And to make this smooth, we provide two handy tools:
- 1. A migration tool: re-assigning users is as easy as an API call
- 2. An overview of each cluster’s status: we expose an API giving you an overview of how full the different clusters are, and how much capacity your largest users are consuming on each cluster.
In the future we may provide different strategies to further simplify user assignment. An example of this would be to introduce multiple mapping modes:
- Manual: you need to provide us the userID<->cluster mapping
- Random: the users are assigned randomly to a cluster in the pool
- Empty-First: the users are automatically assigned to the cluster with the greatest unused capacity
Why not use clusters with more servers?
Algolia’s infrastructure is built from the ground up to ensure reliability. For that reason, every single search cluster consists of three fully redundant servers, so that if any two servers go down, the end-user’s search availability and performance will remain unaffected.
When designing MCM, one of our options was to add more servers to these clusters. For instance, we discussed having clusters of four, five, or 15 servers, where each of these servers would have hosted only a subset of the overall data.
Instead, we decided to trust our initial design choice of three servers per cluster, and then link these clusters together for the following reasons:
Having each cluster composed of three servers allows us to build a highly available infrastructure where the three servers are hosted within different data centers. This allows us to continue offering a 99.99% SLA by default (and 99.999% SLA as a premium option) for our customers, even when multiple clusters are in use.
Easier to manage
For cluster size, complexity grows with scale. The bigger a cluster gets, the more issues it creates. For example, a cluster of 15 servers has 5 times more chances of having any one server go down. Therefore, using five clusters of three servers each and keeping these clusters independent from one another removes an unnecessary risk created by large cluster sizes.
What use cases are not covered by Multi Cluster Management?
As you may have understood by now, one of the core design choices for Multi Clusters Management is the idea that a search will happen on only one cluster at a time.
This choice means that there are use cases for which MCM is not applicable. As long as the size of the dataset into which you’re searching at a given time fits into one cluster, MCM will work perfectly. But if the size of that single dataset outgrows a single cluster’s capacity, and that dataset is not partitionable, then MCM will not work.
For example, let’s say that you want to create an index of all the known objects in the universe (such as stars and planets). This index may end up representing a few TB of index size. If your plan is to enable search across all of the planets and stars at once, then you’ll need to search into your entire dataset. However, this dataset doesn’t fit onto a single cluster, so in order to run this search query you would need to be able to search simultaneously across clusters. This use case, however, is incompatible with this design, as queries are always routed to a single server to be answered.
An example of where MCM would be a fit for searching all the known objects in the universe, however, would be if searches are always done on a solar system by solar system basis. Assuming that no single solar system consists of a data set too large for a single cluster, then MCM will route each search request to the cluster on which that data is hosted and return results.
This design choice is what makes Multi Cluster Management particularly adept for SaaS use cases. For example, for CRM companies, the user data is easily partitioned by end user who is paying for the service. From our experience in SaaS, the number of use-cases where a single customer doesn’t fit in an entire server are very rare (we haven’t encountered the case yet).
Building on the tradition of Algolia’s infrastructure solutions
At Algolia, we always pay special attention to problems of scale. A few years ago, we launched DSN to give customers the ability to provide great search to users anywhere in the world. We then started to spread our clusters to different data centers to be able to rely on multiple provider for servers, electricity, networks… Multi Cluster Management is a unique — and we hope — effective solution to a set of problems that scale presents to SaaS companies.
We look forward to hearing your feedback: @algolia, or comment below!