Concepts / Scaling / Servers and clusters
Jun. 03, 2019

Servers and Clusters

Algolia’s Bare Metal Servers

Server

While virtualization is the choice of a vast majority of SaaS services, Algolia has decided to use bare metal servers. Bare metal servers give applications direct access to the physical and software resources of a computer. For example, all Algolia searching and indexing operations are processed by the Algolia engine, which in turn directly interacts with the computer’s essential resources, such as the Operating System, CPU, RAM, and disk.

With a virtual machine, however, all applications need to pass by one or more additional layers of system-level software (the virtual machine) before reaching the services of the underlying physical server. While this clearly slows things down, on the positive side, it also creates flexibility, by spreading a single server’s capacity over many discreet use-cases: one customer might want to run a massive SQL Server database on a Windows computer; another customer might want to perform CPU-intensive calculations using an old version of unix; and another might want to simulate a macintosh - all of which can be done on the same, shared server, using virtualization.

For Algolia, virtual machines are not necessary. Bare metal servers have been time- and task-slicing for years, without the need to virtualize. With powerful hardware components, a single server can handle countless customers - especially if they are all doing the same operations, which is the case with Algolia.

Additionally, many of our larger customers use dedicated servers, giving them exclusive access to the entire cluster. Most of our smaller accounts share servers - which means that they share the same server and Algolia engine. However, whether it is dedicated or shared, the principle is the same - Algolia operates directly on the machine.

Note that when we say server, we are actually referring to a cluster of 3 identical servers. Thanks to clusters, Algolia can provide an SLA reliability of 99.99%, because a cluster guarantees that at least 1 of the 3 servers will be available at all times.

Algolia Cluster

What is an Algolia cluster?

Cluster

  • A set of 3 servers “clustered” together to handle all requests.
  • Each server is equal to the others; that is, each one is equally capable of responding to every request.
  • For this to be possible, each server must have the same data and index settings, as well as the same overall system configuration, thereby enabling the cluster to behave as a single server.
  • Why is this done? For redundancy - so that if 1 or 2 servers go down, the cluster is still available. This is how we guarantee an SLA of 99.99% availability.

Algolia clusters in more detail

Algolia has over 400 clusters in 15 regions and 60 data centers, with each cluster consisting of 3 bare metal servers.

A cluster of 3 servers acts as 1, with each one ready to serve at any moment, waiting for the next request - while the other 2 remain on-call to process the following requests.

We refer to this as a 3-way partnership, in which all computers are of equal value, and each are configured with the same exact software, system settings, and (generally speaking) the same hardware specifications.

And most importantly, they contain the same exact data and index settings.

Every Algolia customer uses a cluster for all search and indexing operations. Take a customer who has an immense clothing collection, with one of the most active retail websites in the world. At any given moment, they can be updating their indices while 1000s of clients are searching for different clothing items. Each request is balanced and evenly distributed, so that all 3 servers are sharing the load.

Why does Algolia use a cluster?

So why a cluster? Why 3 machines?

Let’s start off with what a cluster is not: It is not designed to optimize capacity. Algolia does not split a customer’s data across 3 computers, with each machine getting a third of the data. Admittedly, this would triple the capacity. But this is not the goal.

In fact, as regards capacity, Algolia does not need to do this, because generally speaking customers do not need more than one server to store its data - even if a customer has an enormous database. The indexing required for search only needs a small subset of a customer’s data, small enough for one server.

Additionally, this is not about concurrency, where the parts of one operation are split across different computers. Clusters are not designed for parallel computing: each server in a cluster processes independently the whole request.

Ultimately, a cluster is about redundancy, or reliability. For Algolia, alongside performance is reliability: A fast and relevant search is of little value if the search engine is unavailable.

Redundancy - What happens when a server goes down?

Or more precisely - What happens to a cluster when one or more of its servers go down?

For us, a server is “unavailable” when it is unreachable by the other servers. Unavailability can be caused by many things - a temporary network failure, a server is too busy to respond, it is physically down - whatever its cause, what is important here is that it is unreachable by the other servers in the cluster.

And as we will explain below with our consensus algorithm, synchronizing data within the cluster requires uninterrupted communication between all 3 servers; therefore, when one or more machines in a cluster become unreachable, this puts that synchronization at risk.

When a server is unreachable

If 1 machine is unreachable, the other 2 will continue to function normally - processing both indexing and search requests. They can achieve consensus among themselves, and when the 3rd returns, it can be properly updated/synchronized with the same index as the other 2.

Unfortunately, while a server might be unreachable to the other servers in the cluster, it might nonetheless be reachable by your own servers - that is, it might still be able to receive indexing requests from your own servers. This is a serious problem for synchronization: the “down” server has no idea what the other 2 servers are doing with their indices, and so if it were to start using its own indexing changes without sharing those changes with the others, the overall cluster will end up with 2 very different sets of data.

To handle this, we queue indexing jobs on any server that is unreachable. So while the other 2 servers continue to process their indexing jobs - in a certain order, synchronizing between them - the absent server will put on hold (queue) any indexing jobs, and process them only once the whole cluster is back together.

Availability over Consistency

What we are describing here is the common tradeoff, for hosting services, between availability and consistency of data.

  • Availability: always being able to access your data (ie. no service outage).
  • Consistency: viewing the same data everywhere at the same time (i.e., all users getting the same search results at the same exact time).

Algolia has chosen the fast, availability option over consistency of data because when someone searches they should get results without fault. If there are small differences of data between users, this is far less critical than not getting results at all.

For many technical reasons, achieving these two goals with equal success is an impossibility (CAP theorem, eventual consistency). Primary among these reasons is the following: Every client gets 3 servers to service all search requests, which guarantees that at least one server is available at all times. However, if we were to delay searches until all 3 computers have the same exact data, this would cause numerous delays, thereby undermining our milliseconds guarantee.

That being said, synchronizing data between 3 servers takes seconds or less; users therefore do not often experience data discrepancies.

Regarding Search operations

Meanwhile, with regards to search, server-to-server communication is less important; therefore, as long as a server is functional, we allow it to process search requests.

From Servers to Clusters to DSN

As discussed, there are multiple reasons to use clusters.

  • Availability: If 1 or 2 servers go down, the users of your website won’t be affected. Search is always available. We’ve never had all 3 servers go down at the same time.
  • Redundancy: Having 3 live copies of your data makes it very unlikely that we’ll ever lose it.

Consensus of 3 servers

We knew early on that to achieve this kind of reliability, we would need a 3-server cluster. Initially, we used 1-server-per-region to process every indexing and search operation. Our focus was on the machine and how to improve its performance. However, we also needed reliability, so we quickly switched over to a cluster.

Cluster history

Clusters require a solid consensus algorithm to ensure that each server contains the same data at all times, without service interruption. We went with the (RAFT algorithm). RAFT coordinates all index input - adding, updating, and deleting index data - so that all machines in a cluster get updated at the same time.

Distance Counts

We next needed to put the cluster’s servers in different data centers. For a simple reason: When servers share the same data center or same power lines, a single flood or power outage can bring down the entire cluster. Thus, to ensure cluster reliability, we needed to separate the servers so that no single act (power outage or otherwise) could bring down the whole cluster. We succeeded in doing this by adding new data centers in neighboring regions with no physical links. So for example, we have servers in the same cluster separated by more than 300 km.

Additionally, we chose our internet service providers (ISP) carefully. Sharing a network is the single greatest cause of system downtime. So part of creating distance is to address network issues, and we do this by ensuring that no servers in the same cluster use the same ISP.

And we were able to add these distances without affecting the important RAFT consensus among the machines.

Extending the cluster with DSN

Finally, for customers with a worldwide client-base, we introduced a Distributed Search Network (DSN). While we discuss DSN in more detail, it is useful to see how it fits in with our cluster model.

DSN adds one or more satellite servers to a cluster, thereby extending a customer’s reach into other regions. Every DSN contains the full data and settings of the cluster. Take the example of a cluster on the East Coast of the US. An East Coast customer can add a DSN server to the West Coast, to bring the server closer to its West Coast clients. This will reduce network latency (between client and server), which will improve performance. Additionally, DSN can be used to share the load of large cluster activity: a customer can offload requests to the DSN whenever its cluster(s) reach peak usage.

Monitoring and locating Algolia’s clusters and servers

You can monitor your servers and clusters via the dashboard. Go to Dashboard Icon -> API Status, then click on the cluster name.

You can also monitor (and configure) your DSNs. Goto to Infrastructure Icon.

Finally, for Enterprise customers, we have a Monitoring API that provides a window into all your cluster and DSN activity.

Where are the clusters and servers located?

We are obsessed with high performance and delivering the best user experience. For these reasons, we have decided to deploy a distributed architecture with several clusters around the world.

Our 400 clusters are currently located in 15 different regions, 60 different datacenters worldwide:

  • US-East (Virginia): two different Equinix data centers in Ashburn & COPT DC-6 in Manassas (three independent Autonomous Systems).
  • US-West (California): three different Equinix data centers in San Jose (three independent Autonomous Systems).
  • US-Central (Texas): two different data centers in Dallas (two independent Autonomous Systems)
  • Europe (France): four different datacenters in Roubaix, two different datacenters in Strasbourg and one datacenter in Gravelines.
  • Europe (Netherlands): four different datacenters around Amsterdam
  • Europe (Germany): seven different datacenters in Falkenstein and one datacenter in Frankfurt (two independent Autonomous Systems)
  • Canada: four different datacenters in Beauharnois
  • Singapore: two different datacenter in Singapore (two independent Autonomous Systems)
  • Brazil: three different datacenters around São Paulo (two independent Autonomous Systems)
  • Japan: one datacenter in Tokyo and one data center in Osaka
  • Russia: one data center in Moscow
  • Australia: three datacenters in Sydney (two independent Autonomous Systems)
  • India: one data center in Noida
  • Hong Kong: two different datacenters (two independent Autonomous Systems)
  • South Africa: two datacenters in Johannesburg ( two independent Autonomous Systems)

When you create your account, you can pick which region you want to use. Also, you can use our DSN feature to distribute your search engine in multiple regions, and decrease the latency for your audience in different parts of the world.

Did you find this page helpful?