Servers and Clusters

Bare metal servers

While the vast majority of SaaS services choose virtualization, Algolia has decided to use bare metal servers. Bare metal servers give applications direct access to the physical and software resources of a computer. For example, the Algolia engine, which processes all searching and indexing operations, interacts directly with the computer’s essential resources, such as the operating system, CPU, RAM, and disk.

With a virtual machine, however, all applications need to pass through one or more additional layers of system-level software before reaching the services of the underlying physical server. While this clearly slows things down, it also brings flexibility by spreading a single server’s capacity over many discrete use cases. One customer might want to run a massive SQL Server database on a Windows computer; another customer might want to perform CPU-intensive calculations using an old version of Unix; and another might want to simulate a macOS machine. All of this can happen on the same shared server, using virtualization.

For Algolia, virtual machines aren’t necessary. Bare metal servers have been time and task-slicing for years, without the need to virtualize. With powerful hardware components, a single server can handle countless customers—especially if they’re all doing the same operations, which is the case with Algolia.

Many of our larger customers use dedicated servers, giving them exclusive access to the entire cluster. Most of our smaller accounts share servers, which means that they share the same Algolia engine. However, in both situations, the principle is the same: Algolia operates directly on the machine.

Note that when we say server, we are actually referring to a cluster of three identical servers. With clusters, Algolia can provide an SLA of 99.99% availability, because a cluster guarantees that at least one of the three servers is available at all times.

Algolia clusters

What is an Algolia cluster?

An Algolia cluster is a set of three servers “clustered” together to handle all requests. All servers are equal; that is, each one is equally capable of responding to every request. For this to be possible, each server must have the same data and index settings, as well as the same overall system configuration, thereby enabling the cluster to behave as a single server.

The reason behind clusters is redundancy: if one or two servers go down, the service is still available. This is how we guarantee an SLA of 99.99% availability.
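To put the 99.99% figure in perspective, the arithmetic below converts an availability percentage into a downtime budget. It is a minimal sketch: the function name and the choice of a 365-day year and a 30-day month are assumptions made for illustration, not part of any Algolia SLA definition.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365.0) -> float:
    """Return the maximum downtime, in minutes, that an availability SLA permits."""
    minutes_in_period = period_days * 24 * 60
    return minutes_in_period * (1 - sla_percent / 100)


yearly = allowed_downtime_minutes(99.99)
monthly = allowed_downtime_minutes(99.99, period_days=30)
print(f"per year: {yearly:.2f} min, per 30 days: {monthly:.2f} min")
```

Under these assumptions, 99.99% availability leaves a budget of roughly 52.56 minutes per year, or about 4.32 minutes per 30 days.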

Algolia clusters in more detail

Algolia has over 400 clusters in 16 regions and 70 data centers, with each cluster consisting of three bare metal servers.

A cluster of three servers acts as one: at any moment, each server is ready to handle the next request, while the other two stand by to process the requests that follow.

We refer to this as a three-way partnership, in which all servers are of equal value, and each is configured with the same software, system settings, and (generally speaking) the same hardware specifications. Most importantly, they contain the same data and index settings.

Every Algolia customer uses a cluster for all search and indexing operations. Take a customer with one of the most active retail websites in the world, and an immense clothing collection. At any given moment, they can update their indices while thousands of clients are searching for different clothing items. Each request is balanced and evenly distributed, so that all three servers share the load.
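The idea of spreading requests evenly over three equal servers can be sketched as follows. The text doesn't say which balancing strategy is used, so this example assumes plain round-robin; the server names are hypothetical.

```python
from collections import Counter
from itertools import cycle

# Hypothetical names for the three servers of one cluster.
SERVERS = ["server-1", "server-2", "server-3"]


def distribute(requests, servers):
    """Assign each request to the next server in rotation (round-robin)."""
    rotation = cycle(servers)
    return [(request, next(rotation)) for request in requests]


assignments = distribute(range(9000), SERVERS)
load = Counter(server for _, server in assignments)
print(load)
```

With 9,000 requests, each of the three servers ends up handling 3,000 of them. Because every server holds the same data and settings, any of them can answer any request, which is what makes this kind of even split possible.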

Why we use clusters

Let’s start off with what a cluster is not: it’s not designed to optimize capacity. Algolia doesn’t split a customer’s data across three computers, with each machine getting a third of the data. Admittedly, this would triple the capacity, but that isn’t the goal.

In fact, Algolia doesn’t need the extra capacity, because most customers don’t need more than a single server to store their data, even with enormous databases. When it comes to search, all we need is a small subset of a customer’s data, small enough for one server.

Clusters are not about concurrency either, where parts of a single operation are split across different computers. They also aren’t designed for parallel computing: each server in a cluster processes the whole request independently.

Ultimately, a cluster is about redundancy. For Algolia, performance goes hand-in-hand with reliability: a fast and relevant search is of little value if the search engine is unavailable.

What happens when a server goes down

We define a server as unavailable when other servers in the cluster can’t reach it. This can happen for many reasons: a temporary network failure, a server that is too busy to respond, or a server that is physically down. Synchronizing data within the cluster requires uninterrupted communication between all three servers, as we explain below with our consensus algorithm. When one or more machines in a cluster are unreachable, the synchronization process is at risk.

If a machine is unreachable, the other two continue to function normally, processing both indexing and search requests. They can achieve consensus among themselves, and when the third returns, it can synchronize with the same index as the other two.

Unfortunately, while a server might be unreachable to the other servers in the cluster, you might still be able to reach it—that is, the server might still be able to receive indexing requests from your own servers. This is a serious problem for synchronization: the “down” server has no idea what the other two servers are doing with their indices. If it were to start using its own indexing changes without sharing those changes with the others, the overall cluster would end up with two different datasets.

To handle this, we queue indexing jobs on any unreachable server. While the other two servers continue to process their indexing jobs and synchronize together, the absent server puts any indexing jobs on hold, and processes them only once the whole cluster is back together.
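The hold-and-release behavior described above can be modeled with a small toy class. This is only a sketch of the idea, not Algolia’s implementation: the `Server` class, its attributes, and the job strings are all hypothetical, and applying a job directly when the server is reachable stands in for the real consensus step.

```python
from collections import deque


class Server:
    """Toy model of one cluster member; not Algolia's actual implementation."""

    def __init__(self, name):
        self.name = name
        self.reachable = True   # can the rest of the cluster reach this server?
        self.index = []         # indexing jobs already applied, in order
        self.on_hold = deque()  # indexing jobs received while isolated

    def receive(self, job):
        """Apply the job if the server is in sync, otherwise put it on hold."""
        if self.reachable:
            self.index.append(job)
        else:
            self.on_hold.append(job)

    def rejoin(self, cluster_index):
        """Catch up with the cluster, then release the held jobs for processing."""
        self.index = list(cluster_index)
        released = list(self.on_hold)
        self.on_hold.clear()
        self.reachable = True
        return released


isolated = Server("server-3")
isolated.reachable = False
isolated.receive("add record 42")    # held, not applied
cluster_index = ["update record 7"]  # work done meanwhile by the other two servers
released = isolated.rejoin(cluster_index)
print(isolated.index, released)
```

While isolated, the server never applies its own jobs, so its index can’t diverge from the other two. Once it rejoins, it first adopts the cluster’s state and only then hands over the jobs it was holding.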

Availability over consistency

What we’re describing here is the common tradeoff for hosted services between availability and consistency of data.

  • Availability: constant access to your data, no service outage.
  • Consistency: the same data everywhere at the same time (e.g., all users getting the same search results at the same time).

Algolia has chosen availability over consistency, because when someone searches, they should get results without failure. We consider small data differences between users to be less critical than users not getting results at all.

For many technical reasons, achieving both goals equally well is impossible (see the CAP theorem and eventual consistency). Among these reasons is the following: every client gets three servers to service all search requests, which guarantees that at least one server is available at all times. However, if we were to delay searches until all three servers had exactly the same data, this would cause numerous delays, thereby undermining our milliseconds guarantee.

That being said, synchronizing data between three servers takes seconds or less. Users therefore don’t often experience data discrepancies.

Search operations

Meanwhile, with regard to search, server-to-server communication is less important. Therefore, as long as a server is functional, we allow it to process search requests.

From servers to clusters to DSN

There are multiple reasons to use clusters.

  • Availability: if one or two servers go down, your users aren’t affected, and search is always available. We’ve never had all three servers go down at the same time.
  • Redundancy: having three live copies of your data makes it unlikely that we ever lose it.
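One way to picture why users aren’t affected when one or two servers go down is a caller that simply tries the next server. The sketch below assumes such a failover loop; the function names and the simulated servers are hypothetical and don’t represent Algolia’s API clients.

```python
def search_with_failover(query, servers):
    """Return the first successful response, trying the servers in order."""
    last_error = None
    for name, search in servers:
        try:
            return name, search(query)
        except ConnectionError as error:
            last_error = error  # this server is down; try the next one
    raise RuntimeError("no server in the cluster is reachable") from last_error


def down(query):
    """Simulate a server that can't be reached."""
    raise ConnectionError("server unreachable")


def up(query):
    """Simulate a healthy server returning one hit."""
    return [f"result for {query!r}"]


cluster = [("server-1", down), ("server-2", down), ("server-3", up)]
served_by, hits = search_with_failover("red shirt", cluster)
print(served_by, hits)
```

Here two of the three simulated servers are down, and the search is still answered by the third. Only when every server fails does the caller see an error.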

Consensus of three servers

We knew early on that to achieve this kind of reliability, we would need a three-server cluster. Initially, we used one server per region to process every indexing and search operation. Our focus was on the machine and how to improve its performance. However, we also needed reliability, so we quickly switched over to a cluster.

Clusters require a robust consensus algorithm to ensure that each server contains the same data at all times, without service interruption. We went with the Raft algorithm. Raft coordinates all index input (adding, updating, and deleting index data) so that all machines in a cluster update at the same time.
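Raft commits a change once a majority of the servers has acknowledged it, which is why two reachable servers out of three can keep processing indexing jobs while a single isolated server cannot. The majority rule itself is small enough to sketch; the function names are illustrative only.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return cluster_size // 2 + 1


def can_commit(reachable_servers: int, cluster_size: int = 3) -> bool:
    """True if enough servers can talk to each other to agree on a change."""
    return reachable_servers >= quorum_size(cluster_size)


for reachable in (3, 2, 1):
    print(reachable, "reachable ->", "can commit" if can_commit(reachable) else "must wait")
```

For a three-server cluster the quorum is two, so losing one server doesn’t block indexing, while a server cut off from both of its peers has to wait.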

Distance counts

When servers share the same data center or same power lines, a single flood or power outage can bring down the entire cluster. Thus, to ensure cluster reliability, we separated the servers so that no single incident could bring the whole cluster down. We did so by adding new data centers in neighboring regions with no physical links. For example, we have servers belonging to the same cluster, but in data centers more than 300 km apart.

Additionally, we chose our internet service providers (ISPs) carefully. Sharing a network is the single greatest cause of system downtime, so part of creating distance is to address network issues. We do this by ensuring that no two servers within the same cluster use the same ISP.
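The separation rules above (no shared data center, no shared ISP) amount to a check for shared dependencies within a cluster. The following is a minimal sketch of such a check; the `Placement` type and all data center and ISP names are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one server of a cluster is hosted (hypothetical fields)."""
    server: str
    data_center: str
    isp: str


def shared_dependencies(placements):
    """List every data center or ISP that more than one server relies on."""
    shared = []
    for attribute in ("data_center", "isp"):
        counts = Counter(getattr(p, attribute) for p in placements)
        shared += [(attribute, value) for value, n in counts.items() if n > 1]
    return shared


well_separated = [
    Placement("server-1", "dc-a", "isp-x"),
    Placement("server-2", "dc-b", "isp-y"),
    Placement("server-3", "dc-c", "isp-z"),
]
shares_an_isp = [
    Placement("server-1", "dc-a", "isp-x"),
    Placement("server-2", "dc-b", "isp-y"),
    Placement("server-3", "dc-c", "isp-x"),
]
print(shared_dependencies(well_separated))
print(shared_dependencies(shares_an_isp))
```

An empty result means no single data center or ISP failure can take out more than one server of the cluster; any entry in the list points at a shared point of failure.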

We were able to add these distances without affecting the important Raft consensus among machines.

Extending the cluster with DSN

For customers with a worldwide client-base, we introduced our Distributed Search Network (DSN).

DSN adds one or more satellite servers to a cluster, thereby extending a customer’s reach into other regions. Every DSN server contains the full data and settings of a cluster.

Take the example of a cluster on the East Coast of the US. An East Coast customer can add a DSN server to the West Coast, to bring the server closer to their West Coast clients. This reduces network latency between the client and the server, which improves performance. Additionally, you can use DSN to share the load of large cluster activity: a customer can offload requests to the DSN whenever their cluster(s) reach peak usage.
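The latency benefit can be illustrated by picking, for a given client, whichever server answers fastest. This is a sketch only: the server labels and the latency figures are invented for the example, and it says nothing about how Algolia actually routes requests.

```python
def closest_server(latencies_ms):
    """Return the name of the server with the lowest measured latency."""
    return min(latencies_ms, key=latencies_ms.get)


# Made-up round-trip times, in milliseconds, as seen by a West Coast client.
west_coast_client = {"us-east-cluster": 70.0, "us-west-dsn": 12.0}
# The same idea from the point of view of an East Coast client.
east_coast_client = {"us-east-cluster": 9.0, "us-west-dsn": 68.0}

print(closest_server(west_coast_client))
print(closest_server(east_coast_client))
```

With these numbers, the West Coast client is served by the DSN server and the East Coast client by the original cluster, so each one talks to the nearest copy of the same data.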

Monitoring and locating Algolia’s clusters and servers

You can monitor your servers and clusters via the dashboard, under Monitoring > Status.

You can also monitor and configure your DSNs, under Infra > DSN.

Finally, for Enterprise customers, we have Monitoring and Usage APIs, which provide a window into all your cluster and DSN activity.

Where are the clusters and servers located?

We’re obsessed with high performance and delivering the best user experience. For those reasons, we’ve decided to deploy a distributed architecture with several clusters around the world.

Our 400 clusters are currently located in 16 regions and 70 data centers worldwide:

  • US-East (Virginia): two different Equinix data centers in Ashburn & COPT DC-6 in Manassas (three independent Autonomous Systems).

  • US-West (California): three different Equinix data centers in San Jose (three independent Autonomous Systems).

  • US-Central (Texas): two different data centers in Dallas (two independent Autonomous Systems).

  • Europe (France): four different data centers in Roubaix, two different data centers in Strasbourg and one data center in Gravelines.

  • Europe (Netherlands): four different data centers around Amsterdam.

  • Europe (Germany): seven different data centers in Falkenstein and one data center in Frankfurt (two independent Autonomous Systems).

  • Europe (UK): two different data centers in London (two independent Autonomous Systems).

  • Canada: four different data centers in Beauharnois.

  • Middle East: one data center in Dubai.

  • Singapore: two different data centers in Singapore (two independent Autonomous Systems).

  • Brazil: three different data centers around São Paulo (two independent Autonomous Systems).

  • Japan: one data center in Tokyo and one data center in Osaka.

  • Australia: three data centers in Sydney (two independent Autonomous Systems).

  • India: one data center in Noida.

  • Hong Kong: two different data centers (two independent Autonomous Systems).

  • South Africa: two data centers in Johannesburg (two independent Autonomous Systems).

When you create your account, you can pick which region you want to use. Also, you can use our DSN feature to distribute your search engine in multiple regions, and decrease the latency for your audience in different parts of the world.