Servers and Clusters

Bare metal servers

While the vast majority of SaaS services choose virtualization, Algolia uses bare metal servers. Bare metal servers give applications direct access to the physical and software resources of a computer. For example, the Algolia engine, which processes all search and indexing operations, interacts directly with the computer’s essential resources, such as the operating system, CPU, RAM, and disk.

With a virtual machine, all applications need to pass through one or more additional layers of system-level software before reaching the services of the underlying physical server. While this slows things down, it also brings flexibility by spreading a single server’s capacity over many discrete use cases. One user might want to run a massive SQL Server database on a Windows computer. Another user might want to perform CPU-intensive calculations using an old version of Unix. Another might want to simulate a macOS machine. All these can happen on the same shared server, using virtualization.

For Algolia, virtual machines aren’t necessary. Bare metal servers have handled time slicing and task slicing for years, without the need to virtualize. With powerful hardware components, a single server can handle countless users, especially if they’re all doing the same operations, which is the case with Algolia.

Many larger users use dedicated servers, giving them exclusive access to the entire cluster. Most smaller accounts share servers, which means that they share the same Algolia engine. In both situations, the principle is the same: Algolia operates directly on the machine.

Note that Algolia uses the term server to refer to a cluster of three identical servers. With clusters, Algolia can provide an SLA reliability of 99.99%, because a cluster guarantees that at least one of the three servers is always available.

Algolia clusters

What is an Algolia cluster?

An Algolia cluster is a set of three servers “clustered” together to handle all requests. All servers are equal; that is, each one is equally capable of responding to every request. For this to be possible, each server must have the same data and index settings, as well as the same overall system configuration, thereby enabling the cluster to behave as a single server.

The reason behind clusters is redundancy: if one or two servers go down, the service is still available. This redundancy guarantees an SLA of 99.99% availability.
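To put the 99.99% figure in perspective, the following sketch converts an availability percentage into a downtime budget. The function name and the 365-day year are assumptions made for illustration; they aren't taken from Algolia's SLA wording.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 365) -> float:
    """Return the maximum downtime, in minutes, allowed over `days` days."""
    total_minutes = days * 24 * 60
    unavailable_fraction = 1 - availability_percent / 100
    return total_minutes * unavailable_fraction


yearly = downtime_budget_minutes(99.99)
monthly = downtime_budget_minutes(99.99, days=30)
print(f"{yearly:.1f} minutes per year, {monthly:.1f} minutes per 30 days")
```

Under these assumptions, 99.99% availability leaves a budget of roughly 52.6 minutes of downtime per year.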

Algolia clusters in more detail

Algolia has over 400 clusters in 15 regions and 60 data centers, with each cluster consisting of three bare metal servers.

A cluster of three servers acts as one. Each server is ready to serve at any moment: while one handles the current request, the other two stand by to process the requests that follow.

This is a three-way partnership, in which all servers are of equal value, and each is configured with the same software, system settings, and (generally speaking) the same hardware specifications. Most importantly, they have the same data and index settings.

Every Algolia user uses a cluster for all search and indexing operations. Take a user with one of the most active retail websites in the world, and an immense clothing collection. At any given moment, they can update their indices while thousands of clients are searching for different clothing items. Each request is balanced and evenly distributed, so that all three servers share the load.
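As a rough illustration of even distribution, the sketch below rotates requests across three servers in round-robin order. The host names and the round-robin policy are assumptions; the text above doesn't specify how Algolia balances requests.

```python
from collections import Counter
from itertools import cycle

# Hypothetical host names for the three servers of one cluster.
SERVERS = ["server-1", "server-2", "server-3"]


def assign_requests(request_ids, servers=SERVERS):
    """Pair each request with the next server in round-robin order."""
    rotation = cycle(servers)
    return [(request_id, next(rotation)) for request_id in request_ids]


assignments = assign_requests(range(9))
load = Counter(server for _, server in assignments)
print(load)
```

With nine requests, each of the three servers ends up handling three of them.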

Why use clusters

To be clear, clusters aren’t designed to optimize capacity. Algolia doesn’t split a user’s data across three computers, with each machine getting a third of the data. Admittedly, this would triple the capacity, but that isn’t the goal.

In fact, Algolia doesn’t need to split data for capacity, because most users don’t need more than a single server to store their data, even with enormous databases. Search needs only a small subset of a user’s data, and that subset is small enough for one server.

Clusters aren’t about concurrency either, where parts of a single operation are split across different computers. They also aren’t designed for parallel computing: each server in a cluster processes the whole request independently.

Clusters are about redundancy. For Algolia, performance goes hand-in-hand with reliability: a fast and relevant search is of little value if the search engine is unavailable.

What happens when a server goes down

A server is unavailable when the other servers in the cluster can’t reach it. This can happen for many reasons: a temporary network failure, a server that is too busy to respond, or a server that is physically down. Synchronizing data within the cluster requires uninterrupted communication between all three servers, as explained below with the consensus algorithm. When one or more machines in a cluster are unreachable, the synchronization process is at risk.

If a machine is unreachable, the other two continue to function normally, processing both indexing and search requests. They can achieve consensus among themselves, and when the third returns, it can synchronize with the same index as the other two.

While a server might be unreachable to the other servers in the cluster, you might still be able to reach it. That is, the server might still be able to receive indexing requests from your own servers. This is a serious problem for synchronization: the “down” server has no idea what the other two servers are doing with their indices. If it were to apply its own indexing changes without sharing those changes with the others, the overall cluster would end up with two different datasets.

To handle this, Algolia queues indexing jobs on any unreachable server. While the other two servers continue to process their indexing jobs and synchronize together, the absent server puts any indexing jobs on hold, and processes them only once the whole cluster is back together.
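The queuing behavior described above can be sketched with a toy model. The class and attribute names are hypothetical, and the model leaves out the step where the returning server also catches up on the changes its peers made; it only shows jobs being held while the server is isolated and processed once it rejoins.

```python
from collections import deque


class ServerNode:
    """Toy model of one server; not Algolia's actual implementation."""

    def __init__(self, name):
        self.name = name
        self.reachable_by_peers = True
        self.pending = deque()  # jobs held while isolated from the cluster
        self.applied = []       # jobs applied to the local index

    def receive_indexing_job(self, job):
        if self.reachable_by_peers:
            self.applied.append(job)
        else:
            # Isolated: hold the job instead of changing the index alone.
            self.pending.append(job)

    def rejoin_cluster(self):
        """Mark the server reachable again and process held jobs in order."""
        self.reachable_by_peers = True
        while self.pending:
            self.applied.append(self.pending.popleft())


node = ServerNode("server-3")
node.reachable_by_peers = False
node.receive_indexing_job("add record 1")
node.receive_indexing_job("update record 2")
held_while_isolated = list(node.pending)
node.rejoin_cluster()
```

A first-in, first-out queue is used here so that held jobs are processed in the order they arrived.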

Availability over consistency

There is a common tradeoff for hosted services between availability and consistency of data.

  • Availability: constant access to your data, no service outage.
  • Consistency: the same data everywhere at the same time (for example, all users getting the same search results at the same time).

Algolia has chosen availability over consistency, because when someone searches, they should get results without failure. Algolia considers small data differences between users to be less critical than users not getting results at all.

For technical reasons, achieving these two goals with equal success is impossible (see the CAP theorem and eventual consistency). Among these reasons is the following: every client gets three servers to service all search requests, which guarantees that at least one server is always available. If Algolia were to hold back searches until all three servers had exactly the same data, every search would have to wait for synchronization to finish.

That said, synchronizing data between three servers takes seconds or less. Users therefore don’t often experience data discrepancies.

Search operations

For search, server-to-server communication is less important. As long as a server is functional, it can process search requests.
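Because any functional server can answer a search, a client can fall back to another server when one doesn't respond. The sketch below illustrates that idea; `send` is a hypothetical transport function, and the host names and retry order are assumptions rather than a description of Algolia's API clients.

```python
def search_with_failover(query, servers, send):
    """Try each server in turn and return the first successful response.

    `send(server, query)` is a hypothetical transport function that raises
    ConnectionError when the server can't be reached.
    """
    last_error = None
    for server in servers:
        try:
            return send(server, query)
        except ConnectionError as error:
            last_error = error  # remember the failure, try the next server
    raise RuntimeError("no server in the cluster responded") from last_error


# Simulated transport: the first two servers are down, the third answers.
def fake_send(server, query):
    if server in {"server-1", "server-2"}:
        raise ConnectionError(f"{server} is unreachable")
    return {"server": server, "query": query, "hits": []}


result = search_with_failover(
    "blue jacket", ["server-1", "server-2", "server-3"], fake_send
)
print(result["server"])
```

In this simulation the search still succeeds with two of the three servers down, and it fails only when every server is unreachable.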

From servers to clusters to Distributed Search Network

There are multiple reasons to use clusters.

  • Availability: if one or two servers go down, your users aren’t affected, and search is always available.
  • Redundancy: having three live copies of your data makes it unlikely that it can be lost.

Consensus of three servers

Clusters require a robust consensus algorithm to ensure that each server always has the same data, without service interruption. Algolia uses the Raft algorithm. Raft coordinates all index input—adding, updating, and deleting index data—so that all machines in a cluster update at the same time.
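Raft commits a change once a majority of servers has stored it, which is why two of three servers can keep indexing when the third is unreachable. The sketch below shows only that majority rule; it leaves out leader election and log replication, and the function names are illustrative.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


def is_committed(acknowledgements: int, cluster_size: int = 3) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acknowledgements >= majority(cluster_size)


print(majority(3))
print(is_committed(2))
print(is_committed(1))
```

For a three-server cluster the majority is two, so a write acknowledged by two servers is committed, while a write seen by only one server is not.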

Distance counts

When servers share the same data center or same power lines, a single flood or power outage can bring down the entire cluster. Thus, to ensure cluster reliability, Algolia separates the servers so that no single incident could bring the whole cluster down. This is done by adding new data centers in neighboring regions with no physical links. For example, Algolia has servers belonging to the same cluster, but in data centers more than 300 km apart.

Additionally, Algolia carefully considers internet service providers (ISPs). Sharing a network is the single greatest cause of system downtime, so part of creating distance is to address network issues. Algolia does this by ensuring that no two servers within the same cluster use the same ISP. These distances don’t affect the Raft consensus among machines.
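The separation rules above amount to a simple check: within one cluster, no data center and no ISP should appear more than once. The sketch below expresses that check; the `Placement` type and the sample names are hypothetical and don't reflect Algolia's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    server: str
    data_center: str
    isp: str


def shared_failure_points(placements):
    """Return data centers and ISPs used by more than one server."""
    problems = []
    for attribute in ("data_center", "isp"):
        seen = {}
        for placement in placements:
            value = getattr(placement, attribute)
            seen.setdefault(value, []).append(placement.server)
        for value, servers in seen.items():
            if len(servers) > 1:
                problems.append((attribute, value, servers))
    return problems


good = [
    Placement("server-1", "dc-a", "isp-x"),
    Placement("server-2", "dc-b", "isp-y"),
    Placement("server-3", "dc-c", "isp-z"),
]
bad = [
    Placement("server-1", "dc-a", "isp-x"),
    Placement("server-2", "dc-b", "isp-x"),  # same ISP as server-1
    Placement("server-3", "dc-c", "isp-z"),
]
print(shared_failure_points(good))
print(shared_failure_points(bad))
```

The first layout produces no findings, while the second reports that server-1 and server-2 share an ISP.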

Extending the cluster with the Distributed Search Network

For users with a worldwide client-base, Algolia provides the Distributed Search Network (DSN).

DSN adds one or more satellite servers to a cluster, thereby extending a user’s reach into other regions. Every DSN server contains the full data and settings of the cluster.

Take the example of a cluster on the east coast of the United States. An east coast user can add a DSN server in the west coast region to bring the server closer to their west coast clients. This reduces network latency between the client and the server, which improves performance. Additionally, you can use DSN to share the load of large cluster activity: a user can offload requests to the DSN whenever their cluster or clusters reach peak usage.
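The latency benefit can be illustrated by choosing the server with the lowest round-trip time for a given client. The server names and latency figures below are made up, and the text above doesn't describe how Algolia actually routes a request to a DSN server.

```python
def pick_lowest_latency(latencies_ms):
    """Return the name of the server with the lowest measured latency."""
    return min(latencies_ms, key=latencies_ms.get)


# Made-up round-trip times, in milliseconds, for a west coast client.
west_coast_client = {"us-east-cluster": 72.0, "us-west-dsn": 11.0}
print(pick_lowest_latency(west_coast_client))
```

With these sample numbers, the west coast client is served by the west coast DSN server rather than the east coast cluster.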

Monitoring and locating Algolia’s clusters and servers

You can monitor your servers and clusters via the dashboard, under Monitoring > Status.

You can also monitor and configure your DSNs, under Infra > DSN.

Finally, the Monitoring and Usage APIs provide a window into all your cluster and DSN activity.

Where are the clusters and servers located?

Algolia has deployed a distributed architecture with several clusters around the world. There are 400 clusters located in 16 different regions and 70 different worldwide data centers:

  • US-East (Virginia): two different Equinix data centers in Ashburn & COPT DC-6 in Manassas (three independent Autonomous Systems).

  • US-West (California): three different Equinix data centers in San Jose (three independent Autonomous Systems).

  • US-Central (Texas): two different data centers in Dallas (two independent Autonomous Systems).

  • Europe (France): four different data centers in Roubaix, two different data centers in Strasbourg and one data center in Gravelines.

  • Europe (Netherlands): four different data centers around Amsterdam.

  • Europe (Germany): seven different data centers in Falkenstein and one data center in Frankfurt (two independent Autonomous Systems).

  • Europe (UK): two different data centers in London (two independent Autonomous Systems).

  • Canada: four different data centers in Beauharnois.

  • Middle East: one data center in Dubai.

  • Singapore: two different data centers in Singapore (two independent Autonomous Systems).

  • Brazil: three different data centers around São Paulo (two independent Autonomous Systems).

  • Japan: one data center in Tokyo and one data center in Osaka.

  • Australia: three data centers in Sydney (two independent Autonomous Systems).

  • India: one data center in Noida.

  • Hong Kong: two different data centers (two independent Autonomous Systems).

  • South Africa: two data centers in Johannesburg (two independent Autonomous Systems).

When you create your account, you can pick which region you want to use. You can also use the DSN feature to distribute your search engine in multiple regions, and decrease the latency for your audience in different parts of the world.
