17 Oct 2018

Clusters


What is an Algolia cluster?

  • A set of 3 servers “clustered” together to handle all requests.
  • Each server is equal to the others; that is, each one is equally capable of responding to every request.
  • For this to be possible, each server must have the same data and settings, and the same system configuration, thereby enabling the cluster to behave as a single server.
  • Why is this done? For redundancy - so that if 1 or 2 servers go down, the cluster is still available. This is how we guarantee an SLA of 99.99% availability.

Algolia clusters in more detail

Algolia has over 400 clusters in 15 regions and 60 data centers, with each cluster consisting of 3 bare metal servers.

A cluster of 3 servers acts as 1: at any moment, any one of them is ready to handle the next request, while the other 2 remain on call to process the requests that follow.

We refer to this as a 3-way partnership, in which all computers are of equal value, and each is configured with exactly the same software, system settings, and (generally speaking) the same hardware specifications.

And most importantly, they contain exactly the same data and index settings.

Every Algolia customer uses a cluster for all search and indexing operations. Take a customer who has an immense clothing collection, with one of the most active retail websites in the world. At any given moment, they can be updating their indices while 1000s of clients are searching for different clothing items. Each request is balanced and evenly distributed, so that all 3 servers are sharing the load.
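
To make this concrete, here is a minimal sketch (in Python, with made-up host names) of the client-side behavior this enables: requests rotate across the 3 equal servers, and if one is unreachable the next one answers. A similar retry strategy is built into the official Algolia API clients.

```python
import itertools
import requests

# Hypothetical host names for the 3 servers of one cluster - the real host
# names depend on your Algolia application, so these are assumptions.
HOSTS = [
    "server-1.example.net",
    "server-2.example.net",
    "server-3.example.net",
]

# Rotate the starting host so successive requests are spread across the cluster.
_round_robin = itertools.cycle(range(len(HOSTS)))

def send_request(path, payload, headers, timeout=2):
    """Try each of the 3 servers in turn; any one of them can answer."""
    last_error = None
    start = next(_round_robin)
    for offset in range(len(HOSTS)):
        host = HOSTS[(start + offset) % len(HOSTS)]
        try:
            response = requests.post(
                f"https://{host}{path}", json=payload, headers=headers, timeout=timeout
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException as error:
            last_error = error  # this server is unreachable or busy: try the next one
    raise RuntimeError("all 3 servers in the cluster failed") from last_error
```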

Why does Algolia use a cluster?

So why a cluster? Why 3 machines?

Let’s start off with what a cluster is not: It is not designed to optimize capacity. Algolia does not split a customer’s data across 3 computers, with each machine getting a third of the data. Admittedly, this would triple the capacity. But this is not the goal.

In fact, as regards capacity, Algolia does not need to do this, because generally speaking customers do not need more than one server to store their data - even if a customer has an enormous database. The indexing required for search needs only a small subset of a customer’s data, small enough for one server.

Additionally, this is not about concurrency, where the parts of one operation are split across different computers. Clusters are not designed for parallel computing: each server in a cluster processes the whole request independently.

Ultimately, a cluster is about redundancy, or reliability. For Algolia, reliability goes hand in hand with performance: a fast and relevant search is of little value if the search engine is unavailable.

Redundancy - What happens when a server goes down?

Or more precisely - What happens to a cluster when one or more of its servers go down?

For us, a server is “unavailable” when it is unreachable by the other servers. Unavailability can be caused by many things - a temporary network failure, a server too busy to respond, a machine that is physically down. Whatever the cause, what is important here is that the server is unreachable by the other servers in the cluster.

And as we will explain below with our consensus algorithm, synchronizing data within the cluster requires uninterrupted communication between all 3 servers; therefore, when one or more machines in a cluster become unreachable, that synchronization is put at risk.

When a server is unreachable

If 1 machine is unreachable, the other 2 will continue to function normally - processing both indexing and search requests. They can achieve consensus among themselves, and when the 3rd returns, it can be properly updated and synchronized with the same index as the other 2.

Unfortunately, while a server might be unreachable to the other servers in the cluster, it might nonetheless be reachable by your own servers - that is, it might still be able to receive indexing requests from them. This is a serious problem for synchronization: the “down” server has no idea what the other 2 servers are doing with their indices, so if it were to apply its own indexing changes without sharing them with the others, the cluster would end up with 2 very different sets of data. Chief among the many problems this discord causes is that the 2 diverging indices cannot be merged: any merge would result in a loss of data.

To handle this, we queue indexing jobs on any server that is unreachable. So while the other 2 servers continue to process their indexing jobs - in a certain order, synchronizing between them - the absent server puts any indexing jobs on hold (in a queue) and processes them only once the whole cluster is back together.

Regarding Search operations

Meanwhile, with regard to search, server-to-server communication is less important; therefore, as long as a server is functional, we allow it to process search requests.
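
Here is a simplified sketch of this behavior - not Algolia’s actual implementation, and with illustrative class and method names - showing a server that queues indexing jobs while it cannot reach its peers, keeps answering search requests, and replays the queue once the cluster is whole again.

```python
from collections import deque

class ClusterNode:
    """Toy model of one server in a 3-server cluster (illustrative only)."""

    def __init__(self):
        self.index = {}                # objectID -> record
        self.pending_writes = deque()  # indexing jobs held while isolated
        self.peers_reachable = True    # a real system sets this via health checks

    def handle_indexing(self, object_id, record):
        if not self.peers_reachable:
            # Cannot synchronize with the other 2 servers, so queue the job
            # instead of applying it - this keeps the indices from diverging.
            self.pending_writes.append((object_id, record))
            return "queued"
        self.index[object_id] = record  # normally applied through cluster consensus
        return "applied"

    def handle_search(self, query):
        # Search needs no server-to-server communication, so it is served
        # even while this server cannot reach its peers.
        return [r for r in self.index.values() if query.lower() in str(r).lower()]

    def rejoin_cluster(self):
        # Once the whole cluster is back together, replay the queued jobs in order.
        self.peers_reachable = True
        while self.pending_writes:
            object_id, record = self.pending_writes.popleft()
            self.index[object_id] = record
```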

From Servers to Clusters to DSN

As discussed, there are multiple reasons to use clusters.

  • Availability: If 1 or 2 servers go down, the users of your website won’t be affected. Search is always available. We’ve never had all 3 servers go down at the same time.
  • Redundancy: Having 3 live copies of your data makes it very unlikely that we’ll ever lose it.

Consensus of 3 servers

We knew from the very beginning that to achieve this kind of reliability, we would need a 3-server cluster. Initially, we used 1 server per region to process every indexing and search operation. Our focus was on the machine and how to improve its performance. However, we also needed reliability, so we quickly switched over to a cluster.

Cluster history

Clusters require a solid consensus algorithm to ensure that each server contains the same data at all times, without service interruption. We went with the RAFT algorithm. RAFT coordinates all index input - adding, updating, and deleting index data - so that all machines in a cluster get updated at the same time.
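
The core idea can be sketched in a few lines: an indexing operation is committed only once a majority of the 3 servers (that is, at least 2) acknowledge it, so a single unreachable machine never blocks the cluster. This is only an illustration of majority-based commit - real RAFT adds leader election, terms, and log repair, and this is not Algolia’s implementation.

```python
class Replica:
    """One server's copy of the index log (illustrative only)."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.log = []

    def append(self, entry):
        if not self.reachable:
            return False  # an unreachable server cannot acknowledge the write
        self.log.append(entry)
        return True

def replicate(entry, replicas):
    """Commit an indexing operation only if a majority (2 of 3) acknowledge it."""
    acks = sum(1 for replica in replicas if replica.append(entry))
    majority = len(replicas) // 2 + 1
    return acks >= majority

cluster = [Replica("server-1"), Replica("server-2"), Replica("server-3", reachable=False)]
print(replicate({"action": "addObject", "objectID": "42"}, cluster))  # True: 2 of 3 acknowledged
```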

Distance Counts

We next needed to put the cluster’s servers in different data centers. For a simple reason: When servers share the same data center or same power lines, a single flood or power outage can bring down the entire cluster. Thus, to ensure cluster reliability, we needed to separate the servers so that no single act (power outage or otherwise) could bring down the whole cluster. We succeeded in doing this by adding new data centers in neighboring regions with no physical links. So for example, we have servers in the same cluster separated by more than 300 km.

Additionally, we chose our internet service providers (ISPs) carefully. Sharing a network is the single greatest cause of system downtime. So part of creating distance is to address network issues, and we do this by ensuring that no two servers in the same cluster use the same ISP.

And we were able to add these distances without affecting the important RAFT consensus among the machines.

Extending the cluster with DSN

Finally, for customers with a worldwide client base, we introduced a Distributed Search Network (DSN). While we discuss DSN in more detail elsewhere, it is useful here to see how it fits in with our cluster model.

DSN adds one or more satellite servers to a cluster, thereby extending a customer’s reach into other regions. Every DSN server contains the full data and settings of the cluster. Take the example of a cluster on the East Coast of the US. An East Coast customer can add a DSN server on the West Coast, bringing the server closer to its West Coast clients. This reduces network latency (between client and server), which improves performance. Additionally, DSN can be used to share the load of large cluster activity: a customer can offload requests to the DSN whenever its cluster(s) reach peak usage.
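
A rough sketch of how a DSN satellite fits into request routing - with made-up host names and latencies - would look like this: searches go to the lowest-latency host, while indexing always targets the main cluster, which then replicates to the DSN.

```python
# Host names and latency numbers are made up for illustration; in practice
# the Algolia API clients and DNS handle this routing for you.
CLUSTER_HOSTS = ["east-1.example.net", "east-2.example.net", "east-3.example.net"]
DSN_HOSTS = ["west-dsn.example.net"]

def pick_search_host(measured_latency_ms):
    """Send searches to the lowest-latency host, whether cluster server or DSN satellite."""
    candidates = CLUSTER_HOSTS + DSN_HOSTS
    return min(candidates, key=lambda host: measured_latency_ms.get(host, float("inf")))

def pick_indexing_host():
    """Indexing targets the main cluster; the cluster then replicates to the DSN."""
    return CLUSTER_HOSTS[0]

# A West Coast client measures much lower latency to the nearby DSN satellite:
latencies = {"east-1.example.net": 80, "east-2.example.net": 82,
             "east-3.example.net": 85, "west-dsn.example.net": 15}
print(pick_search_host(latencies))   # -> west-dsn.example.net
print(pick_indexing_host())          # -> east-1.example.net
```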

Monitoring and locating Algolia’s clusters and servers

Monitoring

You can monitor your servers and clusters via the dashboard. Go to Dashboard Icon -> API Status, then click on the cluster name.

You can also monitor (and configure) your DSNs. Go to the Infrastructure Icon.

Finally, for Enterprise customers, we have a Monitoring API that provides a window into all your cluster and DSN activity.
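
As a rough illustration (the endpoint path and headers here are assumptions made for this sketch, so check the Monitoring API reference for the exact routes available on your plan), a status check might look like this:

```python
import requests

APP_ID = "YOUR_APP_ID"            # placeholder
API_KEY = "YOUR_MONITORING_KEY"   # placeholder

response = requests.get(
    "https://status.algolia.com/1/status",  # assumed endpoint for current cluster status
    headers={
        "X-Algolia-Application-Id": APP_ID,
        "X-Algolia-API-Key": API_KEY,
    },
    timeout=5,
)
response.raise_for_status()
print(response.json())
```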

Where are the clusters and servers located?

We are obsessed with high performance and delivering the best user experience. For these reasons, we have decided to deploy a distributed architecture with several clusters around the world.

Our 400 clusters are currently located in 15 different regions and 60 different data centers worldwide:

  • US-East (Virginia): two different Equinix data centers in Ashburn & COPT DC-6 in Manassas (three independent Autonomous Systems)
  • US-West (California): three different Equinix data centers in San Jose (three independent Autonomous Systems)
  • US-Central (Texas): two different data centers in Dallas (two independent Autonomous Systems)
  • Europe (France): four different data centers in Roubaix, two different data centers in Strasbourg, and one data center in Gravelines
  • Europe (Netherlands): four different data centers around Amsterdam
  • Europe (Germany): seven different data centers in Falkenstein and one data center in Frankfurt (two independent Autonomous Systems)
  • Canada: four different data centers in Beauharnois
  • Singapore: two different data centers in Singapore (two independent Autonomous Systems)
  • Brazil: three different data centers around São Paulo (two independent Autonomous Systems)
  • Japan: one data center in Tokyo and one data center in Osaka
  • Russia: one data center in Moscow
  • Australia: three data centers in Sydney (two independent Autonomous Systems)
  • India: one data center in Noida
  • Hong Kong: two different data centers (two independent Autonomous Systems)
  • South Africa: two data centers in Johannesburg (two independent Autonomous Systems)

When you create your account, you can pick which region you want to use. Also, you can use our DSN feature to distribute your search engine in multiple regions, and decrease the latency for your audience in different parts of the world.
