
DNS fallback for better resilience

At Algolia, we are obsessed with building an architecture that is 99.9999% available. To get there, we have to make sure every piece of the architecture can fail safely without affecting the service.

The first point where a customer’s request interacts with our service is not the router in the datacenter but the DNS server that resolves a domain name to an IP address, a “long time” before that. This piece of the architecture is very often overlooked, which is no surprise, as you mostly get a best-effort DNS service automatically with your server.

Latency

For a couple of months we have been a happy user of NSONE, which provides us with the first level of logic. We use NSONE for its superb performance and its data-driven DNS, which gives us control in steering the traffic of our Distributed Search Network to the proper server, whether that means the closest one or simply an available one. But as with any other network-dependent service, there are factors outside of NSONE’s control that can influence the availability of its DNS resolution and, consequently, of Algolia. BGP routing is still largely magic, and the “optimizations” of some ISPs are beyond understanding; they do not always optimize in the direction we would like. For some services, a change in DNS resolution time from 10 ms to 500 ms does not mean a lot, but for us it is a deal breaker.

[Image: nsone-dig-latency]
Resolution of latency-1 via NSONE
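
To make the 10 ms versus 500 ms difference concrete, here is a minimal Python sketch of timing a single lookup from the client's side. It is only an illustration, not the way the measurement above was taken: it goes through the operating system's resolver (and any cache on the way) instead of querying NSONE's name servers directly, and the function names and the 100 ms budget are assumptions made for this example.

```python
import socket
import time
from typing import Callable


def resolution_time_ms(
    hostname: str,
    resolver: Callable[..., object] = socket.getaddrinfo,
) -> float:
    """Return how long one lookup of `hostname` takes, in milliseconds.

    `resolver` is injectable so the timing logic can be exercised
    without touching the network; by default the system resolver is used.
    """
    start = time.perf_counter()
    resolver(hostname, 443)  # the port only shapes the result, not the DNS lookup
    return (time.perf_counter() - start) * 1000.0


def is_too_slow(elapsed_ms: float, budget_ms: float = 100.0) -> bool:
    """Assumed budget: anything above `budget_ms` counts as a slow resolution."""
    return elapsed_ms > budget_ms
```

Calling `resolution_time_ms` with a real hostname performs a live lookup, so repeated calls will usually be faster than the first one because of caching.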

DDoS

When we started to think about our DNS dependency, we remembered the 2014 DDoS attack on UltraDNS, when there was not enough #hugops for all the services impacted. During the previous attack on UltraDNS, in 2009, even big names like Amazon and Salesforce were impacted.

Solution

In most cases, the solution would be to add another DNS name server from a different provider and replicate the records. But not in ours. NSONE has some unique features that we would have to give up in order to find a common feature subset with a different provider. In the end, we would be serving a portion of DNS resolutions via a slower provider for no good reason.

Since we provide custom-made API clients, we have one more place to put additional logic. The time came to choose a resilient provider for our secondary DNS, and since we like AWS, Route53 was a clear choice. Route53 has OK performance, many POPs around the world, and an API we already had an integration for.

At the last moment, one more paranoid idea came to us: let’s not rely on a single TLD. There was no specific reason for that; it was just a “what if…?” moment.

[Image: route53-dig-latency]
Resolution of latency-1 via Route53

Right now, all the latest versions of our API clients (detailed list below) use multiple domain names. “algolia.net” is served by NSONE and provides all the speed and intelligence; “algolianet.com” is served by Route53 and is used if, for any reason, contacting a server via “algolia.net” fails. It means more work and more cost on our side, but it also means better sleep for our customers, their customers, and us.
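
The client-side fallback described above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the code of any actual Algolia API client: `request_with_fallback`, `AllHostsFailed`, the `send` callable, and the `myapp` host prefix are hypothetical names. Only the two domains, algolia.net and algolianet.com, come from the text.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Hypothetical host names: primary domain first, fallback domain second.
HOSTS = ["myapp.algolia.net", "myapp.algolianet.com"]


class AllHostsFailed(Exception):
    """Raised when every host in the list was tried without success."""


def request_with_fallback(hosts: Sequence[str], send: Callable[[str], T]) -> T:
    """Try `send(host)` on each host in order and return the first success.

    OSError covers DNS resolution failures (socket.gaierror), timeouts,
    and connection errors, so a broken lookup on the primary domain
    moves the request on to the next domain.
    """
    errors: List[Tuple[str, OSError]] = []
    for host in hosts:
        try:
            return send(host)
        except OSError as exc:
            errors.append((host, exc))
    raise AllHostsFailed(errors)


if __name__ == "__main__":
    # Simulated transport: pretend the primary domain cannot be resolved.
    def simulated_send(host: str) -> str:
        if host.endswith(".algolia.net"):
            raise OSError("simulated DNS failure")
        return "answered by " + host

    print(request_with_fallback(HOSTS, simulated_send))
```

In a real client, `send` would perform the HTTPS request against the given host; keeping it as a parameter here lets the retry order be shown without any network access.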

And now we can think about what else can fail…

Minimum versions of the API clients with support for multiple DNS domains:

About the author

Adam Surak

Director of Infrastructure & Security @ Algolia
