At Algolia, we are obsessed with building an architecture that is 99.9999% available. To get there, we have to make sure every piece of the architecture can fail safely without affecting the service.
The first point where a customer’s request starts to interact with our service is not the router in the datacenter, but the DNS that resolves a domain name to an IP address a “long time” before that. This piece of the architecture is very often overlooked, which is no surprise, as you mostly get a best-effort DNS service automatically with your server.
For a couple of months we have been a happy user of NSONE, which provides us with the first level of logic. We use NSONE for its superb performance and its data-driven DNS, which gives us control in steering the traffic of our Distributed Search Network to the proper server, whether that means the closest one or simply an available one. But as with any other network-dependent service, there are factors outside of NSONE’s control that can influence the availability of its DNS resolution and consequently of Algolia. BGP routing is still largely magic, and the “optimizations” of some ISPs are beyond understanding: they do not always optimize in the direction we would like. For some services, a change in DNS resolution time from 10 to 500 ms does not mean a lot, but for us it is a deal breaker.
When we started to think about our DNS dependency, we remembered the 2014 DDoS attack on UltraDNS, when there was not enough #hugops for all the services impacted. During the previous attack on UltraDNS in 2009, even big names like Amazon and Salesforce were impacted.
In most cases, removing this dependency would mean adding another DNS name server from a different provider and replicating the records. But not in ours. NSONE has some unique features that we would have to give up in order to find a common feature subset with a different provider. In the end, we would be serving a portion of DNS resolutions via a slower provider for no good reason.
Since we provide custom-made API clients, we have one more place to put additional logic. The time came to choose a resilient provider for our secondary DNS, and since we like AWS, Route53 was a clear choice. Route53 has OK performance, many POPs around the world and an API we already had an integration for.
At the last moment, one more paranoid idea came to us: let’s not rely on a single TLD. There was no good reason for it, it was just a “what if…?” moment.
Right now, all the latest versions of our API clients (detailed list below) use multiple domain names. “algolia.net” is served by NSONE and provides all the speed and intelligence; “algolianet.com” is served by Route53 and is used if, for any reason, contacting a server via “algolia.net” fails. It brings more work and more cost on our side, but it also brings better sleep for our customers, their customers and us.
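To make the idea concrete, here is a minimal sketch in Python of what such client-side fallback could look like. It is not the code of our actual API clients: the `<app_id>.<domain>` host naming, the `candidate_hosts` and `request_with_fallback` helpers and the injectable `send` callable are all assumptions made for illustration. Only the two domain names come from the setup described above.

```python
from typing import Callable, List, Optional, Sequence

# The two domains named in the text: the primary is served by NSONE,
# the fallback by Route53 and lives under a different TLD.
PRIMARY_DOMAIN = "algolia.net"
FALLBACK_DOMAIN = "algolianet.com"


def candidate_hosts(app_id: str) -> List[str]:
    """Return hosts in the order they should be tried.

    The "<app_id>.<domain>" naming scheme is hypothetical.
    """
    return [f"{app_id}.{PRIMARY_DOMAIN}", f"{app_id}.{FALLBACK_DOMAIN}"]


def request_with_fallback(hosts: Sequence[str], send: Callable[[str], str]) -> str:
    """Call send(host) for each host in order and return the first success.

    `send` is a hypothetical callable that performs the real request.
    DNS resolution failures (socket.gaierror), timeouts and refused
    connections are all subclasses of OSError, so one except clause
    covers every case where the host could not be reached.
    """
    last_error: Optional[OSError] = None
    for host in hosts:
        try:
            return send(host)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next domain
    raise ConnectionError(f"all hosts failed: {list(hosts)}") from last_error
```

A caller would pass a function that opens the connection and sends the request. If resolving or reaching the primary host raises, the same request is retried against the host on the second domain, so a problem at one DNS provider or one TLD does not take the client down.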
And now we can think about what else can fail…
Minimum versions of the API clients with support for multiple DNS providers: