Milliseconds and transparency matter at Algolia, both inside and outside our company. So we thought we’d share the details and our post mortem of the DNS DDoS incident that took place on Monday, May 16.
It all started when we received an alert from our monitoring system – the globally geo-routed endpoint to our API was not available. Our API clients are designed to target another endpoint if the geo-routed one is unavailable, but despite this fallback the problem still occurred. We immediately started investigating, because the issue could slow down search queries for our end-users, which we absolutely could not accept.
Identifying the Cause
Earlier that day, we had performed an hour-long maintenance procedure on our monitoring system and verified that it was working correctly on our performance dashboards. When the alert about our main endpoint being unavailable came, we saw a similar drop in the number of queries reaching our API and we first suspected a monitoring error. Some local tests showed that the endpoint was responding but was incredibly slow. Our worries were confirmed when we saw that the DNS resolution was actually consuming the majority of the query time. A few minutes later, we received an email from our primary DNS provider informing us of an incoming DDoS attack. The root cause had been identified – it was time to update our status page and prepare.
The first wave of the DDoS came around 14:00 UTC and lasted until 17:30 UTC. The second one came at 19:30 UTC and lasted until 20:30 UTC. At the peak of the attack we lost 25% of the traffic, as can be seen on the following graph.
Learning from our mistakes
This was not the first DDoS attack that either Algolia or our primary DNS provider has seen. In fact, we’d already put some measures in place for just such an eventuality. Last year, we updated our API clients with what we call the “DNS Fallback”. This allows our API to operate on two independent DNS networks and fall back from algolia.net to algolianet.com in case there is a problem.
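To illustrate the idea, here is a minimal sketch of such a host fallback in Python. It is not our actual client code: the function name `request_with_fallback`, the `send` callback and the `myapp` host prefix are hypothetical, and only the two domains come from the setup described above.

```python
from typing import Callable, Sequence

# Hypothetical host names: only the two domains (algolia.net and
# algolianet.com) come from the text, the "myapp" prefix is made up.
HOSTS = ("myapp.algolia.net", "myapp.algolianet.com")


def request_with_fallback(
    hosts: Sequence[str],
    send: Callable[[str, float], str],
    timeout_s: float = 2.0,
) -> str:
    """Try each host in order and return the first successful response.

    `send(host, timeout_s)` is an assumed callback that performs the request
    and raises OSError (TimeoutError is a subclass) on network or DNS failure.
    """
    last_error = None
    for host in hosts:
        try:
            return send(host, timeout_s)
        except OSError as exc:
            # DNS or network failure: remember it and move on to the next host.
            last_error = exc
    raise ConnectionError(f"all hosts failed: {list(hosts)}") from last_error


# Usage with a fake transport that simulates an unreachable primary domain.
attempts = []


def fake_send(host: str, timeout_s: float) -> str:
    attempts.append(host)
    if host.endswith(".algolia.net"):
        raise TimeoutError(f"DNS lookup for {host} timed out")
    return f"200 OK from {host}"


response = request_with_fallback(HOSTS, fake_send)
```

A fallback like this only helps if the per-host timeout really covers DNS resolution, which is exactly the part that turned out to be fragile in some clients.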
Our DNS providers also have DDoS mitigation solutions in place and plenty of capacity to handle attacks. When we tried to explain this new problem, we realised something was not working correctly in our DNS retry strategy. Despite our efforts, we noticed that 25% of our requests dropped. We immediately suspected two sources: usage of outdated API clients (without the DNS fallback) or buggy handling of DNS timeouts in some of them.
Even when DDoS mitigation is triggered quickly, it takes minutes to get rid of even the simplest attacks. This is long enough to affect our users’ search availability. That’s why we’re tuning the timeouts of all the requests in our API clients themselves in order to bring the impact close to zero.
The Good, The Bad and The Ugly
Although we had introduced the DNS fallback a year ago, we still see usage of very old versions of our API clients. During this year, we tried to eradicate the usage of our old clients by sending notices to the impacted users and introducing a warning in our dashboard. Unfortunately, we did not manage to remove all instances of old client usage. Our messages were probably missing something: we had not found a good enough incentive to get people to upgrade an API client that worked just fine, since there hadn’t been any outages. On the bright side, when most of the people using old clients (without fallback support) came to us, we asked them to upgrade their API clients, which resolved the issue instantly.
But we also discovered during this attack that we had trusted our fallback implementation a bit too much. We started to test every API client’s implementation by replicating the conditions of the DDoS attack. For these tests, we created a new testing domain, algolia.biz. Every request to this domain times out because its name servers do not respond.
We officially support 15 API clients and here is the overview of what did (or did not) work.
Ruby (also Rails), Python (also Django), Go, Swift, C#, and Android API clients passed the new test with flying colours.
Fortunately, this didn’t occur for every browser or OS. When a browser fails to resolve a hostname in time, it does not raise a timeout exception but instead reports an error on the underlying XMLHttpRequest object. Internally, our JavaScript client uses XMLHttpRequest errors to decide whether to switch to JSONP in case XMLHttpRequests are blocked. A recent fix to that JSONP fallback introduced a bug when facing a DNS resolution error.
We advised our clients to use the last confirmed working version if they were experiencing issues: version 3.13.1.
The good news is that we’ve now reworked the fallbacks and the latest client is working perfectly today.
Java and Scala clients using the standard Oracle JVM had unexpected results with our new algolia.biz testing domain. While the new test worked locally, it kept failing on Travis CI which we use for testing all of our libraries. After carefully tracing the application calls, we discovered 2 things:
- There is no way to set a timeout on DNS resolution, but there is a workaround: https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6450279. The JVM uses the timeout of the underlying OS. That explains the difference between our workstations and Travis CI.
- The JVM has a DNS resolution cache that is enabled by default (more information on networkaddress.cache.ttl).
Although the second one was not an issue during the DDoS, depending on the OS the first one could have been.
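The general shape of such a workaround is to run the blocking lookup in a separate thread and stop waiting after a deadline. Below is a minimal sketch of that idea, written in Python for illustration rather than in Java (Python's `socket.gethostbyname` likewise takes no timeout argument and relies on the OS resolver). The function name `resolve_with_timeout` and its `resolver` parameter are hypothetical and are not part of any of our clients.

```python
import socket
import threading


def resolve_with_timeout(host, timeout_s, resolver=socket.gethostbyname):
    """Resolve `host`, giving up after `timeout_s` seconds.

    The lookup runs in a daemon thread because the resolver call itself
    cannot be interrupted; on timeout we simply stop waiting for it.
    """
    result = {}

    def worker():
        try:
            result["address"] = resolver(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        # The lookup is still blocked in the OS resolver: report a timeout
        # so the caller can fall back to another host.
        raise TimeoutError(f"DNS resolution of {host} exceeded {timeout_s}s")
    if "error" in result:
        raise result["error"]
    return result["address"]
```

Note that the abandoned lookup is not cancelled. It keeps running in the background until the OS resolver gives up, so this approach bounds the latency seen by the caller, not the work done by the system.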
This implies that some work needs to be done for both the Java and the Scala API clients. For the Java client, we will do it in the upcoming v2. For the Scala client, we need to upgrade the underlying HTTP client, which will take some time as we need to change the underlying architecture of the client.
For PHP, the situation is even trickier. We are using the CURL library to perform all the requests. Unfortunately, the CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT options do not include DNS resolution time and PHP uses the timeout of the OS. Luckily, if you have the “ares” extension installed it sets a CURL_TIMEOUT_RESOLVE that handles DNS timeout.
When we implemented the DNS fallback strategy last year, we were confident it was the very last piece of code required to make our API highly available. Testing such a DNS fallback strategy is complex, and it turns out that not being able to perfectly reproduce all the conditions of the attack – be it the OS configuration or the odd behavior of an underlying HTTP library you don’t fully understand – was more of a handicap than we thought.
Today we have a dedicated domain name and robust tests to ensure that our fallback code is working in order to alleviate this problem in the future.
And finally, if you are an Algolia customer, please ensure that you are using the latest version of our API clients in order to avoid such impact in the future.