Our Post Mortem of the DNS DDoS which took place on Monday May 16th

Jun 1st 2016 product


Milliseconds and transparency matter at Algolia, both inside and outside our company. So we thought we’d share the details and our post mortem of the DNS DDoS incident which took place on Monday, May 16.

It all started when we received an alert from our monitoring system: the globally geo-routed endpoint to our API was not available. We had designed our API clients to target another endpoint if the geo-routed one is unavailable, yet despite this fallback plan the problem still occurred. We immediately started investigating, because the issue could slow down search queries for our end users, which we absolutely could not accept.

Identifying the Cause

Earlier that day, we had performed an hour-long maintenance procedure on our monitoring system and verified that it was working correctly on our performance dashboards. When the alert about our main endpoint being unavailable came, we saw a similar drop in the number of queries reaching our API and we first suspected a monitoring error. Some local tests showed that the endpoint was responding but was incredibly slow. Our worries were confirmed when we saw that the DNS resolution was actually consuming the majority of the query time. A few minutes later, we received an email from our primary DNS provider informing us of an incoming DDoS attack. The root cause had been identified – it was time to update our status page and prepare.

The first wave of the DDoS came around 14:00 UTC and lasted until 17:30 UTC. The second one came at 19:30 UTC and lasted until 20:30 UTC. At the peak of the attack we lost 25% of the traffic, as can be seen on the following graph.

[Figure: API requests per second during DDoS]

Learning from our mistakes

This was not the first DDoS attack that either Algolia or our primary DNS provider has seen. In fact, we’d already put some measures in place for just such an eventuality. Last year, we updated our API clients with what we call the “DNS Fallback”. This allows our API to operate on two independent DNS networks and fall back from algolia.net to algolianet.com in case there is a problem.
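The idea behind such a fallback can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any Algolia API client: the function `request_with_fallback`, the exception `AllHostsFailed` and the injected `send` callable are hypothetical names, and only the two domain names come from the post.

```python
# The two domains are taken from the post; everything else is illustrative.
FALLBACK_HOSTS = ("algolia.net", "algolianet.com")


class AllHostsFailed(Exception):
    """Raised when every host in the fallback list has failed."""


def request_with_fallback(send, hosts=FALLBACK_HOSTS, timeout=2.0):
    """Try each host in order and return the first successful response.

    `send(host, timeout)` is a hypothetical callable that performs the
    request and raises OSError (which covers DNS errors, timeouts and
    connection errors in Python) when the host cannot be reached.
    """
    errors = []
    for host in hosts:
        try:
            return send(host, timeout)
        except OSError as exc:
            # Remember why this host failed, then move on to the next one.
            errors.append((host, exc))
    raise AllHostsFailed(errors)
```

Because the transport is passed in as `send`, the loop can be exercised in tests with a fake that fails for the first domain, without touching a real network.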

Our DNS providers also have DDoS mitigation solutions in place and plenty of capacity to handle attacks. Yet despite our efforts, 25% of our requests were dropped, and in trying to explain why, we realised something was not working correctly in our DNS retry strategy. We immediately suspected two sources: usage of outdated API clients (without the DNS fallback) or buggy handling of DNS timeouts in some of them.

Even when DDoS mitigation is triggered quickly, it takes minutes to get rid of even the simplest attacks. This is long enough to affect our users’ search availability. That’s why we’re tuning the timeouts of all the requests in the API clients themselves in order to bring the impact close to zero.

The Good, The Bad and The Ugly

Although we had introduced the DNS fallback a year ago, we still see usage of very old versions of our API clients. During this year, we tried to eradicate the usage of our old clients by sending notices to the impacted users and introducing a warning in our dashboard. Unfortunately, we did not manage to remove all instances of old client usage. Our messages were probably missing something: since there hadn’t been any outages, we had not found a good enough incentive to get people to upgrade an API client that worked just fine. On the bright side, for most people using old clients (without fallback support) who came to us, upgrading their API client resolved the issue instantly.

But we also discovered during this attack that we had trusted our fallback implementation a bit too much. We started to test every API client’s implementation by replicating the conditions of the DDoS attack. For these tests, we created a new testing domain, algolia.biz. Every request to this domain times out because its name servers do not respond.
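A non-responding name server can also be imitated locally, which is handy when a dedicated test domain is not available. The Python sketch below is an assumption-laden illustration, not the setup used for algolia.biz: it binds a UDP socket that never answers (a "black hole") and checks that a hand-built DNS query sent to it times out. The helper names `build_dns_query`, `open_black_hole` and `query_times_out` are hypothetical.

```python
import socket
import struct


def build_dns_query(hostname, query_id=0x1234):
    """Encode a minimal DNS query for the A record of `hostname`."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def open_black_hole():
    """Bind a UDP socket on localhost that is never read from or answered."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    return sock


def query_times_out(server_address, hostname, timeout=0.2):
    """Return True if no answer arrives from `server_address` in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        client.sendto(build_dns_query(hostname), server_address)
        try:
            client.recvfrom(512)
        except socket.timeout:
            return True
    return False
```

Passing `open_black_hole().getsockname()` as `server_address` gives a target that silently swallows queries, which is the condition the fallback code has to survive.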

We officially support 15 API clients and here is the overview of what did (or did not) work.

The Good

Ruby (also Rails), Python (also Django), Go, Swift, C#, and Android API clients passed the new test with flying colours.

The Bad

The bad one turned out to be the JavaScript API client. In recent versions we had introduced a bug that seems to have disabled the DNS Fallback. That bug was triggered when the browser’s DNS timeout or the system’s DNS timeout fired before our internal API client timeout.

Fortunately, this didn’t occur for every browser or OS. When a browser fails to complete DNS resolution in time, it does not raise a timeout exception but rather an error on the underlying XMLHttpRequest object. Internally, we use XMLHttpRequest errors to decide whether to use JSONP in case XMLHttpRequests are blocked. A recent fix to that JSONP fallback introduced a bug when facing a DNS resolution error.

We advised our clients to use the last confirmed working version if they were experiencing issues: version 3.13.1.

The good news is that we’ve now reworked the fallbacks and the latest client is working perfectly today.

The Ugly

Java and Scala clients using the standard Oracle JVM had unexpected results with our new algolia.biz testing domain. While the new test worked locally, it kept failing on Travis CI which we use for testing all of our libraries. After carefully tracing the application calls, we discovered 2 things:

Although the second one was not an issue during the DDoS, depending on the OS the first one could have been.

This implies that some work needs to be done for both the Java and the Scala API clients. For the Java client, we will do it in the upcoming v2. For the Scala client, we need to upgrade the underlying HTTP client, which will take some time as we need to change the underlying architecture of the client.

For PHP, the situation is even trickier. We are using the CURL library to perform all the requests. Unfortunately, the CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT options do not include DNS resolution time, so PHP falls back on the timeout of the OS. Luckily, if you have the “ares” extension installed, it sets a CURL_TIMEOUT_RESOLVE that handles the DNS timeout.
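The underlying problem, a request timeout that does not cover name resolution, is not specific to PHP. As a rough illustration in Python (not in PHP, and not taken from any Algolia client), resolution can be given its own deadline by running it in a separate thread. The names `resolve_with_deadline` and `DNSTimeout` are hypothetical, and the `resolver` parameter exists only so the behaviour can be tested without a network.

```python
import socket
import threading


class DNSTimeout(Exception):
    """Raised when name resolution does not finish before the deadline."""


def resolve_with_deadline(host, port=443, timeout=2.0,
                          resolver=socket.getaddrinfo):
    """Resolve `host`, giving up after `timeout` seconds.

    socket.getaddrinfo has no timeout argument of its own, so the lookup
    runs in a daemon thread and the caller stops waiting at the deadline.
    """
    result = {}

    def worker():
        try:
            result["addresses"] = resolver(host, port)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        # The lookup is abandoned, not cancelled: the thread may keep
        # running in the background until the OS resolver gives up.
        raise DNSTimeout(f"resolving {host} took longer than {timeout}s")
    if "error" in result:
        raise result["error"]
    return result["addresses"]
```

A caller could catch `DNSTimeout` and move on to the next host in its fallback list instead of waiting for the OS-level timeout.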

In Closing…

When we implemented the DNS fallback strategy last year, we were confident it was the very last piece of code required for the high availability of our API. Testing such a DNS fallback strategy is complex, and it turns out that not being able to perfectly reproduce all the conditions of the attack (be it the OS configuration or the odd behavior of an underlying HTTP library you don’t fully understand) was more of a handicap than we thought.

Today we have a dedicated domain name and robust tests to ensure that our fallback code is working in order to alleviate this problem in the future.

And finally, if you are an Algolia customer, please ensure that you are using the latest version of our API clients in order to avoid such impact in the future.

About the author
Rémy-Christophe Schermesser

Staff Software Engineer
