Testing for Failure in a 99.999% Reliability World

Even in a perfect world where everyone does test-driven development (TDD), where everything is well planned and those plans succeed, things will fail. Bugs happen. Inevitably, some little thing gets forgotten. That little thing is what this post is about.

In our years of building Algolia, we’ve had our share of learning experiences in building reliable infrastructure, especially around hardware and networks. As we strive for a five-nines SLA, we need as many failovers as possible. Part of this failover is done in our API clients, where we implement automatic retries in case of TCP or DNS failures.

TCP is complex and can fail. The TCP stack of most programming languages is sound, but not all HTTP clients know how to handle failure correctly. That is why we need to thoroughly test how our API clients behave when the network fails.

If something fails or is not fast enough, we want to make sure that another method is used. So, the most important factors, in our view, are timeouts. We need to be sure they are handled correctly by the HTTP clients we are using.
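As a rough sketch of that failover idea (not Algolia's actual client code), the function below tries a list of hosts in order and moves on when one fails or times out. The `send` callable and the host names in the usage note are hypothetical placeholders.

```python
from typing import Callable, List, Sequence, Tuple


class AllHostsFailed(Exception):
    """Raised when every host in the list failed or timed out."""


def request_with_failover(hosts: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each host in order and return the first successful response.

    `send` is a hypothetical function that performs the request against one
    host and raises OSError when the host cannot be reached in time. In
    Python, socket timeouts and DNS errors are both subclasses of OSError.
    """
    errors: List[Tuple[str, OSError]] = []
    for host in hosts:
        try:
            return send(host)
        except OSError as exc:
            # Remember why this host failed, then fall through to the next one.
            errors.append((host, exc))
    raise AllHostsFailed(errors)
```

A caller would pass something like `["a.example", "b.example"]` together with a `send` function that enforces its own timeouts, so a slow host costs a bounded amount of time before the next one is tried.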

With our knowledge of TCP, we knew that the timeouts we wanted to enforce were:

  • Connection timeout: the time allowed to make the initial connection, i.e. to complete the TCP handshake
  • Read timeout: the time to wait when reading data, i.e. the maximum delay between two pieces of data sent by the server
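To make the distinction concrete, here is a minimal sketch using Python's standard `socket` module. The function name and its parameters are made up for illustration; real HTTP clients expose these two timeouts under different names.

```python
import socket


def fetch_first_bytes(host: str, port: int, payload: bytes,
                      connect_timeout: float, read_timeout: float) -> bytes:
    """Send `payload` and return the first chunk of the response."""
    # Connection timeout: bounds each TCP connection attempt (the handshake).
    # Note that create_connection() resolves the hostname first, and that
    # DNS lookup is NOT covered by this timeout.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: bounds each blocking call on the socket, i.e. how
        # long we wait for the next bytes, not the total response time.
        sock.settimeout(read_timeout)
        sock.sendall(payload)
        return sock.recv(4096)
    finally:
        sock.close()
```

With this split, a host that never completes the handshake fails after `connect_timeout`, while a host that connects and then goes silent fails after `read_timeout`.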

DNS is a bit more complex. Most of the time it runs over UDP, which is connectionless. We saw that it is handled very differently in each programming language, so we needed to make sure our API clients behaved the same way regardless of programming language. Hence, we wanted to enforce only one timeout: the time to resolve a hostname.
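Python is one example of this: `socket.getaddrinfo` takes no timeout argument. One possible way to enforce a single resolution timeout, sketched here under that assumption, is to run the lookup in a worker thread and stop waiting after a deadline. The names `resolve_with_timeout` and `resolver` are hypothetical.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small shared pool; lookups that time out keep occupying a worker until
# the operating system's resolver gives up on its own.
_resolver_pool = ThreadPoolExecutor(max_workers=4)


def resolve_with_timeout(hostname, port, timeout, resolver=socket.getaddrinfo):
    """Resolve `hostname`, raising TimeoutError after `timeout` seconds.

    `resolver` is injectable so the behaviour can be tested without a
    real DNS server; by default it is socket.getaddrinfo.
    """
    future = _resolver_pool.submit(resolver, hostname, port, type=socket.SOCK_STREAM)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise TimeoutError(
            f"DNS resolution of {hostname!r} exceeded {timeout}s"
        ) from None
```

The trade-off is that the abandoned lookup cannot be cancelled; the caller simply stops waiting for it.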

Simulating network errors

With what we wanted to test in mind, how could we simulate network errors easily and in a language-agnostic way?

First, the connection timeout. This one is quite easy, as you only need a host that resolves to an IP that doesn’t answer. Some IPv4 ranges are reserved, so you only need a host that resolves to an address in the private network range; we picked one at random.

Second, the read timeout. For this one, we need a host that resolves to an IP that accepts connections but never answers when we ask for data.
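The post uses a real host for this. As a local stand-in for illustration, the sketch below starts a server on the loopback interface that accepts connections and then stays silent, and checks that a client with a read timeout gives up. All function names are hypothetical.

```python
import socket
import threading


def start_silent_server():
    """Listen on a free local port, accept connections, never send data."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen()
    held = []  # keep accepted sockets referenced so they stay open

    def accept_forever():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:  # the listener was closed
                return
            held.append(conn)

    threading.Thread(target=accept_forever, daemon=True).start()
    return listener, listener.getsockname()[1]


def read_times_out(port: int, read_timeout: float) -> bool:
    """Return True if reading from the server hits the read timeout."""
    with socket.create_connection(("127.0.0.1", port), timeout=1.0) as sock:
        sock.settimeout(read_timeout)
        sock.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        try:
            sock.recv(1)
        except socket.timeout:
            return True
        return False
```

Calling `start_silent_server()` and then `read_times_out(port, 0.2)` exercises the same code path a client would hit against a host that accepts the connection but never responds.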

Third, the DNS timeout, which is a bit trickier than the first two tests. To test for this condition, we need a host whose DNS resolution times out. So we created a new domain whose DNS resolution is handled by a server that times out. Ring a bell? It’s the same as the connection timeout. The resolver of our domain is the same IP as the one used for the connection timeout.

With all of this we could test timeouts in every language possible.

Simulating user input

We operate a public-facing API, so anyone can send us a request. A small part of those requests are invalid:

  • Invalid JSON
  • Bad UTF-8 characters
  • SQL injection, remote code, or attacks of the same kind

For this, there were already a lot of resources on the Internet, so we used them.

For JSON, we use YAJL, so we were already fairly confident in our JSON handling. However, for various reasons we also tried developing our own JSON parser, and we wanted to make sure it handled both good and bad JSON correctly. We stumbled upon this article and this test suite, and used them to test both our JSON parser and YAJL. Funny thing: we discovered that YAJL accepts form feed (\f) as a valid whitespace character, whereas the JSON standard doesn’t.
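The JSON grammar allows only four whitespace characters: space, horizontal tab, line feed, and carriage return. The check below illustrates this with Python's standard `json` module, used here only as a stand-in for a strict parser (it is not YAJL); the `accepts` helper is a name invented for this example.

```python
import json


def accepts(text: str) -> bool:
    """Return True if the strict standard-library parser accepts `text`."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Line feed between tokens is legal JSON whitespace.
newline_ok = accepts("[1,\n2]")
# Form feed (\f, U+000C) is not in the JSON whitespace set.
form_feed_ok = accepts("[1,\f2]")

print("newline accepted:", newline_ok)
print("form feed accepted:", form_feed_ok)
```

Running the same two inputs through any parser under test is a quick way to see whether it is more lenient than the standard.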

UTF-8 is a complex encoding format, and it’s quite easy to generate a sequence of bytes that results in a bad UTF-8 character. For this, we aggregated multiple sources of bad sequences so we could use them.
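To give an idea of what such sequences look like, here are a few well-known classes of ill-formed UTF-8, checked with Python's strict decoder. These are illustrative examples chosen for this sketch, not the list we aggregated.

```python
# A handful of byte sequences that a strict UTF-8 decoder must reject.
INVALID_SEQUENCES = {
    "lone continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "overlong encoding of '/'": b"\xc0\xaf",
    "UTF-16 surrogate half": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 with strict error handling."""
    try:
        data.decode("utf-8")  # "strict" is the default error handler
        return True
    except UnicodeDecodeError:
        return False


for label, raw in INVALID_SEQUENCES.items():
    print(f"{label}: valid={is_valid_utf8(raw)}")
```

A test suite can feed each of these to the API and assert that the request is rejected cleanly rather than stored or crashing the server.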

Last, but not least, we evaluated naughty strings. These are strings that could trigger a security issue or flaw.

So with little effort, we managed to add quite a few tests to ensure that we handle corner cases correctly.

Discovering Failure

The previous failures were ones we knew about beforehand, so it was quite easy to know what to test and how to simulate the errors. But what happens when the unexpected happens?

Let’s take an example. We have an internal application that reads logs in the following format: “key1=value1;key2=value2”. This format is quite straightforward: key/value pairs separated by semicolons. So, there isn’t a ton of code needed to parse it. But this application is business critical and should handle incorrect logs properly, that is, without crashing.

To ensure it doesn’t crash, we can add basic unit tests for the corner cases we thought of, but there are probably many more that we didn’t think about.

One way to find them is property testing, a way of testing code in which you let the computer generate the test data. It comes from functional languages, where it works well because functions are pure and can be described entirely by their inputs and outputs.

Property testing works like this: you describe properties of your code, you describe how to generate the data, and then you let the property-testing framework generate the data and check whether the properties hold for it.

Let’s take a full example with our log parsing application. One property could be “I should not throw an exception if I receive an invalid log”.

So what is a log?

  • It’s a string
  • It’s a sequence of strings separated by semicolons
  • It’s a sequence of “key=value” pairs separated by semicolons
  • We could then generate the key/values we expect, and so on

Then we run the testing framework, which exercises our application with data constrained by our definition of a log. With this, we managed to find some corner cases in our log parsing. For example, one field expected an IP address, but we didn’t check that it was in the correct format.
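To show the shape of such a test without depending on a particular framework, here is a hand-rolled sketch using only Python's standard library. The `parse_log` function is a hypothetical stand-in for our parser, and the generator is deliberately simple; a real property-testing framework would also shrink failing inputs to a minimal counterexample, which this sketch does not do.

```python
import random
import string


def parse_log(line: str) -> dict:
    """Hypothetical parser: collect key=value fields, skip malformed ones."""
    fields = {}
    for part in line.split(";"):
        key, sep, value = part.partition("=")
        if sep and key:  # ignore fields with no '=' or an empty key
            fields[key] = value
    return fields


# Characters chosen to hit the separators as well as ordinary text.
ALPHABET = string.ascii_letters + string.digits + "=;. \t\u00e9"


def random_token(rng: random.Random, max_len: int = 8) -> str:
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_len)))


def random_log(rng: random.Random) -> str:
    """Generate a log-like line: mostly key=value fields, sometimes garbage."""
    parts = []
    for _ in range(rng.randint(0, 6)):
        if rng.random() < 0.7:
            parts.append(f"{random_token(rng)}={random_token(rng)}")
        else:
            parts.append(random_token(rng))
    return ";".join(parts)


def find_counterexample(runs: int = 1000, seed: int = 0):
    """Property: parse_log never raises and always returns a dict.

    Returns the first input that violates the property, or None.
    """
    rng = random.Random(seed)
    for _ in range(runs):
        line = random_log(rng)
        try:
            result = parse_log(line)
        except Exception:
            return line
        if not isinstance(result, dict):
            return line
    return None
```

A fixed seed keeps the run reproducible, so a failing input found in CI can be replayed locally.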

In Summary

Testing for failures is not a lot of work if you know which areas to look at. Once you know them, you can find good documentation of their corner cases. For everything else, with little effort, you can write property tests that exercise your software in new and unexpected ways.

About the author
Rémy-Christophe Schermesser

Staff Software Engineer
