Today (Jan 29) at 9:30pm UTC, our service experienced an 8-minute partial outage during which we rejected many write operations sent to the indexing API (exactly 2841 calls). We call it “partial” because all search queries were served without any problem. For end-users, there was no visible impact.
Transparency is in our DNA: this outage is visible on our status page (status.algolia.com), but we also wanted to share with you all the details of the outage and, more importantly, the details of our response.
This morning I fixed a rare bug in the indexing of complex hierarchical objects. The fix passed all the tests after development. We have 6000+ unit tests and asserts, and 200+ non-regression tests, so I felt confident when I entered the deploy password in our automatic deployment script.
A few seconds later, I started to receive a lot of text messages on my cellphone.
We developed several embedded probes to detect all kinds of problems and alert us using Twilio and Hipchat APIs. They detect for example:
In case embedded probes can’t run, other external probes run once a minute from an independent datacenter (Google App Engine). These also automatically update our status page when a problem impacts the quality of service.
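The external probe described above can be sketched roughly as follows. This is only an illustrative sketch, not our actual probe code: the `ExternalProbe` class, the `check` and `publish_status` callables, the status names, and the `failure_threshold` parameter are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExternalProbe:
    """Hypothetical external probe; none of these names come from a real API."""

    # Assumed health check: returns True when the service answers correctly.
    check: Callable[[], bool]
    # Assumed hook that pushes a new state to the status page.
    publish_status: Callable[[str], None]
    # Consecutive failures required before the status is downgraded.
    failure_threshold: int = 2
    _failures: int = 0
    _status: str = "operational"

    def tick(self) -> str:
        """Run one probe cycle; meant to be called once a minute by a scheduler."""
        try:
            ok = self.check()
        except Exception:
            # A probe that cannot reach the service counts as a failure.
            ok = False

        # Track consecutive failures so one blip does not flip the status.
        self._failures = 0 if ok else self._failures + 1
        new_status = (
            "degraded"
            if self._failures >= self.failure_threshold
            else "operational"
        )

        # Publish only on transitions to avoid rewriting the status page each minute.
        if new_status != self._status:
            self._status = new_status
            self.publish_status(new_status)
        return self._status
```

Requiring a couple of consecutive failures before downgrading the status is one possible design choice: it trades a minute or two of detection latency for fewer false alarms caused by a single network hiccup between datacenters.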
Our indexing processes were crash looping. I immediately decided to roll back to the previous version.
Until today, our standard rollback process was to revert the commit, launch a recompile, and finally deploy. This is long, very long when you know that you have an outage in production. The rollback took about 5 of the 8 minutes.
Even though the outage lasted a relatively short time, we still believe it was too long. To make sure this will not happen again:
Having all these probes in our infrastructure was key to detecting today’s problem and reacting quickly. Under real conditions, however, they proved not to be enough. Within a few hours, we implemented a much better way to handle this kind of situation. The quality of our service is our top priority. Thank you for your support!