Search by Algolia
5 considerations for Black Friday 2023 readiness
e-commerce

5 considerations for Black Friday 2023 readiness

It’s hard to imagine having to think about Black Friday less than 4 months out from the previous one ...

Piyush Patel

Chief Strategic Business Development Officer

How to increase your sales and ROI with optimized ecommerce merchandising
e-commerce

How to increase your sales and ROI with optimized ecommerce merchandising

What happens if an online shopper arrives on your ecommerce site and: Your navigation provides no obvious or helpful direction ...

Catherine Dee

Search and Discovery writer

Mobile search UX best practices, part 3: Optimizing display of search results
ux

Mobile search UX best practices, part 3: Optimizing display of search results

In part 1 of this blog-post series, we looked at app interface design obstacles in the mobile search experience ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Mobile search UX best practices, part 2: Streamlining search functionality
ux

Mobile search UX best practices, part 2: Streamlining search functionality

In part 1 of this series on mobile UX design, we talked about how designing a successful search user experience ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Mobile search UX best practices, part 1: Understanding the challenges
ux

Mobile search UX best practices, part 1: Understanding the challenges

Welcome to our three-part series on creating winning search UX design for your mobile app! This post identifies developer ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

Teaching English with Zapier and Algolia
engineering

Teaching English with Zapier and Algolia

National No Code Day falls on March 11th in the United States to encourage more people to build things online ...

Alita Leite da Silva

How AI search enables ecommerce companies to boost revenue and cut costs
ai

How AI search enables ecommerce companies to boost revenue and cut costs

Consulting powerhouse McKinsey is bullish on AI. Their forecasting estimates that AI could add around 16 percent to global GDP ...

Michelle Adams

Chief Revenue Officer at Algolia

What is digital product merchandising?
e-commerce

What is digital product merchandising?

How do you sell a product when your customers can’t assess it in person: pick it up, feel what ...

Catherine Dee

Search and Discovery writer

Scaling marketplace search with AI
ai

Scaling marketplace search with AI

It is clear that for online businesses and especially for Marketplaces, content discovery can be especially challenging due to the ...

Bharat Guruprakash

Chief Product Officer

The changing face of digital merchandising
e-commerce

The changing face of digital merchandising

This 2-part feature dives into the transformational journey made by digital merchandising to drive positive ecommerce experiences. Part 1 ...

Reshma Iyer

Director of Product Marketing, Ecommerce

What’s a convolutional neural network and how is it used for image recognition in search?
ai

What’s a convolutional neural network and how is it used for image recognition in search?

A social media user is shown snapshots of people he may know based on face-recognition technology and asked if ...

Catherine Dee

Search and Discovery writer

What’s organizational knowledge and how can you make it accessible to the right people?
product

What’s organizational knowledge and how can you make it accessible to the right people?

How’s your company’s organizational knowledge holding up? In other words, if an employee were to leave, would they ...

Catherine Dee

Search and Discovery writer

Adding trending recommendations to your existing e-commerce store
engineering

Adding trending recommendations to your existing e-commerce store

Recommendations can make or break an online shopping experience. In a world full of endless choices and infinite scrolling, recommendations ...

Ashley Huynh

Ecommerce trends for 2023: Personalization
e-commerce

Ecommerce trends for 2023: Personalization

Algolia sponsored the 2023 Ecommerce Site Search Trends report which was produced and written by Coleman Parkes Research. The report ...

Piyush Patel

Chief Strategic Business Development Officer

10 ways to know it’s fake AI search
ai

10 ways to know it’s fake AI search

You think your search engine really is powered by AI? Well maybe it is… or maybe not.  Here’s a ...

Michelle Adams

Chief Revenue Officer at Algolia

Cosine similarity: what is it and how does it enable effective (and profitable) recommendations?
ai

Cosine similarity: what is it and how does it enable effective (and profitable) recommendations?

You looked at this scarf twice; need matching mittens? How about an expensive down vest? You watched this goofy flick ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

What is cognitive search, and what could it mean for your business?
ai

What is cognitive search, and what could it mean for your business?

“I can’t find it.”  Sadly, this conclusion is often still part of the modern enterprise search experience. But ...

Vincent Caruana

Sr. SEO Web Digital Marketing Manager

How neural hashing can unleash the full potential of AI retrieval
ai

How neural hashing can unleash the full potential of AI retrieval

Search can feel both simple and complicated at the same time. Searching on Google is simple, and the results are ...

Bharat Guruprakash

Chief Product Officer

Looking for something?

Postmortem of today’s 8min indexing downtime
facebookfacebooklinkedinlinkedintwittertwittermailmail

Today (Jan 29) at 9:30pm UTC, our service experienced an 8 minute partial outage during which we have rejected many write operations sent to the indexing API (exactly 2841 calls). We call it “partial” as all search queries have been honored without any problem. For end-users, there was no visible problem.

Transparency is in our DNA: this outage is visible on our status page (status.algolia.com) but we also wanted to share with you all the details of the outage and more importantly the details of our response.

The alert

This morning I fixed a rare bug in indexing complex hierarchical objects. This fix successfully passed all the tests after development. We have 6000+ unit tests and asserts, and 200+ non regression tests. So I felt confident when I entered the deploy password in our automatic deployment script.

A few seconds after, I started to receive a lot of text messages on my cellphone.

We developed several embedded probes to detect all kinds of problems and alert us using Twilio and Hipchat APIs. They detect for example:

  • a process that restart
  • an unusually long query
  • a write failure
  • a low memory warning
  • a low disk-free warning
  • etc.

In case embedded probes can’t run, other external probes run once a minute from an independent datacenter (Google App Engine). These also automatically update our status page when a problem impacts the quality of service.

Our indexing processes were crash looping. I immediately decided to rollback to the previous version.

The rollback

Until today, our standard rollback process was to revert the commit, launch the recompile and finally deploy. This is long, very long when your know that you have an outage in production. The rollback took about 5 minutes in total out of the 8 minutes.

How we will avoid this situation in the future

Even if the outage was on a relatively small period of time, we still believe it was too long. To make sure this will not happen again:

  • We have added a very fast rollback process in the way of a simple press button like the one we use to deploy. An automatic deploy is nice, but an automatic rollback is actually more critical when needed!
  • Starting now, we will deploy new versions of the service on clusters hosting community projects such as Hacker News Search or Twitter handle search, before pushing the update on clusters hosting paying customers. Having real traffic is key to detect some types of errors. Unit-tests & non-regression tests cannot catch everything.
  • And of course we added non-regression tests for this specific error.

Conclusion

Having all these probes in our infrastructure was key to detect today’s problem and react quickly. In real conditions, it proved not to be enough. In a few hours we have implemented a much better way to handle this kind of situation. The quality of our service is our top priority. Thank you for your support!

About the author
Julien Lemoine

Co-founder & former CTO at Algolia

githublinkedintwitter

Algolia infrastructure

More info
Algolia infrastructure

Recommended Articles

Powered byAlgolia Algolia Recommend

Resilience testing in production: test as you deploy
engineering

Xavier Grand

Engineering Manager

Algolia's Checklist for Selecting a Critical SaaS Service
engineering

Julien Lemoine

Co-founder & CTO at Algolia

Inside the Algolia Engine Part 1 — Indexing vs. Search
engineering

Julien Lemoine

Co-founder & former CTO at Algolia