Busting the myth of 100% uptime and designing our own SLA

Infrastructure is hard. Modern businesses rely on payment processing, paycheck generation, marketing analytics, and other SaaS tools, and they trust those tools to be dependable enough not to fail when they’re needed most. For example, Amazon’s search went down for a few hours recently, and even the most conservative estimates say it cost them tens of millions of dollars in revenue.

How can companies guarantee that their most-needed infrastructure stays up, or at least get insurance for when it inevitably goes down? That’s the idea behind a Service Level Agreement (SLA), an insurance policy that kicks in when the service you paid for doesn’t operate as advertised.

In general, the more critical a service is to a company’s operations, the stronger the SLA (the provider’s promise of continual operation) needs to be to address customers’ concerns.

As a SaaS provider and a customer of several external services for both our search platform and our operations, we’ve seen more than our fair share of SLAs. As we evolved and improved our own SLA over the past few years, providing an increasingly strong and transparent promise to our customers, we began going through other companies’ SLAs with a fine-toothed comb. As it turns out, they’re not all created equal.

Busting the myth of the 100% SLA

In recent years it’s become fashionable for companies to include 100% uptime guarantees in their SLA – and, in some cases, even more than 100%, despite its mathematical impossibility.

Now, don’t get us wrong: all service providers have an obligation to put 100% of their effort into keeping their service running like a well-oiled machine. However, detecting an outage can sometimes be impossible… until it’s too late, of course. Being a SaaS provider on the internet implies dozens of dependencies on intermediary devices and networks, which themselves have downtime. When you promise 100% uptime, every millisecond of downtime counts, but what if you can’t detect the outage? How can you tell whether the issue comes from your own connection dropping data, from the service provider, or from any of the dozens of intermediaries in between? To resolve this, SLAs define a minimum outage duration that must be reached before they are triggered. The market standard is typically 1 minute; however, 1 minute of downtime in a 30-day month means 99.9977% uptime, so what exactly is 100% uptime then?

One SaaS provider on the market today promises 100% uptime, yet their SLA only promises refunds or credits after 0.05% downtime per month, which is a little over 20 minutes! Their service could go down for 19 minutes, take down your site, and cause you to lose revenue, and they wouldn’t be responsible for compensating you for any of it.
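To make the arithmetic behind these figures concrete, here is a minimal Python sketch. It assumes a 30-day month (43,200 minutes), which is the convention that reproduces the numbers quoted above; the function names are illustrative only.

```python
# Assumption: a "month" is 30 days, i.e. 43,200 minutes.
MINUTES_PER_MONTH = 30 * 24 * 60


def uptime_percent(downtime_minutes: float) -> float:
    """Uptime percentage for a month with the given minutes of downtime."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_MONTH)


def downtime_budget_minutes(allowed_downtime_percent: float) -> float:
    """Minutes of downtime per month allowed by a given downtime percentage."""
    return MINUTES_PER_MONTH * allowed_downtime_percent / 100.0


print(round(uptime_percent(1), 4))              # 99.9977
print(round(downtime_budget_minutes(0.05), 1))  # 21.6
```

In other words, a 1-minute detection threshold already rules out a literal 100%, and a 0.05% threshold leaves roughly 21.6 minutes of uncompensated downtime each month.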

We knew we could do better than this.

How we did better

When we set out to design our SLA, we had three goals:

  1. Make it simple – it needs to be understood by our users, and it’d hardly be fair to expect them to take in something worded like a legal document.
  2. Make it transparent – no one wants surprises, especially in the already stressful situation of their services not working.
  3. Trust our platform – we trust the system we built, and we want an SLA that speaks to that trust.

At Algolia, we currently have two different setups for our customers:

  • Enterprise: we replicate your search on at least three different machines hosted by two different providers in different data centers and autonomous systems.
  • Premium: we replicate your search on at least three different machines hosted by three different providers in three different data centers with three autonomous systems, using at least two different Tier 1 upstream providers.

 

These setups differ not just on paper but also in their infrastructure, and they come with two different SLAs:

  • Enterprise: 99.99% uptime. Each minute of downtime makes you eligible for a refund of 100 minutes of service, up to a cumulative value of 100% of the monthly service billing.
  • Premium: 99.999% uptime. Each minute of downtime makes you eligible for a refund of 1,000 minutes of service, up to a cumulative value of 600% of the monthly service billing over a year.

Our outage detection starts at 30 seconds (0.001% of a month) instead of 1 minute. This is so granular that it can’t be measured with traditional monitoring architecture, so we built our own monitoring network that continuously monitors our API infrastructure, which gives us a fairly unique ability to detect downtime this fast.
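As an illustration of what a minimum outage duration means in practice, the sketch below filters a series of probe results down to the outages that reach the 30-second threshold. This is not our monitoring network: the probe format (a timestamp in seconds plus a success flag), the 10-second probe interval in the example, and the function name are all hypothetical.

```python
from typing import Iterable, List, Tuple

MIN_OUTAGE_SECONDS = 30  # detection threshold stated in the SLA


def countable_outages(
    probes: Iterable[Tuple[float, bool]],
    min_seconds: float = MIN_OUTAGE_SECONDS,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans of failed probes lasting at least min_seconds.

    probes: (timestamp_seconds, ok) pairs sorted by timestamp (hypothetical format).
    An outage runs from the first failed probe to the next successful one.
    An outage still open at the end of the data is not reported.
    """
    outages = []
    start = None
    for ts, ok in probes:
        if not ok and start is None:
            start = ts  # first failed probe opens an outage
        elif ok and start is not None:
            if ts - start >= min_seconds:
                outages.append((start, ts))
            start = None  # a successful probe closes the outage
    return outages


# Hypothetical probe results: a 30-second outage, then a 10-second blip.
probes = [
    (0, True), (10, True), (20, False), (30, False), (40, False),
    (50, True), (100, False), (110, True),
]
print(countable_outages(probes))  # [(20, 50)]
```

The shorter the threshold, the more frequent and more distributed the probes have to be, which is why a 30-second threshold is hard to support with conventional once-a-minute checks.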

Here’s what our refund policy looks like in practice:

 

Search downtime    Total refund of the monthly service bill
                   Enterprise SLA    Premium SLA
30 seconds         0%                1%
1 minute           0%                2.3%
5 minutes          1%                11.6%
30 minutes         7%                70%
45 minutes         10%               100%
1 hour             13.8%             138%
2 hours            27.7%             277%
4 hours            55.5%             555%
8 hours            100%              600%

As you can see, with our Premium SLA, if our service is down 45 minutes, we refund you 100% of your monthly bill – it doesn’t get much simpler than that.
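The refund figures follow from the multipliers and caps in the two SLAs. Here is a minimal Python sketch of that calculation, assuming a 30-day month (43,200 minutes) and applying each cap to a single outage for simplicity (the Premium cap is actually cumulative over a year). The names are illustrative, and the published table rounds or truncates some values, so the output can differ from it by a few tenths of a percent or, around the 45-minute mark, a few percent.

```python
# Assumption: a "month" is 30 days, i.e. 43,200 minutes.
MINUTES_PER_MONTH = 30 * 24 * 60

# plan -> (refund minutes per minute of downtime, cap as % of the monthly bill)
# Simplification: the Premium cap is applied per outage here, not per year.
PLANS = {
    "enterprise": (100, 100.0),
    "premium": (1000, 600.0),
}


def refund_percent(plan: str, downtime_minutes: float) -> float:
    """Refund as a percentage of the monthly bill for a single outage."""
    multiplier, cap = PLANS[plan]
    raw = 100.0 * downtime_minutes * multiplier / MINUTES_PER_MONTH
    return min(raw, cap)


print(round(refund_percent("enterprise", 60), 1))  # 13.9
print(round(refund_percent("premium", 5), 1))      # 11.6
print(round(refund_percent("premium", 45), 1))     # 104.2
print(refund_percent("enterprise", 480))           # 100.0
print(refund_percent("premium", 480))              # 600.0
```

With the 1,000x multiplier, the Premium refund reaches a full monthly bill after roughly 43 minutes of downtime.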

Is an SLA just SaaS insurance?

Most people don’t really see an SLA as much more than a form of SaaS insurance. At Algolia, we see it as something much greater: a way to remind our customers of our reliability. We back our Premium SLA with our reinforced infrastructure, and our goal is to provide the best service in the market. We don’t want downtime any more than you do, and we put our money where our mouth is. We incentivize ourselves to do everything possible to ensure that the probability of an outage is as close to zero as possible!

It has been a year since we introduced our three-provider setup, and, with it, we’ve been able to placate the worries of even the most cautious of customers. Our setup has been extensively tested with outages of entire data centers and networks, and we’ve still been able to maintain 100% uptime.

To the best of our knowledge, our Premium SLA is unique in the market in terms of simplicity, transparency, and refund guarantee. We’d love to tell you more about it if you have any questions, or if you would like to see how your current SLA stands up against ours!

About the author

Adam Surak

Director of Infrastructure & Security @ Algolia
