Infrastructure is hard. Modern businesses rely on payment processing, paycheck generation, marketing analytics and other SaaS tools, and they trust those tools not to fail when they're needed most. For example, Amazon's search went down for a few hours recently, and even the most conservative estimates say it cost them tens of millions of dollars in revenue.
How can companies guarantee that their most-needed infrastructure stays up, or at least get insurance for when it inevitably goes down? That's the idea behind a Service Level Agreement (SLA), an insurance policy that kicks in when the service you paid for doesn't operate as advertised.
In general, the more critical a service is to a company’s operations, the stronger the SLA, or promise of continual operation from the provider, will need to be to satisfy the customers’ worries.
As a SaaS player and a customer of several external services for both our search platform and our operations, we've seen more than our fair share of SLAs. As we evolved and improved our own SLA over the past few years, providing an increasingly strong and transparent promise to our customers, we began going through other companies' SLAs with a fine-toothed comb. As it turns out, they're not all created equal.
In recent years it's become fashionable for companies to include 100% uptime guarantees in their SLA and, in some cases, to promise even more than 100%, which is mathematically impossible.
Now, don't get us wrong: all service providers have an obligation to put 100% of their effort into keeping their service running like a well-oiled machine. However, detecting an outage can sometimes be impossible… until it's too late, of course. Being a SaaS provider on the internet implies dozens of dependencies on intermediary devices and networks, which themselves have downtime. When you promise 100% uptime, every millisecond of downtime counts, but what if you can't detect the outage? How can one tell whether the issue comes from your own connection dropping data, from the service provider, or from any of the dozens of intermediaries in between? To resolve this issue, SLAs define a minimum outage duration below which they are not triggered. The market standard is typically 1 minute; however, 1 minute of downtime per month means 99.9977% uptime, so what exactly is 100% uptime then?
One SaaS provider on the market today promises 100% uptime, yet their SLA only promises any refunds or credits after 0.05% downtime per month, which is a little over 20 minutes! Their service could go down for 19 minutes, which could take down your site and cause you to lose revenue, but they wouldn’t be responsible for compensating you for any of that.
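The percentages above are easy to check by hand. Here is a minimal sketch of the arithmetic, assuming a 30-day month (an assumption on our side that matches the figures quoted; the function names are purely illustrative):

```python
# Assumption: one "month" is 30 days, which matches the figures quoted above.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def uptime_percent(downtime_seconds: float) -> float:
    """Monthly uptime, in percent, for a given amount of downtime."""
    return 100.0 * (1.0 - downtime_seconds / SECONDS_PER_MONTH)


def downtime_minutes(downtime_percent: float) -> float:
    """Minutes of downtime per month that a downtime percentage allows."""
    return downtime_percent / 100.0 * SECONDS_PER_MONTH / 60.0


# A 1-minute minimum outage leaves room for one undetected minute per month.
print(f"{uptime_percent(60):.4f}%")             # 99.9977%

# A refund threshold of 0.05% downtime per month.
print(f"{downtime_minutes(0.05):.1f} minutes")  # 21.6 minutes
```

In other words, an SLA that only counts outages of one minute or more cannot promise anything tighter than roughly 99.9977% per month, and a 0.05% threshold leaves about 21.6 minutes uncovered.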
We knew we could do better than this.
When we set out to design our SLA, we had three goals:
At Algolia, we currently have two different setups for our customers:
These setups are not different just on paper but they’re also different in terms of infrastructure and come with two different SLAs:
Our outage detection starts at 30 seconds (about 0.001% of a month) instead of 1 minute. This is so granular that it can't be measured with traditional monitoring architecture, so we built our own monitoring network that continuously checks our API infrastructure, which gives us a fairly unique ability to detect downtime this fast.
Here’s what our refund policy looks like in practice:
Search downtime | Enterprise SLA: refund (% of monthly service bill) | Premium SLA: refund (% of monthly service bill) |
30 seconds | 0% | 1% |
1 minute | 0% | 2.3% |
5 minutes | 1% | 11.6% |
30 minutes | 7% | 70% |
45 minutes | 10% | 100% |
1 hour | 13.8% | 138% |
2 hours | 27.7% | 277% |
4 hours | 55.5% | 555% |
8 hours | 100% | 600% |
As you can see, with our Premium SLA, if our service is down 45 minutes, we refund you 100% of your monthly bill – it doesn’t get much simpler than that.
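To turn the table into an amount, here is a small sketch that looks up the refund for a given downtime and monthly bill. The rows are copied from the table; everything else is an assumption made for illustration. In particular, the table only lists sample durations, so the sketch assumes a downtime between two rows earns the refund of the largest listed duration it reaches, which the policy text above does not state. The names `REFUND_TABLE`, `refund_percent` and `refund_amount` are hypothetical.

```python
from bisect import bisect_right

# (downtime in seconds, Enterprise refund %, Premium refund %), copied from the table.
REFUND_TABLE = [
    (30, 0.0, 1.0),
    (60, 0.0, 2.3),
    (5 * 60, 1.0, 11.6),
    (30 * 60, 7.0, 70.0),
    (45 * 60, 10.0, 100.0),
    (1 * 3600, 13.8, 138.0),
    (2 * 3600, 27.7, 277.0),
    (4 * 3600, 55.5, 555.0),
    (8 * 3600, 100.0, 600.0),
]
THRESHOLDS = [row[0] for row in REFUND_TABLE]


def refund_percent(downtime_seconds: float, plan: str) -> float:
    """Refund as a percentage of the monthly bill.

    Assumption (not from the SLA text): use the largest listed duration
    that the downtime has reached; below 30 seconds nothing is refunded.
    """
    column = {"enterprise": 1, "premium": 2}[plan]
    index = bisect_right(THRESHOLDS, downtime_seconds) - 1
    if index < 0:
        return 0.0
    return REFUND_TABLE[index][column]


def refund_amount(monthly_bill: float, downtime_seconds: float, plan: str) -> float:
    """Refund in the same currency as the monthly bill."""
    return monthly_bill * refund_percent(downtime_seconds, plan) / 100.0


# 45 minutes of search downtime on a 1,000 monthly bill.
print(refund_amount(1000.0, 45 * 60, "premium"))     # 1000.0
print(refund_amount(1000.0, 45 * 60, "enterprise"))  # 100.0
```

Note that past 45 minutes the Premium column exceeds 100%, so the refund in this sketch is larger than the monthly bill itself.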
Most people don't see an SLA as much more than a form of SaaS insurance. At Algolia, we see it as something much greater: a way to remind our customers of our reliability. We back our Premium SLA with our reinforced infrastructure, and our goal is to provide the best service in the market. We don't want downtime any more than you do, and we put our money where our mouth is. We incentivize ourselves to do everything possible to ensure that the probability of an outage is as close to zero as possible!
It has been a year since we introduced our three-provider setup, and with it we've been able to placate the worries of even the most cautious of customers. Our setup has been extensively tested by outages of entire datacenters and networks, and we've still been able to maintain 100% uptime.
To the best of our knowledge, our Premium SLA is unique in the market in terms of simplicity, transparency and refund guarantee. We'd love to tell you more about it if you have any questions or would like to see how your current SLA stands up against ours!
Adam Surak
Director of Infrastructure & Security @ Algolia