Summary and key takeaways
On May 30, 2020, 10:48 UTC, we experienced a rare situation in the Public Key Infrastructure of the Internet when two of the root certification authorities expired, one cross-signed by the other.
In theory, this isn’t anything unusual. Certificates expire all the time (normally in a year or even 90 days if you use Let’s Encrypt), and certification authorities expire once in many years. But on May 30, the expiring certification authority exposed an underlying issue which made our API unavailable for some of our customers.
On May 30, a minute after 10:48 UTC when the certification authorities expired, our Site Reliability Engineering (SRE) team got notified that there was a certificate problem with our service. This was an unexpected message because we carefully check that our certificates are valid and don’t expire any time soon. A quick verification in the browser also confirmed that everything seemed normal and that the API was responding correctly. However, a second alert from a different service notified the SRE team, claiming again that the certificate expired. Another verification in the browser again confirmed that the service was working and responding. For some reason, the Pingdom monitoring service saw the certificate as expired, but everything was working for us in the browsers.
Looking at the traffic graphs on our infrastructure, we noticed a decrease in the number of API calls, but the volume was nowhere close to zero. A complete outage is often easier to diagnose than a partial one, where some things work and some things don’t.
The direction of the investigation quickly changed when we tested the impacted services from the command line of our macOS laptops. Suddenly, a simple curl request to the domain failed, reporting that the certificate was not valid. We were on to something, and the Qualys SSL Labs scanner showed many interesting things:
- Our certificate was valid.
- The certificate of our intermediate certification authority Sectigo RSA Organization Validation Secure Server CA was valid.
- In Path 1, the root certificate USERTrust RSA Certification Authority was valid, and the full chain was valid.
- In Path 2, the certificate of our root certification authority USERTrust RSA Certification Authority expired on May 30, 10:48 UTC.
- The certificate of root certification authority AddTrust External CA Root cross-signing the certificate of USERTrust RSA expired on May 30, 10:48 UTC.
This was an interesting situation. There was a valid path to the USERTrust RSA Certification Authority, and there was also an expired path. The browser was able to find the valid chain, but curl was not. The Qualys SSL Labs test reported that the certificate configuration was fine.
We updated the status page with the information we had, dedicated people to respond to emails coming to our support mailbox and started to work on mitigations of what we thought was the source of the problem.
In our testing environment, we verified that removing the expired certificates from the certificate chain served by our servers was a good approach. The servers were still available from the browser and now even curl in the command line was able to verify the certificate chain and let the requests through. This was a promising solution worth deploying to production and so we went ahead and started deploying.
Once the change hit the first production servers we saw the traffic levels recover to what we would expect at that time of the day. With every additional server getting the now shorter version of the certificate, the situation was getting better and better until the moment when we thought everything was done and we could finish the incident. However, there was still a small group of customers telling us they were unable to verify the certificate and connect to the API.
The test now indicated that there was one valid trusted path, as well as one new trusted path that was not visible before, this time ending not at AddTrust External CA Root but at AAA Certificate Services. Where did this new path come from? As it turned out, the USERTrust RSA Certification Authority certificate exists in 3 versions:
- self-signed root certification authority version created in 2010
- cross-signed by AddTrust External CA root version created in 2000
- cross-signed by AAA Certificate Services version created in 2019
The situation at this time was that the vast majority of our customers were now recognizing the self-signed root version from their certificate stores. We could no longer use the AddTrust cross-signed version, which left only the AAA Certificate Services version. It was unclear why the self-signed version wasn’t recognized everywhere, since it had existed in the commonly available certificate stores for some time. A quick look at the AAA Certificate Services certificate showed it was created in 2004, so there was a strong likelihood that it was in more certificate stores and had been in place for a bit longer.
Our SRE team started to deploy a second change to our public certificate to restore the service for the remaining customers, and shortly after deployment confirmed with them that they could reach the service again. The incident was now finally over and the service was restored for all customers. But why did some of our customers lose access in the first place when even the Qualys scanner said the certificate was valid?
The certificate was indeed valid and the whole certificate chain was generated somewhat correctly by the Comodo/Sectigo certification authority (expiring any certificate during the weekend is not a great practice, and neither is expiring a certification authority before its leaf certificates, but technically it’s not incorrect). What was not envisioned was how client libraries would handle this situation, primarily OpenSSL, which powers the vast majority of HTTPS on Earth. During the analysis of the incident, we landed on an interesting OpenSSL bug from 2014 which says:
Don’t use expired certificates if possible.
When looking for the issuer of a certificate, if current candidate is expired, continue looking. Only return an expired certificate if no valid certificates are found.
This means that before this change was implemented, whenever OpenSSL encountered an expired certificate in the chain, it declared the certificate invalid and refused the connection. After the change, OpenSSL skips the expired certificate and continues looking for additional certificates that can prove the certificate is valid. This tiny change adapts to the nature of the certificate chain: a single certification authority can be signed by multiple certification authorities, some of them still valid and some of them not anymore. Everything looks great, so why the impact then?
Digging further, this change to OpenSSL is only part of OpenSSL 1.1.1 and is not part of OpenSSL 1.0.x. This means, for example, that Ubuntu 16.04 and older, Debian 9 and older, CentOS 7 and older, and RedHat 7 and older are impacted, even though all of these are still generally supported versions, at least from the security point of view. On our macOS laptops, we use LibreSSL 2.x, and the LibreSSL 3.2.0 release notes contain a similar entry:
* Use non-expired certificates first when building a certificate chain.
We’ll now have to wait for new LibreSSL to appear in new versions of macOS.
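The difference between the two issuer-selection behaviours can be illustrated with a small model. This is a simplified sketch, not OpenSSL's or LibreSSL's actual code: the `Cert` class and both functions are hypothetical, the expiry of the self-signed root is an illustrative far-future date, and only the AddTrust cross-sign expiry time comes from the incident itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cert:
    """Toy certificate: just the fields needed to pick an issuer."""
    subject: str
    issuer: str
    not_after: datetime


def find_issuer_old(cert: Cert, candidates: Sequence[Cert], now: datetime) -> Optional[Cert]:
    """Pre-fix model: take the first candidate whose subject matches, expired or not."""
    for candidate in candidates:
        if candidate.subject == cert.issuer:
            return candidate
    return None


def find_issuer_new(cert: Cert, candidates: Sequence[Cert], now: datetime) -> Optional[Cert]:
    """Post-fix model: keep looking past expired matches; return one only as a last resort."""
    expired_match = None
    for candidate in candidates:
        if candidate.subject != cert.issuer:
            continue
        if now <= candidate.not_after:
            return candidate          # first non-expired match wins
        if expired_match is None:
            expired_match = candidate  # remember it in case nothing better exists
    return expired_match


USERTRUST = "USERTrust RSA Certification Authority"

# Cross-signed version: expiry time taken from the incident description.
cross_signed = Cert(USERTRUST, "AddTrust External CA Root",
                    datetime(2020, 5, 30, 10, 48, tzinfo=timezone.utc))
# Self-signed version: illustrative far-future expiry (assumption).
self_signed = Cert(USERTRUST, USERTRUST,
                   datetime(2038, 1, 1, tzinfo=timezone.utc))
# Intermediate: illustrative expiry (assumption).
intermediate = Cert("Sectigo RSA Organization Validation Secure Server CA", USERTRUST,
                    datetime(2030, 1, 1, tzinfo=timezone.utc))

now = datetime(2020, 5, 30, 10, 49, tzinfo=timezone.utc)
candidates = [cross_signed, self_signed]  # the expired version happens to come first

print("old picks issuer signed by:", find_issuer_old(intermediate, candidates, now).issuer)
print("new picks issuer signed by:", find_issuer_new(intermediate, candidates, now).issuer)
```

In this model the old lookup lands on the expired cross-signed version and the chain is then rejected, while the new lookup skips it and reaches the self-signed root. The order of `candidates` is an assumption chosen to reproduce the failure.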
But why did the browsers verify the certificates correctly? Browsers ship with their own SSL/TLS libraries and their own ways of verifying PKI. Chrome ships with BoringSSL and Firefox with NSS, independent of the SSL/TLS libraries of the underlying operating system. These libraries don’t have the same bug and are updated much more often.
And here we are, finally having the full picture of what had happened and why only back end implementations of some customers were impacted and why there were some systems with outdated certificate stores that needed a much older root certificate.
What is next? We’re reaching out to impacted customers and explaining what happened, and we’re improving our certificate checking tool to verify expiration in the full chain, not just our leaf certificate. Last but not least, we’ll include OpenSSL in our annual Open Source donations and will financially support the work of the OpenSSL team.
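A full-chain expiration check along these lines could look roughly like the sketch below. It is not our actual checking tool: the function name and the 30-day threshold are hypothetical, the chain is given as already-extracted `(name, not_after)` pairs rather than parsed from real certificates, and the leaf and intermediate expiry dates are illustrative.

```python
from datetime import datetime, timedelta, timezone


def expiring_certificates(chain, now, warn_within=timedelta(days=30)):
    """Return (name, not_after) for every chain entry that is expired or expires soon."""
    deadline = now + warn_within
    return [(name, not_after) for name, not_after in chain if not_after <= deadline]


# Every certificate that can appear on a validation path, not just the leaf.
chain = [
    # Hypothetical leaf expiry (assumption).
    ("leaf certificate", datetime(2021, 1, 1, tzinfo=timezone.utc)),
    # Illustrative intermediate expiry (assumption).
    ("Sectigo RSA Organization Validation Secure Server CA",
     datetime(2030, 12, 31, tzinfo=timezone.utc)),
    # Expiry time taken from the incident description.
    ("USERTrust RSA Certification Authority (cross-signed by AddTrust External CA Root)",
     datetime(2020, 5, 30, 10, 48, tzinfo=timezone.utc)),
]

now = datetime(2020, 5, 1, tzinfo=timezone.utc)
flagged = expiring_certificates(chain, now)
for name, not_after in flagged:
    print(f"WARNING: {name} expires {not_after.isoformat()}")
```

A check that looks only at the first entry reports nothing a month before the incident, while the full-chain check flags the cross-signed certificate.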
- Was the service unavailability caused by an issue on the Algolia servers?
No, the service continued working and was unavailable only for outdated systems.
- What should I do to avoid a similar situation happening in the future with Algolia or other services?
Update your OpenSSL to at least version 1.1.1 and LibreSSL to at least 3.2.0. Also update your certificate stores; on Linux, these are often shipped as the “ca-certificates” package.
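As a quick way to see where a given runtime stands, the sketch below compares the TLS library version reported by Python's `ssl` module against the minimum versions named above. The `needs_update` helper and its parsing are hypothetical, and the result only describes the library Python itself is linked against, which may differ from the one used by curl or by other language runtimes on the same machine.

```python
import re
import ssl

# Minimum versions taken from the recommendation above.
MINIMUM = {"OpenSSL": (1, 1, 1), "LibreSSL": (3, 2, 0)}


def needs_update(version_string):
    """Return True/False for a known library, or None if the string is not recognized."""
    match = re.match(r"(OpenSSL|LibreSSL)\s+(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        return None  # unknown library or format; check manually
    library = match.group(1)
    version = tuple(int(part) for part in match.groups()[1:])
    return version < MINIMUM[library]


# ssl.OPENSSL_VERSION is the version string of the library Python was built against.
print(ssl.OPENSSL_VERSION, "-> needs update:", needs_update(ssl.OPENSSL_VERSION))
```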
- Did Algolia’s certificate expire?
No, Algolia’s certificate didn’t expire and is still valid. What expired was one of the 3 versions of the certification authority signing our certificate.
- Was any other provider impacted or was the issue specific to Algolia?
There were unfortunately other providers impacted: Heroku, Stripe, kernel.org, Datadog, Gandi.net, and many others.
- What was the impact to customers?
We detected that approximately 10% of our application clusters were impacted for about 1.5 hours. After this period, a single-digit number of customers remained impacted for up to 3 hours. During the incident, no search queries coming from browsers were impacted.