The Algolia team went to CI/CDay, the first European conference dedicated to CI/CD. CI/CD is often discussed in larger conferences dedicated to infrastructure and scalability. But here, we were a group of 80 people working on CI/CD Platforms specifically, all in the same room, for a full day.
Below is a quick recap of the various topics that were discussed. Unfortunately, there is no recording of the talks.
When you’re a developer, the lens through which you see CI/CD will mostly be the green and red checks on your PR. You might wonder how one could spend a whole day just discussing that specific topic. In reality, there is a whole world of hidden complexity that you discover once you start spending time building such systems.
People have been using CI/CD Platforms to do all kinds of tests, from unit tests to integration tests, performance regression tests or security tests. CI/CD Platforms are also used to assess code quality metrics or check the compatibility of various licenses pulled in from dependencies. They are also used to build, package, and deploy apps and websites of increasing complexity.
With so many different usages, having only two possible statuses (the happy green check mark and the sad red cross) is not enough to represent all outcomes.
What about jobs that have been canceled? For example, there is a job we run every day, but if no relevant file has changed since the last run, we want to cancel it. Should we mark it as a Success or a Failure?
We can't mark it as a Failure, as this is the expected outcome: the job started, but was meant to stop early. We can't mark it as a Success either, because it would artificially inflate our success rate, giving us a false sense of security.
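To make that scenario concrete, here is a minimal Python sketch of how such a daily job could bail out early when nothing relevant has changed. The watched paths, the last-successful-run reference, and the way the outcome gets reported back to the platform are all illustrative assumptions, not any specific vendor's API.

```python
import subprocess
import sys

# Hypothetical list of paths this daily job actually cares about.
RELEVANT_PATHS = ("src/", "package.json")

def relevant_files_changed(since_ref: str) -> bool:
    """Return True if any relevant file changed between `since_ref` and HEAD."""
    changed = subprocess.run(
        ["git", "diff", "--name-only", since_ref, "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    return any(path.startswith(RELEVANT_PATHS) for path in changed)

if __name__ == "__main__":
    # "last-successful-run" is a placeholder for however you tag the previous run.
    if not relevant_files_changed("last-successful-run"):
        print("No relevant changes since the last run: report this as Canceled, not Success or Failure.")
        sys.exit(0)  # the exit code matters less than the status you report upstream
    print("Relevant changes detected, running the real job...")
```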
A Failure is when a build fails. An Error is when the CI/CD itself is broken.
A Failure is a good thing to have! It's the expected workflow. It means you caught a broken path before it hits your users. Developers should be fixing failing builds.
An Error is a bad thing to have. It should not happen. It means there is something broken in the actual infrastructure of your CI. Ops should be fixing builds in error.
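One way to picture the distinction is as a small set of explicit outcome types rather than a single pass/fail boolean. The sketch below is illustrative Python, not any particular platform's data model; the `run` object and its flags are hypothetical.

```python
from enum import Enum, auto

class Outcome(Enum):
    SUCCESS = auto()   # the build ran and everything passed
    FAILURE = auto()   # the code under test is broken -> developers fix it
    ERROR = auto()     # the CI infrastructure itself broke -> ops fix it
    CANCELED = auto()  # the build was stopped early on purpose (e.g. nothing to do)

def classify(run) -> Outcome:
    """Map a finished run to an outcome. `run` is a hypothetical object."""
    if run.infrastructure_broke:      # runner died, network down, out of disk...
        return Outcome.ERROR
    if run.stopped_early_on_purpose:  # e.g. no relevant file changed since last run
        return Outcome.CANCELED
    if run.tests_failed:
        return Outcome.FAILURE
    return Outcome.SUCCESS
```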
At their core, most of the common CI/CD platforms (CircleCI, GitHub Actions, Travis CI, or your own internal one) do the same thing. If you’ve ever worked with more than one, you quickly realize that they use different names for the same concepts, or, even more confusingly, the same name for different concepts. Jobs, runs, steps and workflows do not exactly mean the same thing on different providers.
Our community needs vendor-agnostic terminology to represent all those various steps, moving pieces and status codes.
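As a rough illustration of the mismatch, here is an approximate, from-memory mapping of two providers' terms onto a neutral vocabulary; treat the exact terms as assumptions rather than authoritative documentation for either product.

```python
# Neutral term -> what (roughly) the same concept is called on each provider.
# This mapping is illustrative and may be imprecise or out of date.
NEUTRAL_VOCABULARY = {
    "pipeline (one triggered execution)": {
        "GitHub Actions": "workflow run",
        "CircleCI": "pipeline",
    },
    "workflow (a graph of jobs)": {
        "GitHub Actions": "workflow",
        "CircleCI": "workflow",
    },
    "job (work scheduled on one runner)": {
        "GitHub Actions": "job",
        "CircleCI": "job",
    },
    "step (a single command inside a job)": {
        "GitHub Actions": "step",
        "CircleCI": "step",
    },
}
```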
The CI/CD platform sits at the intersection of developers and ops people. Sounds familiar? Yes, it’s a DevOps platform, shared by both groups.
For developers, the CI/CD platform needs to “Just Work™”. They want to be able to push their code and run it in an environment as close to production as possible to catch any failures in their tests. The platform has to get out of their way so they can focus on what they do best: write code.
For Ops, the CI/CD platform needs to come with all the right tooling and scaffolding so it’s smooth to run and monitor. They want to know when something breaks and why. They want to monitor it and keep an eye on the cost.
All CI/CD platforms probably started as a simple bash script, manually run by one individual on one machine. Today, mature CI/CD Platforms require dedicated Platform Teams to run properly.
There is no one-size-fits-all. You cannot build the perfect pipeline that will work for every engineering team. Back-end, front-end, mobile and machine learning engineers all have very different needs. You can’t build the perfect solution to every problem.
What you can build, though, is a toolbox for each dev team to build its own solution. Developers will be developers; give them some good building blocks and watch them build a magic castle. A good CI/CD platform should be like A4 paper: highly standardized but leaving a lot of room for imagination.
CI/CD platforms create trust. Trust that it will run quickly and correctly. Trust that whatever status it outputs is the truth. Trust that you can deploy the code to production because the CI/CD platform said it was good. You want to have more trust in your CI/CD builds than in your manual testing.
This is why everyone's goal, Devs and Ops alike, is to keep the main branch green. The default state of your CI should be green. Whatever makes your builds go red should become the number one priority to fix. If you don't, people will lose trust in the platform.
When red builds are no longer an exception but become the norm, you get CI fatigue. CI fatigue gets you into the habit of just re-running the build until it eventually succeeds. This in turn puts even more load on the platform, starting a vicious circle in which more red builds erode trust even further.
Having flaky builds is even worse than having failed builds. Flaky builds are a poison that decreases the overall quality of the CI/CD platform for everyone using it.
When in doubt, make it faster. Nobody ever complains that a build was too quick to run. The faster your builds are, the faster you can get your main branch back from red to green.
The fastest job is the one you don’t have to run.
The best DX you can provide is a fully isolated CI/CD environment for each feature branch, one that mimics production as closely as possible. This is also the most expensive solution, and it is common to have a CI/CD infrastructure that costs more than the production one.
Watch the $/PR (dollars per PR) metric: in other words, how much does opening one PR cost? Expose that metric to both developers and Ops to make it tangible for everyone.
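The metric itself is just a division; what matters is computing it regularly and putting it in front of people. A minimal sketch with made-up numbers:

```python
def dollars_per_pr(total_ci_cost_usd: float, prs_opened: int) -> float:
    """Average CI/CD spend per pull request over a given period."""
    return total_ci_cost_usd / prs_opened if prs_opened else 0.0

# Made-up example: $12,000 of CI spend and 800 PRs opened in a month
# works out to $15 per PR.
print(f"${dollars_per_pr(12_000, 800):.2f} per PR")  # -> $15.00 per PR
```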
Those ephemeral environments need to be shut down once the associated PR is merged, or once it has gone stale for too long. The speed optimizations mentioned above (caching, selective testing, etc.) also help reduce the cost. You can also have the default spawned environment be bare-bones and require developers to opt in to the additional elements they need on a PR-by-PR basis.
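Here is a sketch of what that clean-up loop could look like, assuming hypothetical helpers for listing pull requests and destroying environments; your VCS API and infrastructure tooling will differ.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=14)  # arbitrary threshold, tune it to your team's pace

def cleanup_ephemeral_environments(pull_requests, destroy_environment):
    """Tear down per-PR environments for merged, closed, or long-inactive PRs.

    `pull_requests` is any iterable of objects exposing .number, .state and
    .last_activity_at; `destroy_environment` tears down the stack for a given
    PR number. Both are stand-ins for your real APIs.
    """
    now = datetime.now(timezone.utc)
    for pr in pull_requests:
        is_done = pr.state in ("merged", "closed")
        is_stale = now - pr.last_activity_at > STALE_AFTER
        if is_done or is_stale:
            destroy_environment(pr.number)
```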
We learned a lot by spending the day with fellow CI/CD enthusiasts. The topic is broad, covering tech as well as organizational questions. It’s also very deep, and one can spend a whole career building a CI/CD platform. We will definitely come back next year, and this time we might even be on stage, sharing some of our insights.
Tim Carry
Developer Advocate