Vector vs Keyword Search: Why You Should Care
Search has been around for a while, to the point that it is now considered a standard requirement in many ...
Senior Machine Learning Engineer
Search has been around for a while, to the point that it is now considered a standard requirement in many ...
Senior Machine Learning Engineer
With the advent of artificial intelligence (AI) technologies enabling services such as Alexa, Google search, and self-driving cars, the ...
VP Corporate Marketing
It’s no secret that B2B (business-to-business) transactions have largely migrated online. According to Gartner, by 2025, 80 ...
Sr. SEO Web Digital Marketing Manager
Twice a year, B2B Online brings together industry leaders to discuss the trends affecting the B2B ecommerce industry. At the ...
Director of Product Marketing & Strategy
This is Part 2 of a series that dives into the transformational journey made by digital merchandising to drive positive ...
Benoit Reulier &
Reshma Iyer
Get ready for the ride: online shopping is about to be completely upended by AI. Over the past few years ...
Director, User Experience & UI Platform
Remember life before online shopping? When you had to actually leave the house for a brick-and-mortar store to ...
Search and Discovery writer
If you imagine pushing a virtual shopping cart down the aisles of an online store, or browsing items in an ...
Sr. SEO Web Digital Marketing Manager
Remember the world before the convenience of online commerce? Before the pandemic, before the proliferation of ecommerce sites, when the ...
Search and Discovery writer
Artificial intelligence (AI) is no longer just the stuff of scary futuristic movies; it’s recently burst into the headlines ...
Search and Discovery writer
Imagine you are the CTO of a company that has just undergone a massive decade long digital transformation. You’ve ...
CTO @Algolia
Did you know that the tiny search bar at the top of many ecommerce sites can offer an outsized return ...
Director, Digital Marketing
Artificial intelligence (AI) has quickly moved from hot topic to everyday life. Now, ecommerce businesses are beginning to clearly see ...
VP of Product
We couldn’t be more excited to announce the availability of our breakthrough product, Algolia NeuralSearch. The world has stepped ...
Chief Executive Officer and Board Member at Algolia
The ecommerce industry has experienced steady and reliable growth over the last 20 years (albeit interrupted briefly by a global ...
CTO @Algolia
As an ecommerce professional, you know the importance of providing a five-star search experience on your site or in ...
Sr. SEO Web Digital Marketing Manager
Hashing. Yep, you read that right. Not hashtags. Not golden, crisp-on-the-outside, melty-on-the-inside hash browns ...
Search and Discovery writer
We’re just back from ECIR23, the leading European conference around Information Retrieval systems, which ran its 45th edition in ...
Senior ML Engineer
Jul 9th 2019 engineering
Site Reliability Engineers (SREs) design and manage efficient processes and operations, and they keep a company’s infrastructure in healthy working order.
Here at Algolia, our team has grown from 4 to 10 in less than two years, a growth rate similar to the company as a whole. The team has grown not only in number but in how we work together and create sound operational processes.
This blog is about how a group of hard-working individuals, with unique skills and working methods, managed to create a successful SRE team.
My own journey into the SRE field reflects this maturation process. Before I joined Algolia, I was traveling around the world as an Integration Engineer for a telco company. On more than one occasion, I found the opportunity to build small tools that helped me be more productive. This increased my confidence to go further professionally. That’s when I started digging into what an SRE is. Here’s what Google VP Ben Trayner says about SREs:
An SRE is what happens when a software engineer is tasked with what used to be called operations.
Although I wasn’t technically a software engineer, my career followed this same pattern: writing code to manage operational tasks. I was therefore excited by the SRE role and ready for a change.
Every member of the SRE team gets involved in all three of these activities:
Projects: We work on different kinds of projects. Some of them have a direct impact on the business and others help improve the global infrastructure.
Operations: The majority of our infrastructure is bare metal, which means that we require a decent amount of automation and work on installing/destroying/repairing servers. We need to debug live applications and support other teams searching for technical advice. On top of that, we have to contact providers to manage any issue with the providers.
On call: We need to provide support for the whole infrastructure. This means that each one of us has to be on call 1 week every 4 weeks, 24/7.
Every Monday we have a one-hour meeting where we review the previous week and plan the current one. We talk about:
Operations at Algolia has changed over time. For instance, when I joined, operations were performed on a daily basis, which made it difficult to gain a deep context about what tasks had been completed. Our golden rule regarding priority is
Operations are now performed on a weekly basis. You can see the rotation plan below. On top of this, we have two levels of on call. In the end, on call should not really be different from the normal operations we do. Overall, operations is more than maintaining and improving our infrastructure.
In my first days at Algolia, I noticed right away that my colleagues were communicating mainly through Slack – even if they were just a few meters from each other. This felt a bit cold, especially given my natural inclination to get up and talk to people. Additionally, the team was scattered across the office. It didn’t feel like a cohesive unit. For these reasons, communication was difficult.
Interestingly enough, I wasn’t the only one experiencing this. Newcomers arrived and noticed the same issues.
Critical mass probably pushed us in the right direction: there were just too many tasks to continue functioning as we did before. Our first step was to set expectations for every member, so that everyone would know in advance what they can get from each other.
New people bring in unexpected benefits and qualities to the team that the older members wouldn’t have expected. Additionally, if they are open to it, older members can actually learn from the incoming ones and have these new people/qualities change the team for the better. The right mix of old and new is what makes a team great.
One initiative we took was to create a coffee break culture, two days a week. During these breaks, we speak about different things, work or not work related. We get to know each other and communicate better.
Initially each member of the squad worked individually on their given projects, choosing for themselves the subject they wanted to work on. This kind of autonomy and personal initiative wasn’t all that bad, but for a newcomer it was overwhelming; it forced me to switch gears often: to learn and do my daily support and come up with project ideas and complete them, all on my own.
Because we were growing, we quickly realized that we needed to change this, we needed to start working together. The first step was to do some pairing, working on teams of two people. This gave us an opportunity to interact and get to know other members on the team.
On this project, Paul and I worked on finding out how the current deployment worked, how we can recover in a disaster scenario, and how much time the recovery would take.
The project took 2 weeks. Once we were done, we had the choice to change partners or continue working together. At the end of these two weeks we had:
The next two weeks I continued to work with Paul on getting more insights on the new Load Balancer. If you have not read his blog post on one year of load balancing, I suggest you do it.
The main problem we found was that during operations or on call, any request regarding the load balancer had to be forward to Paul. This knowledge sharing has two benefits: first, it provides insights to more than one person; secondly, it offloads tasks from the person having all the knowledge.
Recently, a new member joined Algolia and started working on the load balancer full-time. Dedicated support on the load balancer not only provides the new member with more operational skills, but also gives the team more people to consult.
In this project, three members of the Foundation squad worked together to bring up a new backup solution. The interesting part here is that we decided to start using Scrum methodology. This was a huge success as it allowed us to:
Our efforts have worked so well, and our team has become so stable, that I’ve taken a fresh look at the original “problems” I encountered when I started: the overuse of Slack and a scattered team not sitting together. Today, this is not a problem: we use Slack constantly, and yet the conversation feels natural and direct. And I can sit on a different floor or desk every day and easily collaborate with my team – because I know how the people on my team work. This is what makes for a great SRE experience: effective communication, efficient processes.
Site Reliability Engineer
Powered by Algolia Recommend