In today’s complex world, ops engineers and SREs have to worry not just about the quality of their infrastructure, but also about its cost. Here, I’ll share a few tips on reducing server costs with popular cloud providers, as well as a general approach for those to whom this type of work is new.
When you get hired as an Ops engineer or an SRE, you probably know what you are getting into and what you are supposed to do: things like maintaining servers, developing build and deployment pipelines, provisioning and launching cloud servers, monitoring services, and the like come to mind. One thing that does not usually come to mind when discussing the role of the operations engineer is finance-related activities such as cost analysis and cost reduction.
This, as it turns out, can sometimes be quite a big part of our job.
Who would have thought, for example, that I’d become quite familiar with a term like COGS, or “cost of goods sold”, defined by Investopedia as “the direct costs attributable to the production of the goods sold by a company”. In plain English: “how much does it cost to create a product before selling it”, or, if you prefer plain engineering English: “how much do we spend to have our services running in production”. Basically, this is about cost reduction: you start by creating a budget in order to understand how much you are spending, how much you are going to spend, and whether your costs are going up or down. Let’s look step by step at how to do this with cloud services.
Handling cost in today’s cloud is very similar to other analysis tasks such as monitoring. You have to:
1. Collect the billing data
2. Analyze it
3. Set up alerts and act on what you find
Amazon Web Services (AWS) has a very nice billing dashboard: they basically collect the data for you, so you can skip straight to the analyze part \o/. Google Cloud Platform (GCP) does not currently provide a very convenient way to see your billing data, but it does provide a way to export your billing data to BigQuery.
There are all kinds of ways to analyze your data, but a simple way to start is to look at 2 types of analyses:
1. Monthly Analysis
Month-by-month analysis over a period of a year (or even less than a year) can give you interesting insights, such as: is the cost of the infrastructure increasing at the same rate as the growth of the business?
2. Daily Analysis
Track the changes you have made this week, and see if they have the effect on the monthly budget that you expected.
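If you prefer to pull these numbers programmatically rather than clicking through a dashboard, here is a minimal sketch using boto3’s Cost Explorer API. The date range is just an example, and it assumes your AWS credentials are already configured in the environment:

```python
# Minimal sketch: pull month-by-month unblended cost for the past year
# via the AWS Cost Explorer API. The date range is an example.
import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer lives in us-east-1

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2017-01-01", "End": "2018-01-01"},
    Granularity="MONTHLY",  # switch to "DAILY" for the daily analysis
    Metrics=["UnblendedCost"],
)

for period in response["ResultsByTime"]:
    cost = period["Total"]["UnblendedCost"]
    print(period["TimePeriod"]["Start"], cost["Amount"], cost["Unit"])
```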
AWS has a great tool for cost analysis — the Cost Explorer:
It is part of the billing dashboard and it has very strong analytical abilities: you can analyze cost by date (yearly/monthly/daily), and it has many other dimensions that can be configured, such as which machines and databases you are spending on, when things start going up or down, etc. A good practice here is to tag assets; for example, you can tag an instance by its principal user so you can quickly trace back to the person over-utilizing or under-utilizing a machine. Another good practice is to tag by product, which makes it easy to know how much a project costs at any point in time.
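Here is a hedged sketch of what that tagging practice can look like with boto3; the instance ID and tag values are hypothetical placeholders, and remember that a tag must be activated as a cost allocation tag in the billing console before Cost Explorer can group by it:

```python
# Minimal sketch: tag an EC2 instance by owner and product so spend
# can later be broken down along those dimensions.
import boto3

ec2 = boto3.client("ec2")

ec2.create_tags(
    Resources=["i-0123456789abcdef0"],  # hypothetical instance ID
    Tags=[
        {"Key": "owner", "Value": "eran"},            # hypothetical value
        {"Key": "product", "Value": "log-processing"},  # hypothetical value
    ],
)

# Once the "product" tag is activated as a cost allocation tag,
# Cost Explorer can group spend by it:
ce = boto3.client("ce", region_name="us-east-1")
by_product = ce.get_cost_and_usage(
    TimePeriod={"Start": "2017-12-01", "End": "2018-01-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "product"}],
)
```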
In GCP, once your data is in BigQuery, you can use Google’s Data Studio, or Redash, an open-source tool, to do quite sophisticated analyses. Check out this detailed blog post from Google.
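As a sketch, a query like the following (run here through the google-cloud-bigquery client) gives you monthly cost per service. The export table name is an assumption; use whatever dataset and table you configured when enabling the billing export:

```python
# Minimal sketch: monthly cost per GCP service from a billing export
# table in BigQuery. The table name below is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  FORMAT_TIMESTAMP('%Y-%m', usage_start_time) AS month,
  service.description AS service,
  ROUND(SUM(cost), 2) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`  -- hypothetical table
GROUP BY month, service
ORDER BY month, total_cost DESC
"""

for row in client.query(query).result():
    print(row.month, row.service, row.total_cost)
```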
Both AWS and GCP let you set up billing alerts.
In AWS, you can set up billing alerts (such as notifications of costs exceeding the free tier) in a granular way.
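For instance, here is a minimal sketch of creating such an alarm with boto3. The threshold and SNS topic ARN are hypothetical, and note that the billing metrics only live in us-east-1 and must first be enabled in the billing preferences:

```python
# Minimal sketch: a CloudWatch alarm on the total estimated monthly charge.
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="monthly-bill-over-1000-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,               # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=1000.0,           # hypothetical threshold in USD
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical topic
)
```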
The cost of traffic between different regions and between AWS services differs greatly, so it can be very important to understand where you are transferring data to and from.
One tip: communication between regions is relatively expensive, but it can be cheaper between regions that are close to each other, or when one of them is new. If you want to be multi-regional, one way to save is to find cheaper connections between regions.
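A quick back-of-the-envelope script makes it obvious how much the region pair matters. The per-GB rates and traffic volume below are illustrative placeholders, not real prices; always check your provider’s pricing page for the region pair you actually use:

```python
# Back-of-the-envelope sketch: compare monthly inter-region transfer
# cost at different per-GB rates. All numbers are hypothetical.
RATES_PER_GB = {
    "same-region (private IP)": 0.00,  # hypothetical rate
    "nearby region pair": 0.01,        # hypothetical rate
    "distant region pair": 0.02,       # hypothetical rate
}

monthly_gb = 50_000  # 50 TB/month, hypothetical workload

for route, rate in RATES_PER_GB.items():
    print(f"{route}: ${monthly_gb * rate:,.2f}/month")
```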
Another tip: you can reduce the price of load balancers by simply asking for a discount: CloudFront and CDN pricing is quite negotiable at high volumes.
– AWS: Switch to a new generation of instances. For example, the M3 instance had an inferior CPU and less memory than the newer M4, yet cost more. The recently released C5 instance family has faster CPUs than C3 and C4 and, once again, costs less.
– AWS: Reserved instances allow you to get significant discounts on EC2 compute hours in return for a commitment to paying for instance hours of a specific instance type in a specific AWS region and availability zone for a pre-established time frame (1 or 3 years). Further discounts can be realized through “partial upfront” or “all upfront” payment options.
– AWS: EC2 Spot instances are a way to get EC2 resources at a significant, fluctuating discount, often many times cheaper than standard on-demand prices, if you’re willing to accept that they may be terminated with little to no warning if you underbid (see the price-check sketch after this list). I highly recommend a company by the name of Spotinst that can manage the hard work and uncertainty for you.
– GCP: If your workload is stable and predictable, you can purchase a specific amount of vCPUs and memory at up to a 57% discount off normal prices in return for committing to a usage term of 1 or 3 years.
– GCP: A preemptible VM is an instance that you can create and run at a much lower fixed price than normal instances on GCP. However, Compute Engine might terminate (preempt) these instances if it needs those resources for other tasks, and they last a maximum of 24 hours.
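Before relying on spot capacity, it is worth checking recent spot prices for the instance type you care about. A minimal boto3 sketch:

```python
# Minimal sketch: recent spot prices for a given instance type,
# per availability zone, over the last few hours.
from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

history = ec2.describe_spot_price_history(
    InstanceTypes=["m4.10xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=6),
)

for entry in history["SpotPriceHistory"]:
    print(entry["AvailabilityZone"], entry["SpotPrice"], entry["Timestamp"])
```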
Let’s take the example of a pretty powerful machine, the AWS EC2 m4.10xlarge (40 vCPUs, 160 GiB RAM, 10 Gigabit network), and compare its price with other possibilities and different optimizations (the table is sorted by provider).
| Provider | Machine Type | Price / Month | Comment |
|----------|--------------|---------------|---------|
| AWS – On-Demand | m4.10xlarge | $1,460.00 | |
| AWS – Reserved 1 Year | m4.10xlarge | $904.47 | |
| AWS – Reserved 3 Years | m4.10xlarge | $630.72 | |
| AWS – Spot instance | m4.10xlarge | ~$447.70 | Prices change very frequently |
| GCP – Sustained Price | Custom Instance | $1,054.00 | |
| GCP – Upfront | Custom Instance | $873.77 | |
| GCP – Preemptible | Custom Instance | $314.40 | |
| LeaseWeb | R730XD | $374.99 | 2× 10 cores, 256GB DDR3 RAM, 2× 480GB SSD, 10TB traffic |
| OVH | MG-256 | $365.99 | 20 cores, 256GB RAM, 2× 2TB disks |
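A note on how these monthly figures relate to the hourly prices that cloud providers usually quote: a month is roughly 730 hours, so the on-demand row above works out to about $2.00/hour. A tiny helper for the conversion:

```python
# A month is roughly 730 hours (8,760 hours / 12 months).
HOURS_PER_MONTH = 730

def monthly_cost(hourly_rate: float) -> float:
    """Convert an hourly list price into an approximate monthly cost."""
    return hourly_rate * HOURS_PER_MONTH

print(monthly_cost(2.00))  # 1460.0 -- matches the on-demand row above
```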
The first thing you can easily spot is that bare-metal providers such as LeaseWeb and OVH provide the best bang for the buck, and they also include storage and traffic in the same package. You can also see that those bare-metal providers are much less flexible in terms of machine types.
Another thing we need to consider is that in a cloud environment we pay only for what we use, so if we need a machine for just an hour a day, we don’t have to pay a monthly fee, and this can reduce costs dramatically, especially if we use spot instances or preemptible instances.
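To put numbers on it: at the roughly $2.00/hour on-demand rate implied by the table above, an m4.10xlarge that runs only one hour a day comes to about $60 a month instead of $1,460, before any spot or preemptible discount on top.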
Here at Algolia we actually chose a mix of providers. Using bare metal for the Algolia engine and API was the best decision for us, but we also use Google Cloud Platform for our log processing and analytics, and AWS for many different production and internal services.
The bottom line is that, as always when building and maintaining a robust infrastructure, you need to choose what’s best for your company and your use case. Hopefully, the tips above will help you make the right choices at the right price. Have other tips? We’d love to hear them: @eranchetz, @algolia.
Eran Chetzroni