Coverage: The Most Important Metric in Cloud Costs

AWS is famous for its breadth of services. Developers can pick from an à la carte menu: spin up new compute instances, provision storage, store unlimited files in S3, and so forth. This flexibility makes it a good fit for fast-growing companies and markets with changing demand.

But this flexibility comes at a real cost in real dollars. On-demand resources are the most expensive way to operate in the cloud. Companies such as Dropbox moved off the cloud entirely once they could reliably project their growth and knew the demand for their service. But what do you do if you want cost savings without migrating to your own on-premises hardware?

This is what Coverage is about. Coverage is the percentage of cloud resources covered by upfront financial commitments, and it is primarily used for compute resources. On AWS, coverage shows up as the amount of money customers spend on Reserved Instances (RIs) and Savings Plans.

Calculating Coverage

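According to the AWS docs on coverage, you can calculate the amount of coverage for your account with the following equation, shown here in the cost-based form used in the scenarios below:

Coverage = Reserved instance cost / (On demand cost + Reserved instance cost)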
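As a minimal sketch, the cost-based coverage ratio used in the scenarios in this post can be written as a small Python function. The function name and its two parameters are illustrative, not part of any AWS API, and the example figures are made-up round numbers.

```python
def coverage(reserved_cost: float, on_demand_cost: float) -> float:
    """Return the share of spend covered by commitments, as a fraction from 0 to 1."""
    total = reserved_cost + on_demand_cost
    if total == 0:
        raise ValueError("no spend to measure coverage against")
    return reserved_cost / total


# Example with made-up round numbers: $300 committed, $700 on demand.
print(f"{coverage(300, 700):.0%}")  # prints 30%
```

Both inputs should cover the same period (for example, one month) so that the ratio is meaningful.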

Let’s use everybody’s favorite AWS pricing tool, EC2Instances.info, to run through some coverage scenarios. Note: we are working on improving the experience for EC2Instances in the GitHub repo, and you can also chat with us about the site or these estimates in the Vantage Slack.

Scenario 1: Travel Website

Here we are running a medium-sized travel website like TripActions serving 1M requests per day. The site calculates routes between places, serves assets like images, and refers bookings to airlines, buses, and hotels. Most of the web traffic is served on new AWS Graviton instances, but there is a crawler that fetches new prices and runs on R4 instances, and the database is on RDS.

The pricing for this architecture is:

| Instance type | Reserved price per month | On Demand price per month |
| --- | --- | --- |
| t4g.2xlarge | $123 | $196 |
| r4.8xlarge | $981 | $1,553 |
| db.t3.large (Oracle) | $63 | $99 |

As the travel site switched to T4g instances, it bought 5 reserved instances and uses 8 on-demand instances. It has 2 reserved R4s, but it recently added some new sites to crawl, so it also has 3 on-demand R4s. Finally, database workloads are fairly consistent, so there are 5 reserved db.t3s and only 2 on-demand db.t3s. Let’s calculate!

Reserved instance cost = 5 x $123 + 2 x $981 + 5 x $63 = $2,892

On demand cost = 8 x $196 + 3 x $1,553 + 2 x $99 = $6,425

Coverage = $2,892 / ($6,425 + $2,892) = 31%

What does it mean for this travel site to have 31% coverage? It means that 69% of its cloud spend is running on demand, the most expensive way to pay! This site could save tens of thousands of dollars annually by making upfront commitments, without changing its infrastructure at all.
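To make the Scenario 1 arithmetic reproducible, here is a short Python sketch that recomputes the costs and coverage from the prices and instance counts given above. The savings estimate rests on an assumption not stated in the scenario: that every on-demand instance could instead run at the reserved monthly rate.

```python
# (instance type, reserved $/month, on-demand $/month, reserved count, on-demand count)
FLEET = [
    ("t4g.2xlarge", 123, 196, 5, 8),
    ("r4.8xlarge", 981, 1553, 2, 3),
    ("db.t3.large", 63, 99, 5, 2),
]

reserved_cost = sum(r_price * r_n for _, r_price, _, r_n, _ in FLEET)
on_demand_cost = sum(od_price * od_n for _, _, od_price, _, od_n in FLEET)
coverage = reserved_cost / (reserved_cost + on_demand_cost)

# Assumption: each on-demand instance could be moved to the reserved rate.
monthly_savings = sum(
    (od_price - r_price) * od_n for _, r_price, od_price, _, od_n in FLEET
)

print(reserved_cost, on_demand_cost)          # prints 2892 6425
print(f"{coverage:.0%}")                      # prints 31%
print(monthly_savings, monthly_savings * 12)  # prints 2372 28464
```

Under that assumption the site would save roughly $28,000 a year, which is in line with the "tens of thousands" estimate.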

Scenario 2: Model Hub

Model hubs like Roboflow or Hugging Face are a new type of website for machine learning engineers. They host large AI models and often have training and deployment services attached. This means that storage optimized instances like I3s and GPUs like P2s are key pieces of the cloud architecture.

The pricing for this architecture is:

| Instance type | Reserved price per month | On Demand price per month |
| --- | --- | --- |
| x2iedn.12xlarge | $4,498 | $7,304 |
| p2.8xlarge | $3,587 | $5,256 |

This model hub is taking advantage of the X2's 100 Gigabit networking, but the number of models it stores keeps growing. It has committed to 5 reserved X2s and recently added 5 on-demand X2s. Beyond instances, data egress would be an issue, and other providers like Cloudflare are priced competitively with CloudFront and S3. Training new models is a new product, and the team is only now converging on what the demand patterns look like. For its GPU servers, this model hub has only 12 reserved P2s and 20 on demand.

Reserved instance cost = 5 x $4,498 + 12 x $3,587 = $65,534

On demand cost = 5 × $7,304 + 20 × $5,256 = $141,640

Coverage = $65,534 / ($141,640 + $65,534) = 32%

Unfortunately, this situation is common with machine learning companies, which is why programs like Y Combinator have higher cloud credit allowances specifically for AI startups.

Scenario 3: Social Media Site

The last coverage scenario we are profiling is a large social media site, like Reddit. To serve requests, this site uses C5 instances because for each user we have to compute a unique newsfeed. There’s a large amount of media content like images and videos and so the site uses ElastiCache. Finally, we have to crunch user data to sell ads so Redshift and data warehousing is a key cost for us.

The pricing for this architecture is as follows. We are using 730 hours in a month and all upfront, 1 year term commitments for the reserved instances.

| Instance type | Reserved price per month | On Demand price per month |
| --- | --- | --- |
| c5d.24xlarge | $2,120 | $3,364 |
| cache.m6g.16xlarge | $2,363 | $3,334 |
| ra3.16xlarge | $6,285 | $9,519 |

Reddit sees north of 50 million visitors a day so we are talking about a significant cloud spend for our imaginary social media site. Conservatively a site of this size should be spending over ten million a month. However, at this scale there is greater sophistication about traffic and load and thus more reserved instances are in use.

Reserved instance cost = 750 × $2,120 + 300 × $2,363 + 100 × $6,285 = $2,927,400

On demand cost = 100 × $3,364 + 100 × $3,334 + 50 × $9,519 = $1,145,750

Coverage = $2,927,400 / ($1,145,750 + $2,927,400) = 72%

72% coverage is pretty good! But notice there is still over $1,000,000 a month in on demand spend that we could further optimize. At larger scales and higher coverage, making commitments can make an even greater difference than our two other, smaller sites.

How to Increase Coverage on AWS

Every site above could save money on AWS. And the great news is there’s no re-architecting or engineering required. The two financial primitives for achieving these savings are Reserved Instances and Savings Plans. Reserved Instances have been around for a while and they are simple...at first. Developers know the instances they need, they have a rough idea of how their app is growing, and they buy a 1 year or 3 year term for the instance.

A better way to save is often through AWS Savings Plans. Savings Plans only cover EC2, Lambda and Fargate and don’t apply to other services with Reserved Instance support, for example RDS. But, for compute services like EC2, Fargate, and Lambda, they offer more flexibility. For example, let’s say a new Graviton processor comes out and the travel site above wants to move from their current generation to the next generation. If they bought a reserved instance with 10 months left, they would be stuck with that commitment. Or, they could sell their RI on the RI marketplace.

A more optimal outcome would be to buy 12 months of compute through a Savings Plan. In that scenario, the travel site could easily move to the newer generation instances, or move to a containerized architecture with Fargate, or a serverless architecture with Lambda. All they have done is made financial commitments instead of specific instance type commitments and AWS will always pass on the highest discount possible to the customer dependent upon the underlying compute being used.

Let’s take the example of the Model Hub and max out the Savings Plan using the AWS Savings Plan tool. Now there is nearly a 50% discount on the p2.8xlarge. If we apply this discount to all 32 of our P2 instances, we save an additional $58,695 per month, even though we already had reserved instances!
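As a rough consistency check on these figures, the Python sketch below backs out the per-instance Savings Plan price implied by the quoted $58,695. It assumes, as one plausible reading rather than a stated method, that the savings are measured against the current P2 spend from the Model Hub scenario (12 reserved plus 20 on demand) and that all 32 instances move to a single Savings Plan rate.

```python
# Figures from the Model Hub scenario and the quoted savings.
P2_RESERVED_COUNT, P2_RESERVED_PRICE = 12, 3587
P2_ON_DEMAND_COUNT, P2_ON_DEMAND_PRICE = 20, 5256
QUOTED_MONTHLY_SAVINGS = 58_695

instances = P2_RESERVED_COUNT + P2_ON_DEMAND_COUNT
current_spend = (
    P2_RESERVED_COUNT * P2_RESERVED_PRICE
    + P2_ON_DEMAND_COUNT * P2_ON_DEMAND_PRICE
)

# Assumption: all 32 instances are billed at one Savings Plan rate afterwards.
implied_price = (current_spend - QUOTED_MONTHLY_SAVINGS) / instances
implied_discount = 1 - implied_price / P2_ON_DEMAND_PRICE

print(instances, current_spend)    # prints 32 148164
print(f"{implied_price:.2f}")      # prints 2795.91
print(f"{implied_discount:.1%}")   # prints 46.8%
```

An implied discount of about 47% off the on-demand price is consistent with the "nearly 50%" figure.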

Pitfalls with Reserved Instances

As we said in a previous blog post:

Unfortunately, managing Reserved Instances can get complicated very quickly. Given that Reserved Instances are always tied to a specific instance type, DevOps teams have 400+ different instance types with choices spread across storage size, platforms, architectures, regions and availability zones. Finding the best price for the optimal virtual machine configuration can be daunting.

But that’s not all. There are organizational problems, sometimes called cloud cost governance problems, that come with RIs. Let’s say an important engineer purchased a 3 year reservation and left the company a year later. When that reservation expires, all of that compute suddenly runs on demand, and the organization may spend tens to hundreds of thousands of dollars extra until someone notices and renews the reservation. Multiply that by the 400+ instance types described above and you end up with a situation where RIs are expiring every week, cloud costs are spiking, and the flexibility and dream of the cloud becomes a nightmare.
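One simple guard against the expiry problem described above is to keep an inventory of reservations and flag the ones ending soon. The Python sketch below is illustrative only: the `Reservation` record, the `expiring_soon` helper, the 30-day window, and the sample data are all hypothetical, and a real inventory would be loaded from your billing or reservation data.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Reservation:
    instance_type: str
    owner: str       # who purchased the reservation
    end_date: date   # last day the reservation applies


def expiring_soon(reservations, today, window_days=30):
    """Return reservations whose end date falls within the next window_days."""
    cutoff = today + timedelta(days=window_days)
    return [r for r in reservations if today <= r.end_date <= cutoff]


# Hypothetical sample inventory.
inventory = [
    Reservation("r4.8xlarge", "alice", date(2024, 3, 10)),
    Reservation("t4g.2xlarge", "bob", date(2024, 9, 1)),
]

for r in expiring_soon(inventory, today=date(2024, 3, 1)):
    print(f"{r.instance_type} (bought by {r.owner}) expires {r.end_date}")
# prints: r4.8xlarge (bought by alice) expires 2024-03-10
```

Recording the purchaser alongside each reservation makes it easier to spot commitments whose owner has since left the team.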

Autopilot and Automating Coverage

Happily, there is a better way. Vantage started with cost transparency, providing better cost reports and cost recommendations that developers could use to take action and get a handle on their cloud spend. The next step is for Vantage to automate savings on developers’ behalf.

Autopilot is an upcoming feature that automatically reserves instances and uses Savings Plans to realize up to 70% savings on AWS. The calculations above, the selection of plans and terms, and the renewal of RIs upon expiration are all handled automatically.