Practical Tips for Cloud Cost Governance

by Vantage Team


Cloud Cost Governance is the reporting, communication, and automation process for teams to consistently know and reduce cloud costs. Vantage has helped thousands of teams set up Governance for the cloud, learning and sharing best practices along the way. The recent launches of more Filter dimensions for Cost Reports and Showback through Cost Allocation are directly inspired by our learnings.

Governance does not have to be some ill-defined concept. Here we set up Governance in order, from visibility to accountability to automation. Each step is practical and immediately actionable:

  1. Increasing Tag Coverage
  2. Quick Cost Savings Wins
  3. Assigning Issues to apply Savings
  4. Showback and Chargeback
  5. Automating Governance

By the end, the reader will have more than “frameworks” or “things to think about”. These are real techniques and tools that we see FinOps teams using every week as they implement Cloud Cost Governance.

Untagged Resources

A best practice with cloud resources is to tag them so that the costs per Service, Team, or Org can be measured, inclusive of all resources. For example, tag Redshift instances, S3 buckets, and MSK Streams so that the data team's spend can be separated out from everyone else's.

A report listing untagged resources across multiple cloud providers.

Getting to 100% tagging coverage is an aspirational goal for infra teams and there are two tools that can be combined to make this happen.

Untagged Resources Report. As shown in the image above, Filters on Cost Reports can be used to show only the costs for untagged resources, and this even works across clouds. Teams save this report and add it to their Overview dashboard. When the spend of this report increases, it means that tagging coverage is slipping.

Untagged Resources Notifications. In combination with a saved report, teams can create a report notification that can send the spend of untagged resources to a channel in Slack or Teams every week. Over time, as infrastructure teams implement Tag Policies to force tagging or script the inclusion of tags in CI workflows, we can expect the spend of this report to go to $0!
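To make the idea of "spend of untagged resources" concrete, here is a minimal sketch of the calculation behind such a report. It does not use the Vantage API. The line item shape, the `team` tag key, and the resource names are all assumptions for illustration; in practice the records would come from your billing export.

```python
from collections import defaultdict

REQUIRED_TAG = "team"  # hypothetical tag key; use whatever your tag policy mandates


def untagged_spend(line_items, required_tag=REQUIRED_TAG):
    """Return (untagged cost per provider, tagging coverage as a fraction of spend)."""
    untagged = defaultdict(float)
    total = 0.0
    for item in line_items:
        total += item["cost"]
        # A missing or empty required tag counts as untagged.
        if not item.get("tags", {}).get(required_tag):
            untagged[item["provider"]] += item["cost"]
    covered = total - sum(untagged.values())
    coverage = covered / total if total else 1.0
    return dict(untagged), coverage


# Made-up billing line items spanning two providers.
line_items = [
    {"provider": "aws", "resource": "redshift-cluster-1", "cost": 600.0, "tags": {"team": "data"}},
    {"provider": "aws", "resource": "s3-bucket-logs", "cost": 150.0, "tags": {}},
    {"provider": "gcp", "resource": "gce-instance-7", "cost": 250.0, "tags": {"env": "prod"}},
]

untagged, coverage = untagged_spend(line_items)
print(untagged, f"{coverage:.0%}")
```

Measuring coverage by spend rather than by resource count keeps the focus on the untagged resources that actually move the bill.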

First Places to Look

After setting up the untagged resources report, you can inspect a few areas right away to find cost savings.

Egress out of CloudFront, Cloud CDN, and Fastly

Compute and data storage are the first two cost centers that come to mind, and indeed EC2 and S3 are #1 and #3 respectively in the Cloud Costs Leaderboard. But a surprise to some teams can be just how much of the bill belongs to data egress. Be sure to connect all possible providers to get a full picture of data egress across the entire company.

Two network changes may be appropriate. First, Cloudflare has a different pricing structure than CloudFront and switching CDN providers could mean significant savings. Second, many teams are pushing private traffic over public interfaces and can instead use VPC Endpoints on AWS to route data in the most efficient way.

(AWS Specific) Switch to Graviton

Graviton instances run on newer, Arm-based processors custom built by AWS. They are cheaper than x86 instances for the same workloads, and migrating to them is easy, especially for RDS.

Data from instances.vantage.sh

As shown in the image above, there are instant savings for using Graviton over Intel processors, and AWS is expanding the number of services this is true for, including EBS Volumes. Graviton is also available for EC2, but EC2 is not a managed service, so some recompiling or build system updates will be needed for your application to target ARM.
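A quick way to size the opportunity is to compare hourly prices across a fleet. The sketch below is only arithmetic. The two hourly prices are placeholders, not real AWS pricing; look up current numbers for your instance families on instances.vantage.sh. The 730 hours per month figure is a common approximation.

```python
HOURS_PER_MONTH = 730  # common approximation for one month of continuous usage


def monthly_savings(x86_hourly, arm_hourly, instance_count, hours=HOURS_PER_MONTH):
    """Return (absolute monthly savings, savings as a fraction of the x86 cost)."""
    x86_cost = x86_hourly * instance_count * hours
    arm_cost = arm_hourly * instance_count * hours
    savings = x86_cost - arm_cost
    return savings, savings / x86_cost


# Placeholder prices for illustration only.
savings, fraction = monthly_savings(x86_hourly=0.20, arm_hourly=0.18, instance_count=10)
print(f"${savings:.2f} per month ({fraction:.0%})")
```

This ignores any performance difference between the two architectures, so treat the result as a first estimate rather than a forecast.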

More Places for Quick Savings

The community-built Cloud Costs Handbook has dozens of tips like these for reducing costs on each service. If you have connected Vantage, it can profile your infrastructure automatically and surface Cost Recommendations with specific amounts of savings.

Accountability with Issues

Now that the team has a handle on what resources exist and has found a few areas to apply savings, one of the most challenging parts of governance kicks in: accountability. Someone must be responsible for tagging the resources or making the configuration changes needed to realize cost savings.

Create Issues for teammates and link them to a cost report where you expect to see the result.

Issues help infrastructure teams keep track of what cost management tasks are in flight. They function exactly like GitHub or Jira issues and include email notifications.

Closing internal tickets in those systems can be tough. But knowing that a Vantage Issue is resolved is actually very easy - costs should go down! The best practice with these is to link the Cost Report that we expect to see the drop in. Here’s a detailed workflow:

  1. Create a specific Cost Report, e.g. for spiking S3 egress costs
  2. Create an Issue that assigns responsibility
  3. Attach the Cost Report to the Issue
  4. Monitor your inbox
  5. When the Issue shows up as closed, click through to the Cost Report to see that costs are indeed lower.
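The last step of the workflow above, confirming that costs are indeed lower, can also be checked with a few lines of code. This is a sketch under assumptions: the daily cost series would be exported from the linked Cost Report, and both the function name and the 5% threshold are hypothetical choices, not part of any Vantage feature.

```python
from statistics import mean


def cost_dropped(daily_costs, closed_index, min_drop=0.05):
    """Check whether average daily cost fell by at least min_drop after an Issue closed.

    daily_costs  -- daily cost values in chronological order
    closed_index -- index of the first day after the Issue was closed
    min_drop     -- required relative decrease (0.05 means 5%)
    """
    before = daily_costs[:closed_index]
    after = daily_costs[closed_index:]
    if not before or not after:
        raise ValueError("need at least one day of cost on each side of the close date")
    return mean(after) <= mean(before) * (1 - min_drop)


# Made-up daily S3 egress costs: three days before the fix, three days after.
print(cost_dropped([100, 110, 90, 60, 70, 50], closed_index=3))
```

Comparing averages over a few days on each side smooths out single-day noise, which matters for spiky costs such as egress.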

Showback and Chargeback

Once the issue has been resolved, there are two concepts in finance and IT that keep organizations honest. Showbacks are documents that departments can send to each other expressing the cloud cost to run their service. Chargebacks are more rigorous and actually involve departments paying for services internally.

Cost Allocation

Using Cost Allocation to showback a percentage of a service cost to a specific team

Whichever process you choose, you can use Percentage Based Cost Allocation in your reporting tools to partition billing among teams most accurately. This can be especially useful in a scenario like Support:

Allocate a shared Support bill to each member account where each member account represents an engineering team. Rather than seeing one large Support bill in aggregate, customers can allocate only a percentage of it based upon the number of member accounts in their organization.
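The Support scenario comes down to simple arithmetic. Here is a minimal sketch of percentage-based allocation, assuming an even split across member accounts. The account names, the bill amount, and both function names are made up for illustration and are not how Vantage implements Cost Allocation.

```python
def even_split(accounts):
    """Give every member account the same percentage of a shared cost."""
    if not accounts:
        raise ValueError("at least one account is required")
    share = 100.0 / len(accounts)
    return {account: share for account in accounts}


def allocate(shared_cost, percentages):
    """Partition a shared cost according to percentages that must sum to 100."""
    total_pct = sum(percentages.values())
    if abs(total_pct - 100.0) > 1e-6:
        raise ValueError(f"percentages sum to {total_pct}, expected 100")
    # Rounding to cents can leave a few cents unallocated; reconcile if that matters.
    return {
        account: round(shared_cost * pct / 100.0, 2)
        for account, pct in percentages.items()
    }


# Hypothetical member accounts, one per engineering team.
accounts = ["platform", "data", "web", "mobile"]
showback = allocate(1200.00, even_split(accounts))
print(showback)
```

The same `allocate` function accepts uneven percentages, for example when one team should carry a larger share of a shared service.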

Automation and Maintaining a Coverage Number

So far we have dealt with workflows to find costs and manage people. The next step is automation to keep costs in line. We see many infrastructure teams work to maintain a coverage percentage for their servers. Coverage in this case refers to the percentage of cloud spend that is not on-demand.

It is real work to maintain a high coverage number. Reserved Instances and Savings Plans expire and teams have changing needs for cloud resources.
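Using the definition above, coverage is the share of spend that is not on-demand. The sketch below computes it from spend grouped by pricing model and shows how an expiring commitment pulls the number down. The category names and dollar amounts are assumptions for illustration.

```python
def commitment_coverage(spend_by_pricing_model):
    """Return the fraction of spend that is not on-demand."""
    total = sum(spend_by_pricing_model.values())
    if total == 0:
        return 0.0
    on_demand = spend_by_pricing_model.get("on_demand", 0.0)
    return (total - on_demand) / total


# Made-up monthly spend, grouped by how it was purchased.
current = {"on_demand": 300.0, "savings_plan": 500.0, "reserved": 200.0}
print(f"{commitment_coverage(current):.0%}")

# If the Reserved Instances expire and are not renewed,
# the same usage falls back to on-demand pricing.
after_expiry = {"on_demand": 500.0, "savings_plan": 500.0}
print(f"{commitment_coverage(after_expiry):.0%}")
```

For simplicity the second scenario keeps total spend constant. In reality the on-demand rate is higher than the discounted rate, so the bill would rise as coverage falls.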

Autopilot is a managed service which can automatically buy and sell discounted reserved instances and maintain a high coverage number for engineering teams. It uses financial primitives made available by AWS and so there are no config changes needed and even no upfront commitments.

Cloud Cost Governance

Cost Governance is a challenging initiative to take on. But teams that get started and follow the steps above can wake up to find that cloud costs, previously the fastest growing line item and a headache for every team, are now easily understood and managed. This week Vantage launched more dimensions for Filters, so getting started on Governance with untagged resources and data egress costs is simpler. On Twitter there are some more examples of the new types of Cost Reports that you can create.

As we share what workflows we recommend and our learnings from the Vantage community, please join in the FinOps Slack to participate in a more transparent and cheaper future for the cloud.