Vantage Launches Azure GPU Kubernetes Cost Monitoring

Track, allocate, and optimize GPU-backed workloads running on Azure Kubernetes Service.

Author: Vantage Team

Today, Vantage is announcing support for Azure GPU Kubernetes cost monitoring, enabling customers to track, allocate, and optimize GPU-backed workloads running on Azure Kubernetes Service (AKS). Customers can now view GPU costs alongside their other cloud and Kubernetes spend directly in the Vantage console.

Previously, customers running GPU workloads in AKS were not able to break out costs by processor type, as Azure billing data provides all-in host-level pricing without visibility into the resources that were actually consumed. This meant that while customers were able to view utilization of GPUs, they were not able to calculate idle costs from GPU resources.

Now, with the launch of Azure GPU Kubernetes cost monitoring, Vantage customers can monitor GPU consumption with the Vantage Kubernetes Agent and perform cost allocation and chargeback specifically on GPU costs. The Vantage Kubernetes Agent enables customers to isolate GPU costs per cluster, namespace, pod, or label, which can then be used for reporting in Cost Reports, allocation via Virtual Tags, or utilization analysis in Kubernetes Efficiency Reports.

Azure GPU Kubernetes cost monitoring is available to all Vantage customers starting today. To get started, deploy the Vantage Kubernetes Agent in your AKS clusters and connect your Azure accounts via the Integrations page. If you have already deployed the Vantage Kubernetes Agent to AKS clusters, no update is needed. For more details, see the Vantage documentation.

Frequently Asked Questions

1. What is being launched today?

Vantage is launching Azure GPU Kubernetes cost monitoring, which allows customers to view and allocate GPU-backed AKS workloads in the Vantage console via the Vantage Kubernetes Agent.

2. Who is the customer?

Any customer running GPU workloads on Azure Kubernetes Service (AKS).

3. How much does this cost?

There is no additional cost for Azure GPU cost monitoring.

4. How are GPU costs calculated?

When an instance includes GPUs, 95% of the node's cost is allocated to GPU memory. The number of GPUs a pod requests determines how much of the total GPU memory is allocated to that pod. Idle costs are calculated from the used and total memory of the GPUs allocated to the pod, down to the container level (i.e., idle_memory = total_allocated_memory - used_memory).
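The calculation above can be sketched in a few lines of Python. This is an illustrative sketch, not Vantage's implementation: the names `PodGpuUsage` and `allocate_gpu_costs` are hypothetical, and it assumes that the GPU share of the node cost is spread evenly across every MiB of GPU memory on the node.

```python
from dataclasses import dataclass

# From the FAQ: 95% of a GPU node's cost is allocated to GPU memory.
GPU_COST_SHARE = 0.95


@dataclass
class PodGpuUsage:
    """Hypothetical record of one pod's GPU request and observed usage."""
    name: str
    gpus_requested: int     # whole GPUs requested by the pod
    used_memory_mib: float  # GPU memory the pod actually used


def allocate_gpu_costs(node_cost, node_gpu_count, memory_per_gpu_mib, pods):
    """Return {pod name: (allocated_cost, idle_cost)} for one GPU node.

    Assumption: cost is priced evenly per MiB of GPU memory on the node.
    """
    gpu_cost = node_cost * GPU_COST_SHARE
    total_memory = node_gpu_count * memory_per_gpu_mib
    cost_per_mib = gpu_cost / total_memory

    result = {}
    for pod in pods:
        # The number of GPUs requested sets the pod's share of total memory.
        allocated_memory = pod.gpus_requested * memory_per_gpu_mib
        # idle_memory = total_allocated_memory - used_memory
        idle_memory = allocated_memory - pod.used_memory_mib
        result[pod.name] = (
            allocated_memory * cost_per_mib,
            idle_memory * cost_per_mib,
        )
    return result
```

For example, on a node that costs $10.00 per hour with 4 GPUs of 16,384 MiB each, a pod that requests 2 GPUs and uses 8,192 MiB is allocated $4.75 per hour, of which about $3.56 is idle.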

5. Which GPU manufacturers are supported?

At this time, only metrics exported by NVIDIA/dcgm-exporter are supported. If you have a different manufacturer you want to see supported, contact support@vantage.sh.

6. How can I see the GPU idle costs in Vantage?

To view GPU idle costs, navigate to any Kubernetes Efficiency Report and set the Category filter to gpu. The report then shows GPU idle and total costs by cluster. You can add filters and change the grouping to drill into more specific views of the data.

7. What needs to be installed on my cluster for the Vantage Kubernetes Agent to collect this data?

Usage data is collected via NVIDIA/dcgm-exporter. This is included as part of the NVIDIA/gpu-operator, but it can also be installed independently. The agent scrapes the exporter directly and exposes configuration for the namespace, service name, port name, and path. The default values are configured for the gpu-operator default case.
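To give a sense of the data involved, here is a minimal Python sketch that reads exporter output in the Prometheus text format and sums GPU memory per pod. It is not the agent's code. It assumes the exporter publishes the framebuffer metrics `DCGM_FI_DEV_FB_USED` and `DCGM_FI_DEV_FB_FREE` (in MiB) with `namespace` and `pod` labels; whether the agent reads exactly these fields is an assumption, and the function name `gpu_memory_by_pod` is hypothetical. The parser is deliberately simple and does not handle escaped quotes or braces inside label values.

```python
import re
from collections import defaultdict

# One sample line looks like: metric_name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Assumed metric names for used and free GPU framebuffer memory.
WANTED = {"DCGM_FI_DEV_FB_USED": "used", "DCGM_FI_DEV_FB_FREE": "free"}


def gpu_memory_by_pod(metrics_text):
    """Sum used and free GPU memory (MiB) per (namespace, pod)."""
    totals = defaultdict(lambda: {"used": 0.0, "free": 0.0})
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") not in WANTED:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("namespace", ""), labels.get("pod", ""))
        totals[key][WANTED[match.group("name")]] += float(match.group("value"))
    return dict(totals)
```

Summing across GPUs per pod mirrors the whole-GPU model described in this FAQ, where a pod's allocated memory is the memory of every GPU it requested.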

8. Are fractional GPU requests supported?

At this time, the Vantage Kubernetes Agent supports only whole-GPU requests, which are the most common case. If you use fractional GPUs and want those represented, please contact support@vantage.sh.

9. Is GPU power usage considered as part of these calculations?

No, GPU utilization is not factored into efficiency; only GPU memory is tracked. If you have a workload that specifically needs GPU utilization, please contact support@vantage.sh.

10. Can I see overall cluster idle for GPU memory?

Yes, when you group a report by namespace, an _idle_ namespace is displayed that includes the idle GPU costs.

11. Are there Kubernetes rightsizing recommendations for GPU memory?

At this time, GPU efficiency is not included in rightsizing recommendations.

12. Are there any GPU-specific settings I need to configure on the agent to begin collecting these metrics?

Yes. Install or upgrade the agent with the following value set: --set agent.gpu.usageMetrics=true

13. Are there any requirements for installing the NVIDIA operator?

See the documentation for details on the steps needed to install the operator.

14. What is the minimum version number required for GPU tracking?

You must be running Vantage Kubernetes Agent version 1.0.26 or higher.
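If you check agent versions in a script, compare them numerically, not as strings: a plain string comparison would rank "1.0.9" above "1.0.26". The following Python sketch shows one way to do that. The helper names are hypothetical, and it assumes plain dotted versions with an optional leading "v" and no pre-release suffixes.

```python
# Minimum agent version for GPU tracking, per the FAQ.
MIN_AGENT_VERSION = (1, 0, 26)


def parse_version(version):
    """Turn '1.0.26' or 'v1.0.26' into the tuple (1, 0, 26)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def supports_gpu_tracking(version):
    """Return True if the given agent version is 1.0.26 or higher."""
    # Tuples compare element by element, so (1, 0, 9) < (1, 0, 26).
    return parse_version(version) >= MIN_AGENT_VERSION
```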
