Engineering teams run inference workloads on Kubernetes clusters for easier scaling and job management. But GPUs are among the most expensive resources you can run in the cloud, and in Kubernetes they are easy to set up and then forget about. Most teams have neither visibility into GPU costs nor cost-saving measures in place for GPU workloads, so they often overpay without knowing it.
Kubernetes GPU Cost Benefits
Managing GPUs with Kubernetes is easier than provisioning standalone GPU VMs and administering them manually, especially when workloads vary throughout the day or week. It also brings cost visibility and optimization benefits.
For one, you can configure your cluster to run multiple jobs on the same GPU. Instead of dedicating an entire GPU to a single workload, Kubernetes can help you maximize usage across jobs—saving you both capacity and money.
It also gives you deeper visibility into GPU usage. With the right tools in place, you can attribute GPU memory consumption to specific jobs, users, or namespaces. Combined with a scheduler that supports autoscaling, Kubernetes can dynamically consolidate workloads and scale GPU nodes up or down as needed.
In contrast, running standalone GPU VMs often leads to over-provisioning and idle capacity, without the insight or flexibility to fix it.
Kubernetes GPU Pricing for AWS, Azure, and GCP
Each of the three main cloud providers has its own GPU offerings for Kubernetes. They differ in GPU model availability, pricing, and regional coverage, but common challenges such as idle capacity, memory inefficiency, and lack of observability persist across all three platforms.
Cloud pricing is typically per instance-hour, not based on actual GPU usage. You’re billed for the full GPU allocation, regardless of utilization. Billing dimensions usually include: GPU instance size and family, number of GPUs per node, total node uptime, and any attached storage or networking resources.
Pricing below is listed for the US East region.
AWS EKS GPUs
Charges include the underlying EC2 instances, a cluster fee of $0.10 per cluster per hour (rising to $0.60 per hour for clusters on extended Kubernetes version support), and additional per-second charges when EKS Auto Mode is enabled.
EC2 GPU Instance | On-Demand Hourly Cost | vCPU | Memory (GiB) | GPU Memory (GiB) |
---|---|---|---|---|
g4dn.2xlarge | $0.75 | 8 | 32 | 16 |
g4dn.4xlarge | $1.20 | 16 | 64 | 16 |
g4dn.8xlarge | $2.18 | 32 | 128 | 16 |
g4dn.12xlarge | $3.91 | 48 | 192 | 64 |
g4dn.16xlarge | $4.35 | 64 | 256 | 16 |
g4dn.metal | $7.82 | 96 | 384 | 128 |
g4dn.xlarge | $0.53 | 4 | 16 | 16 |
g5.2xlarge | $1.21 | 8 | 32 | 24 |
g5.4xlarge | $1.62 | 16 | 64 | 24 |
g5.8xlarge | $2.45 | 32 | 128 | 24 |
g5.12xlarge | $5.67 | 48 | 192 | 96 |
g5.16xlarge | $4.10 | 64 | 256 | 24 |
g5.24xlarge | $8.14 | 96 | 384 | 96 |
g5.48xlarge | $16.29 | 192 | 768 | 192 |
g5.xlarge | $1.01 | 4 | 16 | 24 |
g5g.2xlarge | $0.56 | 8 | 16 | 16 |
g5g.4xlarge | $0.83 | 16 | 32 | 16 |
g5g.8xlarge | $1.37 | 32 | 64 | 16 |
g5g.16xlarge | $2.74 | 64 | 128 | 32 |
g5g.metal | $2.74 | 64 | 128 | 32 |
g5g.xlarge | $0.42 | 4 | 8 | 16 |
g6.2xlarge | $0.98 | 8 | 32 | 24 |
g6.4xlarge | $1.32 | 16 | 64 | 24 |
g6.8xlarge | $2.01 | 32 | 128 | 24 |
g6.12xlarge | $4.60 | 48 | 192 | 96 |
g6.16xlarge | $3.40 | 64 | 256 | 24 |
g6.24xlarge | $6.68 | 96 | 384 | 96 |
g6.48xlarge | $13.35 | 192 | 768 | 192 |
g6.xlarge | $0.80 | 4 | 16 | 24 |
g6e.2xlarge | $2.24 | 8 | 64 | 48 |
g6e.4xlarge | $3.00 | 16 | 128 | 48 |
g6e.8xlarge | $4.53 | 32 | 256 | 48 |
g6e.12xlarge | $10.49 | 48 | 384 | 192 |
g6e.16xlarge | $7.58 | 64 | 512 | 48 |
g6e.24xlarge | $15.07 | 96 | 768 | 192 |
g6e.48xlarge | $30.13 | 192 | 1536 | 384 |
g6e.xlarge | $1.86 | 4 | 32 | 48 |
gr6.4xlarge | $1.54 | 16 | 128 | 24 |
gr6.8xlarge | $2.45 | 32 | 256 | 24 |
p4d.24xlarge | $32.77 | 96 | 1152 | 320 |
p5.48xlarge | $98.32 | 192 | 2048 | 640 |
AWS EKS current generation EC2 GPU instances.
Azure AKS GPUs
Azure charges for the underlying VM as well as a per-cluster fee: $0.10 per cluster per hour for the Standard tier and $0.60 per cluster per hour for the Premium tier.
Azure GPU VM | On-Demand Hourly Cost | vCPU | Memory (GB) |
---|---|---|---|
NC40ads H100 v5 | $6.98 | 40 | 320 |
NC80adis H100 v5 | $13.96 | 80 | 640 |
NCC40ads H100 v5 | $6.98 | 40 | 320 |
NC6s v3 | $3.06 | 6 | 112 |
NC12s v3 | $6.12 | 12 | 224 |
NC24s v3 | $12.24 | 24 | 448 |
NC24rs v3 | $13.46 | 24 | 448 |
NC4as T4 v3 | $0.53 | 4 | 28 |
NC8as T4 v3 | $0.75 | 8 | 56 |
NC16as T4 v3 | $1.20 | 16 | 110 |
NC64as T4 v3 | $4.35 | 64 | 440 |
NC24ads A100 v4 | $3.67 | 24 | 220 |
NC48ads A100 v4 | $7.35 | 48 | 440 |
NC96ads A100 v4 | $14.69 | 96 | 880 |
ND96isr MI300X v5 | $48.00 | 96 | 1850 |
ND96isr H100 v5 | $98.32 | 96 | 1900 |
ND96amsr A100 v4 | $32.77 | 96 | 1900 |
ND96asr A100 v4 | $27.20 | 96 | 900 |
NG8ads V620 v1 | $0.64 | 8 | 16 |
NG16ads V620 v1 | $1.27 | 16 | 32 |
NG32ads V620 v1 | $2.54 | 32 | 64 |
NG32adms V620 v1 | $3.30 | 32 | 176 |
NV12s v3 | $1.14 | 12 | 112 |
NV24s v3 | $2.28 | 24 | 224 |
NV48s v3 | $4.56 | 48 | 448 |
NV4as v4 | Currently Unavailable | 4 | 14 |
NV8as v4 | Currently Unavailable | 8 | 28 |
NV16as v4 | Currently Unavailable | 16 | 56 |
NV32as v4 | Currently Unavailable | 32 | 112 |
NV6ads A10 v5 | $0.45 | 6 | 55 |
NV12ads A10 v5 | $0.91 | 12 | 110 |
NV18ads A10 v5 | $1.60 | 18 | 220 |
NV36ads A10 v5 | $3.20 | 36 | 440 |
NV36adms A10 v5 | $4.52 | 36 | 880 |
NV72ads A10 v5 | $6.52 | 72 | 880 |
Azure AKS current generation VM GPU instances.
GCP GKE GPUs
GCP charges for the underlying Compute Engine instances and a GKE cluster management fee: $0.10 per cluster per hour for the Standard edition and $0.00822 per vCPU per hour for the Enterprise edition.
You can choose from the following GPU VM types.
VM | GPU Memory | On-Demand Hourly Cost |
---|---|---|
a4-highgpu-8g | 1440 GB HBM3e | Currently Unavailable |
a3-ultragpu-8g | 1128 GB HBM3e | $84.81 |
a2-ultragpu-1g | 80 GB HBM3 | $5.07 |
a2-ultragpu-2g | 160 GB HBM3 | $10.14 |
a2-ultragpu-4g | 320 GB HBM3 | $20.28 |
a2-ultragpu-8g | 640 GB HBM3 | $40.55 |
g2-standard-4 | 24 GB GDDR6 | $0.71 |
g2-standard-8 | 24 GB GDDR6 | $0.85 |
g2-standard-12 | 24 GB GDDR6 | $1.00 |
g2-standard-16 | 24 GB GDDR6 | $1.15 |
g2-standard-24 | 48 GB GDDR6 | $2.00 |
g2-standard-32 | 24 GB GDDR6 | $1.73 |
g2-standard-48 | 96 GB GDDR6 | $4.00 |
g2-standard-96 | 192 GB GDDR6 | $8.00 |
GCP GKE current generation VM GPU instances.
Alternatively, you can attach GPUs to N1 VMs manually, as listed below (the N1 instance itself is billed separately, at roughly $0.03 per vCPU per hour).
GPU Attachment | GPU Memory | On-Demand Hourly Cost |
---|---|---|
NVIDIA T4 Virtual Workstation - 1 GPU | 16 GB GDDR6 | $0.55 |
NVIDIA T4 Virtual Workstation - 2 GPUs | 32 GB GDDR6 | $1.10 |
NVIDIA T4 Virtual Workstation - 4 GPUs | 64 GB GDDR6 | $2.20 |
NVIDIA P4 Virtual Workstation - 1 GPU | 8 GB GDDR5 | $0.80 |
NVIDIA P4 Virtual Workstation - 2 GPUs | 16 GB GDDR5 | $1.60 |
NVIDIA P4 Virtual Workstation - 4 GPUs | 32 GB GDDR5 | $3.20 |
NVIDIA P100 Virtual Workstation - 1 GPU | 16 GB HBM2 | $1.66 |
NVIDIA P100 Virtual Workstation - 2 GPUs | 32 GB HBM2 | $3.32 |
NVIDIA P100 Virtual Workstation - 4 GPUs | 64 GB HBM2 | $6.64 |
GCP GKE current generation N1 GPU attachments.
Optimizing GPU Costs in Kubernetes: Right-Sizing, Autoscaling, and Sharing Strategies
As the tables show, hourly GPU rates are high, but selecting the right size and running workloads efficiently can make a big difference.
Select the Right GPU Instance Size
Most GPU waste comes from choosing an instance that’s bigger than necessary.
A big reason is that GPU processing time is hard to measure accurately: the metrics are noisy and inconsistent. Memory usage, on the other hand, is stable and measurable, and it's often the best proxy for cost efficiency.
You can calculate idle memory by comparing the memory a workload actually uses to the memory allocated to it on the GPU.
idle_memory = total_allocated_memory - used_memory
Then, determine the number of GPUs needed to get a much better idea of which instance type is the best fit.
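As a quick illustration with hypothetical numbers: a job scheduled on a 24 GiB GPU that peaks at 9 GiB of used memory leaves
idle_memory = 24 GiB - 9 GiB = 15 GiB
idle, meaning over 60% of that GPU's memory (and the spend behind it) is wasted. From there, a rough estimate of how many GPUs a set of jobs actually needs is
required_gpus = ceil(total_used_memory / memory_per_gpu)
which you can compare against the instance types listed above.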
To do this automatically for certain NVIDIA workloads, the Vantage Kubernetes agent integrates with NVIDIA DCGM and automatically calculates GPU idle costs by attributing GPU memory usage per workload. This provides a granular view of how memory is consumed, helping you avoid over-provisioning and improve cost efficiency.
Autoscaling
Kubernetes autoscalers can dynamically consolidate workloads and scale GPU nodes up or down as needed. This ensures you’re only paying for the resources you actually need. Proper autoscaling configuration requires careful tuning of thresholds and scaling policies to avoid oscillation while still responding promptly to changing demands.
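As a minimal sketch (the Deployment name and image are hypothetical), a workload that requests nvidia.com/gpu will sit unschedulable when no GPU capacity is free, which prompts the cluster autoscaler to add a node to the GPU node group; when demand drops and pods are consolidated, idle GPU nodes are removed:

```yaml
# Hypothetical GPU inference Deployment. With the cluster autoscaler (or a
# provisioner like Karpenter) enabled, pending replicas that request
# nvidia.com/gpu trigger GPU node scale-up; idle nodes are later scaled down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: server
          image: registry.example.com/inference:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # whole-GPU request; see sharing options below
```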
Commit and Save
While On-Demand pricing gives flexibility, committing to compute usage via Savings Plans can significantly reduce the hourly instance costs for long-running GPU workloads.
For example, the g6.48xlarge Savings Plan rate is $8.69 per hour compared to the $13.35 On-Demand rate, 35% less.
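The arithmetic behind that figure, for reference:
savings = ($13.35 - $8.69) / $13.35 ≈ 0.35
so committing cuts roughly a third off the On-Demand rate for the same instance.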
GPU Sharing
Running multiple workloads on a single GPU is one of the most effective ways to reduce cost, especially for smaller models or batch inference jobs that don’t fully utilize GPU resources. There are a few strategies available, and all three main cloud providers list them as supported options.
Time Slicing/Sharing
Time slicing provides a simple approach to GPU sharing by allocating GPU time slices to different containers on a time-rotated basis. This works well for workloads that don’t require constant GPU access but still benefit from acceleration.
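As a sketch, assuming NVIDIA's k8s-device-plugin with a sharing configuration, a config like the following advertises each physical GPU as four schedulable replicas, so up to four pods can time-share one GPU:

```yaml
# NVIDIA device plugin sharing config (sketch): advertise each GPU as 4
# nvidia.com/gpu replicas. Pods still request nvidia.com/gpu: 1, but four
# such requests can land on one physical GPU on a time-rotated basis.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Keep in mind that time slicing provides no memory isolation: co-scheduled workloads share the GPU's memory and must fit within it together.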
NVIDIA MPS
NVIDIA CUDA Multi-Process Service (MPS) allows multiple processes to share the same GPU concurrently. In Kubernetes, this can be exposed as fractional GPU memory requests: instead of requesting a full GPU, you request only the memory your workload needs, for example with a resource request like nvidia.com/gpu-memory: 4Gi to claim a subset of memory on a shared GPU (the exact resource name depends on the device plugin or scheduler in use).
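A hypothetical pod spec under that model might look like this; the nvidia.com/gpu-memory resource name is illustrative and only exists on platforms whose device plugin or scheduler exposes it:

```yaml
# Hypothetical pod requesting a 4 GiB slice of a shared GPU rather than a
# whole device. The extended resource name is platform-dependent.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference  # hypothetical name
spec:
  containers:
    - name: model
      image: registry.example.com/model:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu-memory: 4Gi  # fractional request instead of a full GPU
```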
Multi-Instance GPUs
NVIDIA Multi-Instance GPU (MIG) is available on certain NVIDIA GPUs, such as the A100 Tensor Core GPU, and provides hardware-level partitioning of a GPU into multiple isolated instances. Each MIG instance has its own memory, cache, and compute cores, providing better isolation than software-based sharing methods.
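For example, with MIG enabled under the device plugin's mixed strategy, each MIG profile is exposed as its own extended resource; this hypothetical pod requests a single 1g.5gb slice of an A100:

```yaml
# Hypothetical pod pinned to one MIG instance. The resource name follows the
# nvidia.com/mig-<profile> pattern used by the mixed MIG strategy.
apiVersion: v1
kind: Pod
metadata:
  name: mig-batch-job  # hypothetical name
spec:
  containers:
    - name: worker
      image: registry.example.com/batch:latest  # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one hardware-isolated GPU slice
```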
Conclusion
By right-sizing GPU instances, enabling autoscaling, and leveraging sharing strategies like time slicing or MIGs, teams can significantly reduce GPU costs in Kubernetes. These optimizations not only lower spend but also improve resource efficiency.