Engineering teams run inference workloads on Kubernetes clusters for easier scaling and job management. But GPUs are among the most expensive resources you can run in the cloud, and in Kubernetes they are easy to set up and then forget about. Most teams have neither visibility into GPU costs nor cost-saving measures in place for GPU workloads, so they often overpay without knowing it.
Kubernetes GPU Cost Benefits
Managing GPUs with Kubernetes is easier than provisioning standalone GPU VMs and administering them manually, especially when workloads vary throughout the day or week. It also brings cost visibility and optimization benefits.
For one, you can configure your cluster to run multiple jobs on the same GPU. Instead of dedicating an entire GPU to a single workload, Kubernetes can help you maximize usage across jobs—saving you both capacity and money.
It also gives you deeper visibility into GPU usage. With the right tools in place, you can attribute GPU memory consumption to specific jobs, users, or namespaces. Combined with a scheduler that supports autoscaling, Kubernetes can dynamically consolidate workloads and scale GPU nodes up or down as needed.
In contrast, running standalone GPU VMs often leads to over-provisioning and idle capacity, without the insight or flexibility to fix it.
Kubernetes GPU Pricing for AWS, Azure, and GCP
Each of the three main cloud providers has its own GPU offerings for Kubernetes. They differ in GPU model availability, pricing, and regional coverage, but common challenges such as idle capacity, memory inefficiency, and lack of observability persist across all three platforms.
Cloud pricing is typically per instance-hour, not based on actual GPU usage. You’re billed for the full GPU allocation, regardless of utilization. Billing dimensions usually include: GPU instance size and family, number of GPUs per node, total node uptime, and any attached storage or networking resources.
Pricing below is listed for the US East region.
AWS EKS GPUs
Charges include the underlying EC2 instances, a cluster fee of $0.10 per cluster per hour (rising to $0.60 per hour for clusters on extended Kubernetes version support), and additional per-second charges when EKS Auto Mode is enabled.
EC2 GPU Instance | On-Demand Hourly Cost | vCPU | Memory (GiB) | GPU Memory (GiB) |
---|---|---|---|---|
g4dn.2xlarge | $0.75 | 8 | 32 | 16 |
g4dn.4xlarge | $1.20 | 16 | 64 | 16 |
g4dn.8xlarge | $2.18 | 32 | 128 | 16 |
g4dn.12xlarge | $3.91 | 48 | 192 | 64 |
g4dn.16xlarge | $4.35 | 64 | 256 | 16 |
g4dn.metal | $7.82 | 96 | 384 | 128 |
g4dn.xlarge | $0.53 | 4 | 16 | 16 |
g5.2xlarge | $1.21 | 8 | 32 | 24 |
g5.4xlarge | $1.62 | 16 | 64 | 24 |
g5.8xlarge | $2.45 | 32 | 128 | 24 |
g5.12xlarge | $5.67 | 48 | 192 | 96 |
g5.16xlarge | $4.10 | 64 | 256 | 24 |
g5.24xlarge | $8.14 | 96 | 384 | 96 |
g5.48xlarge | $16.29 | 192 | 768 | 192 |
g5.xlarge | $1.01 | 4 | 16 | 24 |
g5g.2xlarge | $0.56 | 8 | 16 | 16 |
g5g.4xlarge | $0.83 | 16 | 32 | 16 |
g5g.8xlarge | $1.37 | 32 | 64 | 16 |
g5g.16xlarge | $2.74 | 64 | 128 | 32 |
g5g.metal | $2.74 | 64 | 128 | 32 |
g5g.xlarge | $0.42 | 4 | 8 | 16 |
g6.2xlarge | $0.98 | 8 | 32 | 24 |
g6.4xlarge | $1.32 | 16 | 64 | 24 |
g6.8xlarge | $2.01 | 32 | 128 | 24 |
g6.12xlarge | $4.60 | 48 | 192 | 96 |
g6.16xlarge | $3.40 | 64 | 256 | 24 |
g6.24xlarge | $6.68 | 96 | 384 | 96 |
g6.48xlarge | $13.35 | 192 | 768 | 192 |
g6.xlarge | $0.80 | 4 | 16 | 24 |
g6e.2xlarge | $2.24 | 8 | 64 | 48 |
g6e.4xlarge | $3.00 | 16 | 128 | 48 |
g6e.8xlarge | $4.53 | 32 | 256 | 48 |
g6e.12xlarge | $10.49 | 48 | 384 | 192 |
g6e.16xlarge | $7.58 | 64 | 512 | 48 |
g6e.24xlarge | $15.07 | 96 | 768 | 192 |
g6e.48xlarge | $30.13 | 192 | 1536 | 384 |
g6e.xlarge | $1.86 | 4 | 32 | 48 |
gr6.4xlarge | $1.54 | 16 | 128 | 24 |
gr6.8xlarge | $2.45 | 32 | 256 | 24 |
p4d.24xlarge | $32.77 | 96 | 1152 | 320 |
p5.48xlarge | $98.32 | 192 | 2048 | 640 |
AWS EKS current generation EC2 GPU instances.
Azure AKS GPUs
Azure charges for the underlying VM as well as a per-cluster fee: $0.10 per cluster per hour for the Standard tier and $0.60 per cluster per hour for the Premium tier.
Azure GPU VM | On-Demand Hourly Cost | vCPU | Memory (GB) |
---|---|---|---|
NC40ads H100 v5 | $6.98 | 40 | 320 |
NC80adis H100 v5 | $13.96 | 80 | 640 |
NCC40ads H100 v5 | $6.98 | 40 | 320 |
NC6s v3 | $3.06 | 6 | 112 |
NC12s v3 | $6.12 | 12 | 224 |
NC24s v3 | $12.24 | 24 | 448 |
NC24rs v3 | $13.46 | 24 | 448 |
NC4as T4 v3 | $0.53 | 4 | 28 |
NC8as T4 v3 | $0.75 | 8 | 56 |
NC16as T4 v3 | $1.20 | 16 | 110 |
NC64as T4 v3 | $4.35 | 64 | 440 |
NC24ads A100 v4 | $3.67 | 24 | 220 |
NC48ads A100 v4 | $7.35 | 48 | 440 |
NC96ads A100 v4 | $14.69 | 96 | 880 |
ND96isr MI300X v5 | $48.00 | 96 | 1850 |
ND96isr H100 v5 | $98.32 | 96 | 1900 |
ND96amsr A100 v4 | $32.77 | 96 | 1900 |
ND96asr A100 v4 | $27.20 | 96 | 900 |
NG8ads V620 v1 | $0.64 | 8 | 16 |
NG16ads V620 v1 | $1.27 | 16 | 32 |
NG32ads V620 v1 | $2.54 | 32 | 64 |
NG32adms V620 v1 | $3.30 | 32 | 176 |
NV12s v3 | $1.14 | 12 | 112 |
NV24s v3 | $2.28 | 24 | 224 |
NV48s v3 | $4.56 | 48 | 448 |
NV4as v4 | Currently Unavailable | 4 | 14 |
NV8as v4 | Currently Unavailable | 8 | 28 |
NV16as v4 | Currently Unavailable | 16 | 56 |
NV32as v4 | Currently Unavailable | 32 | 112 |
NV6ads A10 v5 | $0.45 | 6 | 55 |
NV12ads A10 v5 | $0.91 | 12 | 110 |
NV18ads A10 v5 | $1.60 | 18 | 220 |
NV36ads A10 v5 | $3.20 | 36 | 440 |
NV36adms A10 v5 | $4.52 | 36 | 880 |
NV72ads A10 v5 | $6.52 | 72 | 880 |
Azure AKS current generation VM GPU instances.
GCP GKE GPUs
GCP charges for the underlying Compute Engine instances and a GKE cluster management fee: $0.10 per cluster per hour for the Standard edition and $0.00822 per vCPU per hour for the Enterprise edition.
You can choose from the following GPU VM types.
VM | GPU Memory | On-Demand Hourly Cost |
---|---|---|
a4-highgpu-8g | 1440 GB HBM3e | Currently Unavailable |
a3-ultragpu-8g | 1128 GB HBM3e | $84.81 |
a2-ultragpu-1g | 80 GB HBM3 | $5.07 |
a2-ultragpu-2g | 160 GB HBM3 | $10.14 |
a2-ultragpu-4g | 320 GB HBM3 | $20.28 |
a2-ultragpu-8g | 640 GB HBM3 | $40.55 |
g2-standard-4 | 24 GB GDDR6 | $0.71 |
g2-standard-8 | 24 GB GDDR6 | $0.85 |
g2-standard-12 | 24 GB GDDR6 | $1.00 |
g2-standard-16 | 24 GB GDDR6 | $1.15 |
g2-standard-24 | 48 GB GDDR6 | $2.00 |
g2-standard-32 | 24 GB GDDR6 | $1.73 |
g2-standard-48 | 96 GB GDDR6 | $4.00 |
g2-standard-96 | 192 GB GDDR6 | $8.00 |
GCP GKE current generation VM GPU instances.
Alternatively, you can attach GPUs to N1 VMs manually, as listed below (the N1 instance itself is billed separately, at roughly $0.03 per vCPU per hour).
GPU Attachment | GPU Memory | On-Demand Hourly Cost |
---|---|---|
NVIDIA T4 Virtual Workstation - 1 GPU | 16 GB GDDR6 | $0.55 |
NVIDIA T4 Virtual Workstation - 2 GPUs | 32 GB GDDR6 | $1.10 |
NVIDIA T4 Virtual Workstation - 4 GPUs | 64 GB GDDR6 | $2.20 |
NVIDIA P4 Virtual Workstation - 1 GPU | 8 GB GDDR5 | $0.80 |
NVIDIA P4 Virtual Workstation - 2 GPUs | 16 GB GDDR5 | $1.60 |
NVIDIA P4 Virtual Workstation - 4 GPUs | 32 GB GDDR5 | $3.20 |
NVIDIA P100 Virtual Workstation - 1 GPU | 16 GB HBM2 | $1.66 |
NVIDIA P100 Virtual Workstation - 2 GPUs | 32 GB HBM2 | $3.32 |
NVIDIA P100 Virtual Workstation - 4 GPUs | 64 GB HBM2 | $6.64 |
GCP GKE current generation N1 GPU attachments.
Optimizing GPU Costs in Kubernetes: Right-Sizing, Autoscaling, and Sharing Strategies
As the tables show, hourly GPU rates are high, but selecting the right size and running workloads efficiently can make a big difference.
Select the Right GPU Instance Size
Most GPU waste comes from choosing an instance that’s bigger than necessary.
A big reason is that GPU processing time is hard to measure accurately: the metrics are noisy and inconsistent. Memory usage, on the other hand, is stable and measurable, and it's often the best proxy for cost efficiency.
You can calculate idle memory by comparing the memory a workload actually uses to the memory allocated to it on the GPU.
idle_memory = total_allocated_memory - used_memory
Then, determine the number of GPUs needed to get a much better idea of which instance type is the best fit.
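As a quick illustration with hypothetical numbers: a job scheduled on a 24 GiB GPU that peaks at 9 GiB of used memory leaves
idle_memory = 24 GiB - 9 GiB = 15 GiB
idle, meaning over 60% of that GPU's memory (and the spend behind it) is wasted. From there, a rough estimate of how many GPUs a set of jobs actually needs is
required_gpus = ceil(total_used_memory / memory_per_gpu)
which you can compare against the instance types listed above.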
To do this automatically for certain NVIDIA workloads, the Vantage Kubernetes agent integrates with NVIDIA DCGM and automatically calculates GPU idle costs by attributing GPU memory usage per workload. This provides a granular view of how memory is consumed, helping you avoid over-provisioning and improve cost efficiency.
Autoscaling
Kubernetes autoscalers can dynamically consolidate workloads and scale GPU nodes up or down as needed. This ensures you’re only paying for the resources you actually need. Proper autoscaling configuration requires careful tuning of thresholds and scaling policies to avoid oscillation while still responding promptly to changing demands.
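As a minimal sketch (the Deployment name and image are hypothetical), a workload that requests nvidia.com/gpu will sit unschedulable when no GPU capacity is free, which prompts the cluster autoscaler to add a node to the GPU node group; when demand drops and pods are consolidated, idle GPU nodes are removed:

```yaml
# Hypothetical GPU inference Deployment. With the cluster autoscaler (or a
# provisioner like Karpenter) enabled, pending replicas that request
# nvidia.com/gpu trigger GPU node scale-up; idle nodes are later scaled down.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: server
          image: registry.example.com/inference:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1  # whole-GPU request; see sharing options below
```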
Commit and Save
While On-Demand pricing gives flexibility, committing to compute usage via Savings Plans can significantly reduce the hourly instance costs for long-running GPU workloads.
For example, the g6.48xlarge Savings Plan rate is $8.69 per hour compared to the $13.35 On-Demand rate, 35% less.
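The arithmetic behind that figure, for reference:
savings = ($13.35 - $8.69) / $13.35 ≈ 0.35
so committing cuts roughly a third off the On-Demand rate for the same instance.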
GPU Sharing
Running multiple workloads on a single GPU is one of the most effective ways to reduce cost, especially for smaller models or batch inference jobs that don’t fully utilize GPU resources. There are a few strategies available, and all three main cloud providers list them as supported options.
Time Slicing/Sharing
Time slicing provides a simple approach to GPU sharing by allocating GPU time slices to different containers on a time-rotated basis. This works well for workloads that don’t require constant GPU access but still benefit from acceleration.
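As a sketch, assuming NVIDIA's k8s-device-plugin with a sharing configuration, a config like the following advertises each physical GPU as four schedulable replicas, so up to four pods can time-share one GPU:

```yaml
# NVIDIA device plugin sharing config (sketch): advertise each GPU as 4
# nvidia.com/gpu replicas. Pods still request nvidia.com/gpu: 1, but four
# such requests can land on one physical GPU on a time-rotated basis.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Keep in mind that time slicing provides no memory isolation: co-scheduled workloads share the GPU's memory and must fit within it together.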
NVIDIA MPS
NVIDIA CUDA Multi-Process Service (MPS) allows multiple processes to share the same GPU concurrently. In Kubernetes, this can be exposed as fractional GPU memory requests: instead of requesting a full GPU, you request only the memory your workload needs, for example with a resource request like nvidia.com/gpu-memory: 4Gi to claim a subset of memory on a shared GPU (the exact resource name depends on the device plugin or scheduler in use).
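A hypothetical pod spec under that model might look like this; the nvidia.com/gpu-memory resource name is illustrative and only exists on platforms whose device plugin or scheduler exposes it:

```yaml
# Hypothetical pod requesting a 4 GiB slice of a shared GPU rather than a
# whole device. The extended resource name is platform-dependent.
apiVersion: v1
kind: Pod
metadata:
  name: small-inference  # hypothetical name
spec:
  containers:
    - name: model
      image: registry.example.com/model:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu-memory: 4Gi  # fractional request instead of a full GPU
```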
Multi-Instance GPUs
NVIDIA Multi-Instance GPU (MIG) is available on certain NVIDIA GPUs, such as the A100 Tensor Core GPU, and provides hardware-level partitioning of a GPU into multiple isolated instances. Each MIG instance has its own memory, cache, and compute cores, providing better isolation than software-based sharing methods.
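For example, with MIG enabled under the device plugin's mixed strategy, each MIG profile is exposed as its own extended resource; this hypothetical pod requests a single 1g.5gb slice of an A100:

```yaml
# Hypothetical pod pinned to one MIG instance. The resource name follows the
# nvidia.com/mig-<profile> pattern used by the mixed MIG strategy.
apiVersion: v1
kind: Pod
metadata:
  name: mig-batch-job  # hypothetical name
spec:
  containers:
    - name: worker
      image: registry.example.com/batch:latest  # placeholder image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one hardware-isolated GPU slice
```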
Conclusion
By right-sizing GPU instances, enabling autoscaling, and leveraging sharing strategies like time slicing or MIGs, teams can significantly reduce GPU costs in Kubernetes. These optimizations not only lower spend but also improve resource efficiency.