Why the default scheduler is wrong for AI
Kubernetes was designed for stateless, horizontally scalable web services. Its default scheduler optimises for placing individual pods on nodes that have room, treats each pod as independent, and evicts opportunistically. That behaviour is appropriate for microservices. It is almost entirely wrong for AI workloads.
An AI training job is not a pod – it is a gang of pods that must all start, must run co-located for network performance, must share a tight topology to keep NVLink and InfiniBand utilised, and must checkpoint before it can be safely preempted. Scheduled with default policies, training jobs partially start, sit with one pod waiting for capacity, time out, restart, and burn money. Inference workloads have a different but related problem: latency-sensitive traffic gets scheduled next to batch jobs competing for the same GPU, and the p99 latency page that fires at 3am is usually caused by the scheduler, not the model.
In 2026, running AI at scale on Kubernetes means replacing or supplementing the default scheduler. Which scheduler you add and how you configure it is the difference between a cluster at 40% utilisation and one at 75%.
Gang scheduling: Kueue and Volcano
Gang scheduling – guaranteeing that either all pods of a job get their resources simultaneously, or none do – is the single most important capability for training workloads. Two open-source solutions have converged as the defaults in 2026.
- Kueue – Kubernetes-native, lives in the kubernetes-sigs ecosystem, designed to sit on top of existing clusters. Queues, workload admission, fair sharing across teams, resource flavours. Increasingly the default choice for organisations that want a Kubernetes-first story; a minimal setup is sketched after this list.
- Volcano – CNCF-hosted, batch-scheduling focused, strong support for gang scheduling and job-level policies. Deeper feature set than Kueue for pure batch workloads, somewhat heavier to operate.
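As a concrete starting point, a minimal Kueue setup has three objects: a ResourceFlavor describing the GPU nodes, a ClusterQueue holding the quota, and a namespaced LocalQueue that teams submit to. Kueue holds a workload suspended until its entire resource request can be admitted at once, which is the all-or-nothing behaviour training needs. The names, quotas, and node label below are placeholders; field names follow the kueue.x-k8s.io/v1beta1 API, so verify them against the Kueue release you actually run.

```yaml
# Minimal Kueue sketch: flavour, cluster-level quota, and a team-facing queue.
# Names, quotas, and the node label are placeholders; verify field names
# against the kueue.x-k8s.io/v1beta1 API of your Kueue release.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100                              # hypothetical flavour name
spec:
  nodeLabels:
    gpu.example.com/type: a100            # placeholder node label
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: training
spec:
  namespaceSelector: {}                   # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "cpu"
        nominalQuota: 512
      - name: "memory"
        nominalQuota: 2Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 32                  # example quota; size to your cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: training-queue
  namespace: ml-team                      # placeholder team namespace
spec:
  clusterQueue: training
```

Volcano covers the same ground with its own Job and PodGroup objects plus a replacement scheduler; the trade-off is a richer batch feature set against operating a second scheduler in the cluster.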
Topology-aware placement and networking
Large-model training is bottlenecked by inter-GPU communication. NCCL, NVLink, and RDMA-over-InfiniBand all assume specific topologies. The default Kubernetes scheduler has no awareness of whether two pods are on NVLink-connected GPUs, same-switch nodes, or across an oversubscribed fabric.
The fix is topology-aware scheduling with explicit hints. On NVIDIA GPU clusters, Topology Manager plus the NVIDIA GPU operator now exposes topology labels that let schedulers co-locate tightly coupled jobs. For multi-node training, the combination of Kueue's topology-aware placement policies and the GPU operator's device-plugin reporting has become reliable enough to count on. For inference at scale, Multi-Instance GPU (MIG) partitioning is the underused control that lets a single H100 or A100 host many smaller inference workloads at predictable latency – provided your scheduler is MIG-aware.
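To make the MIG point concrete, this is what a MIG-backed inference workload looks like from the pod's side, assuming the GPU operator's device plugin runs with the mixed MIG strategy so that slices show up as named extended resources. The resource name below is one example profile; substitute whatever your nodes actually advertise.

```yaml
# Illustrative Deployment pinned to a single MIG slice rather than a whole GPU.
# The extended-resource name assumes the "mixed" MIG strategy and that this
# profile has been carved on the node; adjust it to what your cluster exposes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-model                       # placeholder name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: small-model
  template:
    metadata:
      labels:
        app: small-model
    spec:
      containers:
      - name: server
        image: registry.example.com/small-model:latest   # placeholder image
        resources:
          limits:
            nvidia.com/mig-1g.10gb: 1     # one slice of an 80GB GPU, not the whole card
```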
Inference scheduling is different
The scheduling pattern that works for training fails for inference. Training wants co-location, gang scheduling, and tolerance for long queue times. Inference wants autoscaling, horizontal distribution, latency guarantees, and resistance to preemption.
The two patterns we see working reliably in 2026: KServe on Knative for HTTP inference workloads, providing scale-to-zero, scale-from-zero, and autoscaling triggered by request concurrency or GPU utilisation; and Ray Serve for more complex inference topologies (multi-model routing, composition, batching). Both integrate with the GPU operator for device reporting, and both coexist with Kueue if you want a single cluster that serves both training and inference – which most enterprise teams end up wanting.
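For the HTTP path, the shape of a KServe InferenceService is roughly as follows. The model location, runtime, and scaling numbers are placeholders, and the replica and concurrency fields are the ones we believe sit in KServe's predictor spec; double-check them against the KServe release you deploy.

```yaml
# Illustrative KServe InferenceService for a GPU-backed HTTP model server.
# Model URI, names, and scaling values are placeholders; verify field names
# against your KServe version (serving.kserve.io/v1beta1 shown).
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: reranker                          # placeholder name
spec:
  predictor:
    minReplicas: 1                        # keep one replica warm for latency
    maxReplicas: 8
    containerConcurrency: 4               # in-flight requests per replica before scaling out
    model:
      modelFormat:
        name: huggingface                 # assumes a serving runtime for this format is installed
      storageUri: s3://models/reranker/v3 # placeholder model location
      resources:
        limits:
          nvidia.com/gpu: 1
```

Ray Serve reaches the same cluster through KubeRay's RayService objects, which is where multi-model routing and request batching live.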
Quotas, fair sharing, and the politics of GPU allocation
The operational problem that most surprises infrastructure teams is not technical. It is political: who gets GPUs, when, and who adjudicates. The default of first-come-first-served produces exactly the outcomes you would expect – one team grabbing the entire cluster, others waiting, morale-eroding disputes at sprint planning.
Kueue's ClusterQueue and LocalQueue primitives let you carve the cluster into named queues with guaranteed capacity, borrowable capacity, and priority classes. A useful starting topology for a mid-sized AI organisation: one guaranteed queue per team (covering its 50th-percentile load), one borrowable pool (sized to absorb bursts), and one shared low-priority queue for experiments. Preemption policies matter: allow preemption only of the low-priority queue's workloads, so casual experiments can be evicted to make room but production training runs never are.
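Sketched in Kueue terms, that topology is a cohort of ClusterQueues: each team queue has a guaranteed nominalQuota, idle capacity can be borrowed within the cohort, and the experiments queue runs mostly on borrowed capacity that the team queues can reclaim. Names, quotas, and preemption values are placeholders; confirm the fields against your Kueue version.

```yaml
# Illustrative fair-sharing layout: guaranteed per-team quota plus a
# low-priority experiments queue, all in one cohort so spare capacity
# can be borrowed. Names and numbers are placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohort: gpu-pool                        # queues in a cohort lend and borrow quota
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100                          # hypothetical ResourceFlavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16                  # team A's guaranteed share (~p50 load)
        borrowingLimit: 16                # burst headroom on top of the guarantee
  preemption:
    reclaimWithinCohort: Any              # evict borrowers (experiments) to reclaim the guarantee
    withinClusterQueue: Never             # never preempt team A's own running jobs
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: experiments
spec:
  cohort: gpu-pool
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 4                   # small guarantee; everything above this is borrowed
        borrowingLimit: 28                # soaks up idle capacity, gets reclaimed first
```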
Observability is half the battle
You cannot tune a scheduler you cannot measure. Three metrics belong on the wall of any GPU cluster in 2026.
- GPU utilisation (DCGM_FI_DEV_GPU_UTIL from dcgm-exporter; MFU / model FLOPs utilisation where available). The difference between "GPU allocated" and "GPU used" is where money hides.
- Queue wait time per queue and per priority. A spike in wait time for the production queue is an early indicator of a misconfigured fair-share policy.
- Preemption count per queue. A cluster that is preempting more than 5% of jobs is either under-provisioned or misconfigured; either way, it is not in a good place.
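For the first of those metrics, a starting point as Prometheus rules over dcgm-exporter looks like the sketch below. The threshold and duration are placeholders; the Kueue-side wait-time and preemption metrics vary by release, so wire those in from whatever your Kueue version actually exports rather than from memory.

```yaml
# Illustrative Prometheus rules over dcgm-exporter's DCGM_FI_DEV_GPU_UTIL.
# Threshold, duration, and rule names are placeholders to adapt.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilisation
spec:
  groups:
  - name: gpu-utilisation
    rules:
    - record: cluster:gpu_utilisation:avg       # fleet-wide average GPU busy percentage
      expr: avg(DCGM_FI_DEV_GPU_UTIL)
    - alert: GPUFleetUnderutilised
      expr: avg(DCGM_FI_DEV_GPU_UTIL) < 40      # allocated-but-idle GPUs are where money hides
      for: 6h
      labels:
        severity: warning
      annotations:
        summary: "Average GPU utilisation has been below 40% for 6 hours"
```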
The rollout plan that works
For teams that have grown organically and are now hitting the scheduling wall, a pragmatic 90-day sequence lands cleanly most of the time.
- Weeks 1-2: install the NVIDIA GPU operator if you have not already, and get dcgm-exporter metrics flowing.
- Weeks 3-4: stand up Kueue, migrate one team's workload to a queue, and leave everyone else on default scheduling as a baseline (a sketch of the submission side follows this list).
- Weeks 5-8: expand to the remaining teams, introduce priority classes, and enable preemption for the lowest tier.
- Weeks 9-12: add topology-aware placement for multi-GPU training, and set up dashboards on wait time and utilisation per queue.
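The weeks 3-4 step is mostly mechanical from the workload's side: the migrated team's Jobs pick up the queue label and start suspended, and Kueue unsuspends them once the queue admits them. A sketch with placeholder names, matching the queue objects earlier in this piece:

```yaml
# Illustrative training Job submitted through a Kueue LocalQueue.
# Queue name, image, and sizes are placeholders. The Job is created
# suspended; Kueue flips suspend to false once quota is available.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-run-001                     # placeholder name
  namespace: ml-team
  labels:
    kueue.x-k8s.io/queue-name: training-queue
spec:
  suspend: true                           # let Kueue decide when this starts
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 8             # 4 pods x 8 GPUs = one 32-GPU job
```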
The win at the end of that cycle is usually substantial: utilisation up 15-25 points, queue-wait complaints down sharply, and – most important – the cluster stops being the bottleneck for the ML teams that pay for it. The organisations that treat this as "platform plumbing" tend to compound the gains over subsequent quarters. The ones that treat it as a one-time project slowly drift back to the default.