Why the default scheduler is wrong for AI
Kubernetes was designed for stateless, horizontally scalable web services. Its default scheduler optimises for placing individual pods on nodes that have room, treats each pod as independent, and evicts opportunistically. That behaviour is appropriate for microservices. It is almost entirely wrong for AI workloads.
An AI training job is not a pod – it is a gang of pods that must all start, must run co-located for network performance, must share a tight topology to keep NVLink and InfiniBand utilised, and must checkpoint before it can be safely preempted. Scheduled with default policies, training jobs partially start, sit with one pod waiting for capacity, time out, restart, and burn money. Inference workloads have a different but related problem: latency-sensitive traffic gets scheduled next to batch jobs competing for the same GPU, and the p99 latency page that fires at 3am is usually caused by the scheduler, not the model.
In 2026, running AI at scale on Kubernetes means replacing or supplementing the default scheduler. Which scheduler you add and how you configure it is the difference between a cluster at 40% utilisation and one at 75%.
Gang scheduling: Kueue and Volcano
Gang scheduling – guaranteeing that either all pods of a job get their resources simultaneously, or none do – is the single most important capability for training workloads. Two open-source solutions have converged as the defaults in 2026.
- Kueue – Kubernetes-native, lives in the kubernetes-sigs ecosystem, designed to sit on top of existing clusters. Queues, workload admission, fair sharing across teams, resource flavours. Increasingly the default choice for organisations that want a Kubernetes-first story; a minimal setup is sketched after this list.
- Volcano – CNCF-hosted, batch-scheduling focused, strong support for gang scheduling and job-level policies. Deeper feature set than Kueue for pure batch workloads, somewhat heavier to operate.
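As a concrete starting point, a minimal Kueue setup has three objects: a ResourceFlavor describing the GPU nodes, a ClusterQueue holding the quota, and a namespaced LocalQueue that teams submit to. Kueue holds a workload suspended until its entire resource request can be admitted at once, which is the all-or-nothing behaviour training needs. The names, quotas, and node label below are placeholders; field names follow the kueue.x-k8s.io/v1beta1 API, so verify them against the Kueue release you actually run.

```yaml
# Minimal Kueue sketch: flavour, cluster-level quota, and a team-facing queue.
# Names, quotas, and the node label are placeholders; verify field names
# against the kueue.x-k8s.io/v1beta1 API of your Kueue release.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100                              # hypothetical flavour name
spec:
  nodeLabels:
    gpu.example.com/type: a100            # placeholder node label
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: training
spec:
  namespaceSelector: {}                   # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "cpu"
        nominalQuota: 512
      - name: "memory"
        nominalQuota: 2Ti
      - name: "nvidia.com/gpu"
        nominalQuota: 32                  # example quota; size to your cluster
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: training-queue
  namespace: ml-team                      # placeholder team namespace
spec:
  clusterQueue: training
```

Volcano covers the same ground with its own Job and PodGroup objects plus a replacement scheduler; the trade-off is a richer batch feature set against operating a second scheduler in the cluster.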
Topology-aware placement and networking
Large-model training is bottlenecked by inter-GPU communication. NCCL, NVLink, and RDMA-over-InfiniBand all assume specific topologies. The default Kubernetes scheduler has no awareness of whether two pods are on NVLink-connected GPUs, same-switch nodes, or across an oversubscribed fabric.
The fix is topology-aware scheduling with explicit hints. On NVIDIA GPU clusters, Topology Manager plus the NVIDIA GPU operator now exposes topology labels that let schedulers co-locate tightly coupled jobs. For multi-node training, the combination of Kueue's topology-aware placement policies and the GPU operator's device-plugin reporting has become reliable enough to count on. For inference at scale, Multi-Instance GPU (MIG) partitioning is the underused control that lets a single H100 or A100 host many smaller inference workloads at predictable latency – provided your scheduler is MIG-aware.
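To make the MIG point concrete, this is what a MIG-backed inference workload looks like from the pod's side, assuming the GPU operator's device plugin runs with the mixed MIG strategy so that slices show up as named extended resources. The resource name below is one example profile; substitute whatever your nodes actually advertise.

```yaml
# Illustrative Deployment pinned to a single MIG slice rather than a whole GPU.
# The extended-resource name assumes the "mixed" MIG strategy and that this
# profile has been carved on the node; adjust it to what your cluster exposes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-model                       # placeholder name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: small-model
  template:
    metadata:
      labels:
        app: small-model
    spec:
      containers:
      - name: server
        image: registry.example.com/small-model:latest   # placeholder image
        resources:
          limits:
            nvidia.com/mig-1g.10gb: 1     # one slice of an 80GB GPU, not the whole card
```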
Inference scheduling is different
The scheduling pattern that works for training fails for inference. Training wants co-location, gang scheduling, and tolerance for long queue times. Inference wants autoscaling, horizontal distribution, latency guarantees, and resistance to preemption.
The two patterns we see working reliably in 2026: KServe on Knative for HTTP inference workloads, providing scale-to-zero, scale-from-zero, and autoscaling triggered by request concurrency or GPU utilisation; and Ray Serve for more complex inference topologies (multi-model routing, composition, batching). Both integrate with the GPU operator for device reporting, and both coexist with Kueue if you want a single cluster that serves both training and inference – which most enterprise teams end up wanting.
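For the HTTP path, the shape of a KServe InferenceService is roughly as follows. The model location, runtime, and scaling numbers are placeholders, and the replica and concurrency fields are the ones we believe sit in KServe's predictor spec; double-check them against the KServe release you deploy.

```yaml
# Illustrative KServe InferenceService for a GPU-backed HTTP model server.
# Model URI, names, and scaling values are placeholders; verify field names
# against your KServe version (serving.kserve.io/v1beta1 shown).
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: reranker                          # placeholder name
spec:
  predictor:
    minReplicas: 1                        # keep one replica warm for latency
    maxReplicas: 8
    containerConcurrency: 4               # in-flight requests per replica before scaling out
    model:
      modelFormat:
        name: huggingface                 # assumes a serving runtime for this format is installed
      storageUri: s3://models/reranker/v3 # placeholder model location
      resources:
        limits:
          nvidia.com/gpu: 1
```

Ray Serve reaches the same cluster through KubeRay's RayService objects, which is where multi-model routing and request batching live.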
Quotas, fair sharing, and the politics of GPU allocation
The operational problem that most surprises infrastructure teams is not technical. It is political: who gets GPUs, when, and who adjudicates. The default of first-come-first-served produces exactly the outcomes you would expect – one team grabbing the entire cluster, others waiting, morale-eroding disputes at sprint planning.
Kueue's ClusterQueue and LocalQueue primitives let you carve the cluster into named queues with guaranteed capacity, borrowable capacity, and priority classes. A useful starting topology for a mid-sized AI organisation: one guaranteed queue per team (covering its 50th-percentile load), one borrowable pool (sized to absorb bursts), and one shared low-priority queue for experiments. Preemption policies matter: allow preemption only of the low-priority queue's workloads, so casual experiments can be evicted to make room but production training runs never are.
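Sketched in Kueue terms, that topology is a cohort of ClusterQueues: each team queue has a guaranteed nominalQuota, idle capacity can be borrowed within the cohort, and the experiments queue runs mostly on borrowed capacity that the team queues can reclaim. Names, quotas, and preemption values are placeholders; confirm the fields against your Kueue version.

```yaml
# Illustrative fair-sharing layout: guaranteed per-team quota plus a
# low-priority experiments queue, all in one cohort so spare capacity
# can be borrowed. Names and numbers are placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  cohort: gpu-pool                        # queues in a cohort lend and borrow quota
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100                          # hypothetical ResourceFlavor
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16                  # team A's guaranteed share (~p50 load)
        borrowingLimit: 16                # burst headroom on top of the guarantee
  preemption:
    reclaimWithinCohort: Any              # evict borrowers (experiments) to reclaim the guarantee
    withinClusterQueue: Never             # never preempt team A's own running jobs
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: experiments
spec:
  cohort: gpu-pool
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 4                   # small guarantee; everything above this is borrowed
        borrowingLimit: 28                # soaks up idle capacity, gets reclaimed first
```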
Observability is half the battle
You cannot tune a scheduler you cannot measure. Three metrics belong on the wall of any GPU cluster in 2026.
- GPU utilisation (DCGM_FI_DEV_GPU_UTIL from dcgm-exporter; MFU / model FLOPs utilisation where available). The difference between "GPU allocated" and "GPU used" is where money hides.
- Queue wait time per queue and per priority. A spike in wait time for the production queue is an early indicator of a misconfigured fair-share policy.
- Preemption count per queue. A cluster that is preempting more than 5% of jobs is either under-provisioned or misconfigured; either way, it is not in a good place.
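For the first of those metrics, a starting point as Prometheus rules over dcgm-exporter looks like the sketch below. The threshold and duration are placeholders; the Kueue-side wait-time and preemption metrics vary by release, so wire those in from whatever your Kueue version actually exports rather than from memory.

```yaml
# Illustrative Prometheus rules over dcgm-exporter's DCGM_FI_DEV_GPU_UTIL.
# Threshold, duration, and rule names are placeholders to adapt.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilisation
spec:
  groups:
  - name: gpu-utilisation
    rules:
    - record: cluster:gpu_utilisation:avg       # fleet-wide average GPU busy percentage
      expr: avg(DCGM_FI_DEV_GPU_UTIL)
    - alert: GPUFleetUnderutilised
      expr: avg(DCGM_FI_DEV_GPU_UTIL) < 40      # allocated-but-idle GPUs are where money hides
      for: 6h
      labels:
        severity: warning
      annotations:
        summary: "Average GPU utilisation has been below 40% for 6 hours"
```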
The rollout plan that works
For teams that have grown organically and are now hitting the scheduling wall, a pragmatic 90-day sequence lands cleanly most of the time.
- Weeks 1-2: install the NVIDIA GPU operator if you have not already, and get dcgm-exporter metrics flowing.
- Weeks 3-4: stand up Kueue, migrate one team's workload to a queue, and leave everyone else on default scheduling as a baseline (a sketch of the submission side follows this list).
- Weeks 5-8: expand to the remaining teams, introduce priority classes, and enable preemption for the lowest tier.
- Weeks 9-12: add topology-aware placement for multi-GPU training, and set up dashboards on wait time and utilisation per queue.
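The weeks 3-4 step is mostly mechanical from the workload's side: the migrated team's Jobs pick up the queue label and start suspended, and Kueue unsuspends them once the queue admits them. A sketch with placeholder names, matching the queue objects earlier in this piece:

```yaml
# Illustrative training Job submitted through a Kueue LocalQueue.
# Queue name, image, and sizes are placeholders. The Job is created
# suspended; Kueue flips suspend to false once quota is available.
apiVersion: batch/v1
kind: Job
metadata:
  name: train-run-001                     # placeholder name
  namespace: ml-team
  labels:
    kueue.x-k8s.io/queue-name: training-queue
spec:
  suspend: true                           # let Kueue decide when this starts
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 8             # 4 pods x 8 GPUs = one 32-GPU job
```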
The win at the end of that cycle is usually substantial: utilisation up 15-25 points, queue-wait complaints down sharply, and – most important – the cluster stops being the bottleneck for the ML teams that pay for it. The organisations that treat this as "platform plumbing" tend to compound the gains over subsequent quarters. The ones that treat it as a one-time project slowly drift back to the default.