Workload Controllers — Deployment, StatefulSet, DaemonSet for AI

Three controller types, three workload shapes, three different production failure modes. Learn when to use a Deployment for inference, a StatefulSet for distributed training, and a DaemonSet for per-node GPU infrastructure — and how to spot when someone picked the wrong one.

35 min · 5 steps · 2 domains · Intermediate · nca-aiio · ncp-aii · ncp-aio

What you'll learn

  1. The controller pattern — why bare pods aren't production
    A bare Pod has no controller behind it. The kubelet runs it on a single node and, under the default restartPolicy: Always, restarts a container that exits in place on that node. With restartPolicy: Never the pod instead ends in Succeeded (exit 0) or Failed (non-zero) and stays there. Either way, the *pod* itself doesn't come back if it gets evicted, deleted, or its node disappears. Nothing is reconciling "I want one of these running", so once the pod object is gone, nothing recreates it.
  2. Deployment — the inference shape
    A Deployment is the right controller for stateless, horizontally-scalable workloads. The canonical AI use case is inference: each replica is interchangeable, the Service load-balances requests across them, and a request can land on any pod. There's no notion of "request must go to replica 0" — that would defeat the point of horizontal scaling.
  3. StatefulSet — the distributed training shape
    A StatefulSet is a controller for workloads where each replica has an *identity* — a stable name, a stable network address, and (typically) its own persistent volume. The canonical AI use case is distributed training: PyTorch DDP, NCCL all-reduce, Horovod, parameter-server architectures. Each rank needs to find its peers by hostname; rank 0 is the coordinator and starts first; per-rank checkpoints don't share storage.
  4. DaemonSet — the per-node infrastructure shape
    A DaemonSet runs exactly one pod per matching node. There's no replicas field — the count comes from the cluster's nodes that satisfy the selector. The classic AI use cases are all node-scoped infrastructure:
  5. Triage Day — three controllers, three controller-specific failures
    The lab platform has just deployed three workload controllers — one of each kind — that are *all* broken in different ways. Each is broken at a layer specific to its controller type:

Prerequisites

  • Completed `nca-aiio-resource-requests-limits` and `nca-aiio-priority-preemption` (or comfortable with pod requests, QoS, PriorityClass)
  • Familiar with `kubectl get`, `describe`, and reading owner-references / ReplicaSet relationships

Exam domains covered

AI Infrastructure & Operations · Workload Management

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

Deployment · StatefulSet · DaemonSet · Workload Management · Controllers · GPU Scheduling · NCA-AIIO · Kubernetes · Triage

What you'll learn

Three of Kubernetes' built-in workload controllers account for ~95% of production AI workloads: Deployment for stateless inference, StatefulSet for distributed training pods that need stable identity and ordered start, and DaemonSet for per-node infrastructure (DCGM exporters, GPU monitors, node-local model caches). Each one is a different shape with a different reconciliation contract, and the workload's requirements determine the right shape, not the engineer's preference. Picking the wrong one is among the most common production mistakes; symptoms range from "my distributed trainer can't find its peers" to "my DaemonSet is missing nodes."

This lab teaches the three controllers from the angle that matters most for an NCA-AIIO platform engineer: which AI workload shape does each one fit? You'll author a Deployment for a stateless inference service, a StatefulSet for a 3-replica distributed-training pool with stable hostnames and per-pod PVCs, and a DaemonSet for a GPU-node monitoring agent. You finish with a Triage Day where each of the three controllers is broken in a controller-specific way you'd actually see in production, and you fix each by reading the cluster's response.

Frequently asked questions

Why use a StatefulSet instead of a Deployment for distributed training?

Three reasons unique to StatefulSet. (1) Stable network identity: each pod gets a predictable DNS name like trainer-0.trainer-svc.default.svc.cluster.local (when paired with a headless Service). Distributed training stacks (PyTorch DDP, Horovod, anything built on NCCL all-reduce) need every worker to discover every other worker by hostname; random Deployment pod names make that brittle. (2) Ordered creation and termination: trainer-0 becomes Ready before trainer-1 starts. The rank-0 worker is conventionally the coordinator, and ordered start ensures it's up before the others try to reach it. (3) Per-pod PVCs via volumeClaimTemplates: each pod gets its own PVC automatically (e.g., data-trainer-0, data-trainer-1), so per-rank checkpoints don't collide. None of these guarantees come with a Deployment.
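
The three guarantees above map onto three parts of a manifest. What follows is a minimal sketch, not the lab's reference solution: the names trainer, trainer-svc, and data follow the examples in the answer, while the image, the rendezvous port 29500, the mount path, the storage size, and the MASTER_ADDR / MASTER_PORT variables are assumptions chosen for illustration.

```yaml
# Headless Service: clusterIP None makes DNS return per-pod records
# (trainer-0.trainer-svc, trainer-1.trainer-svc, ...) instead of one virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: trainer-svc
spec:
  clusterIP: None
  selector:
    app: trainer
  ports:
    - name: rendezvous
      port: 29500                 # assumed rendezvous port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trainer
spec:
  serviceName: trainer-svc        # must reference the headless Service above
  replicas: 3                     # pods are named trainer-0, trainer-1, trainer-2
  selector:
    matchLabels:
      app: trainer
  template:
    metadata:
      labels:
        app: trainer
    spec:
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # hypothetical image
          env:
            - name: MASTER_ADDR   # assumed: how the trainer locates rank 0
              value: trainer-0.trainer-svc
            - name: MASTER_PORT
              value: "29500"
          volumeMounts:
            - name: data
              mountPath: /checkpoints                  # assumed path
  volumeClaimTemplates:           # yields PVCs data-trainer-0, data-trainer-1, ...
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi         # assumed size
```

The default podManagementPolicy (OrderedReady) is what provides the ordered start, so the sketch leaves it unset.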

When should I NOT use a StatefulSet?

For stateless replicas — model inference servers, API gateways, function workers — Deployment is correct. StatefulSet's ordered creation is slower on scale-up (you wait for each pod to be Ready before the next starts), and pod identity is overhead you don't need. The rule of thumb: do your replicas have any requirement beyond "be one of N identical workers"? If yes (stable hostname, ordered start, per-pod storage), StatefulSet. If no, Deployment. The most common "wrong choice" is using a StatefulSet for inference because someone wanted "stable pod names" — which they almost certainly don't need.
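
For contrast, a stateless inference service needs nothing beyond "N identical workers behind a Service". The sketch below assumes a hypothetical image listening on port 8000 and a cluster where the NVIDIA device plugin already advertises nvidia.com/gpu; neither detail comes from the lab itself.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
spec:
  replicas: 3                     # interchangeable replicas; scale freely
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
        - name: server
          image: registry.example.com/inference-server:latest   # hypothetical image
          ports:
            - containerPort: 8000                                # assumed port
          resources:
            limits:
              nvidia.com/gpu: 1   # assumes the device plugin is installed
---
# Regular (non-headless) Service: requests are spread across whichever pods are Ready.
apiVersion: v1
kind: Service
metadata:
  name: inference
spec:
  selector:
    app: inference
  ports:
    - port: 80
      targetPort: 8000
```

Note what is absent compared with a StatefulSet: no serviceName, no volumeClaimTemplates, and no assumption about pod names.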

What workloads should be DaemonSets?

Anything that should run exactly one pod per matching node for the lifetime of that node. The classic AI examples: nvidia-dcgm-exporter (GPU metrics, one per GPU node), nvidia-device-plugin (advertises nvidia.com/gpu capacity, one per GPU node), gpu-feature-discovery (writes node labels, one per GPU node), node-local model caches that pre-pull weights, and log/trace collectors. DaemonSets target nodes via nodeSelector or nodeAffinity and automatically get a new pod when a matching node joins the cluster. In current Kubernetes releases these pods still go through the regular scheduling path: the DaemonSet controller creates one pod per eligible node and pins it to that node with a node-affinity term, and the default kube-scheduler then binds it. What differs from a Deployment is who decides the count and placement (the controller, per node), not which component does the binding.
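
A minimal sketch of a GPU-node monitoring DaemonSet might look like the following. The name gpu-monitor and the image are hypothetical, and the nodeSelector reuses the nvidia.com/gpu.product label value from this FAQ; substitute whatever labels your GPU nodes actually carry.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-monitor               # hypothetical name
spec:
  # No replicas field: the pod count follows the number of matching nodes.
  selector:
    matchLabels:
      app: gpu-monitor
  template:
    metadata:
      labels:
        app: gpu-monitor
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100   # only nodes with this label get a pod
      containers:
        - name: agent
          image: registry.example.com/gpu-monitor:latest   # hypothetical image
```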

Why is my DaemonSet missing pods on some nodes?

Three common reasons. (1) The nodeSelector doesn't match: your spec targets nvidia.com/gpu.product=NVIDIA-A100 but those nodes are labeled Tesla-K80. Run kubectl get nodes --show-labels | grep nvidia to see what's actually labeled. (2) Node taints with no matching toleration: control-plane nodes typically carry node-role.kubernetes.io/control-plane:NoSchedule, and your DaemonSet needs a toleration to run there. The nvidia.com/gpu=present:NoSchedule taint is similarly common in production GPU pools. (3) Pod resource requests exceed the node's allocatable: DaemonSet pods compete for resources like any other pod, so if your pod requests 8Gi memory and the node has 7Gi allocatable, the pod stays Pending. The first two causes look different from the third. With a selector mismatch or an untolerated NoSchedule taint, the controller never creates a pod for that node, so there are no pod events to read; with a resource shortfall, the pod exists and sits in Pending. Diagnose with kubectl describe ds <name> (its status shows how many nodes are desired, currently scheduled, and ready; compare the desired count with the number of nodes you expected) and, where a pod exists, kubectl describe pod <ds-pod> (events explain the per-node failure).
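
Causes (2) and (3) are both fixed in the pod template. The fragment below is a sketch of the relevant part of a DaemonSet's spec.template.spec; the taint keys are the ones named above, while the container name, image, and request sizes are assumptions to adjust for your agent.

```yaml
# Fragment of spec.template.spec in a DaemonSet (not a complete manifest).
tolerations:
  - key: nvidia.com/gpu           # tolerates nvidia.com/gpu=present:NoSchedule
    operator: Equal
    value: present
    effect: NoSchedule
  - key: node-role.kubernetes.io/control-plane
    operator: Exists              # include only if control-plane nodes should run the agent
    effect: NoSchedule
containers:
  - name: agent                   # hypothetical container
    image: registry.example.com/gpu-monitor:latest
    resources:
      requests:
        cpu: 100m                 # assumed; keep well under the smallest node's allocatable
        memory: 256Mi
```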