Multi-GPU-Type Targeting — nodeSelector, nodeAffinity & Tolerations
Your fleet has A100s for training, L40S for inference, and Tesla-K80s for dev. Workloads need to land on the right hardware. Master the four primitives Kubernetes gives you — nodeSelector, nodeAffinity (required vs preferred), taints + tolerations — across a real multi-pool cluster.
What you'll learn
- 1. The multi-GPU node landscape: read what your fleet has. This cluster has 5 nodes across 3 GPU types, the same shape you'll see in a real multi-tenant GPU production cluster.
- 2. nodeSelector: target a single GPU product. nodeSelector is the simplest of the four placement primitives. The pod spec carries a map[string]string of label keys and values; the scheduler considers a node a candidate ONLY if every key=value pair in the pod's selector matches a label on the node.
- 3. nodeAffinity: OR semantics, set membership, and soft preference. nodeAffinity is nodeSelector with those three superpowers added.
- 4. Taints + tolerations: reserved-pool patterns. nodeSelector and nodeAffinity are pod-side opt-INs. Taints are node-side opt-OUTs: they exclude pods by default unless the pod actively tolerates the taint. The asymmetry is exactly the production pattern for reserved hardware: "this A100 pool is for the ML research team only, and no random pod lands here without explicit permission."
- 5. Triage Day: three placement failures. Three pods are stuck Pending, one for each placement primitive you've worked through.
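As a minimal sketch of the nodeSelector primitive from step 2, the manifest below pins a pod to a single GPU product. The pod name, image, and command are hypothetical placeholders, and the label value NVIDIA-A100 is an assumption: confirm the exact value your nodes carry with `kubectl get nodes --show-labels` before relying on it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-a100                 # hypothetical pod name
spec:
  # Every key=value pair listed here must match a node label,
  # otherwise the node is filtered out as a candidate.
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100   # assumed label value; verify on your nodes
  containers:
    - name: main
      image: busybox:1.36          # placeholder image, not a real GPU workload
      command: ["sleep", "3600"]
```

If the label value contains a typo, no node passes the filter and the pod stays Pending, which is one of the failure modes Triage Day exercises.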
Prerequisites
- Completed `nca-aiio-gpu-operator-chain` (or comfortable with gpu-feature-discovery labels)
- Familiar with `kubectl describe pod` events and reading scheduler messages
This is an intermediate-level AI/ML lab.
What you'll learn
Kubernetes ships four primitives for placing pods on the right nodes: nodeSelector, nodeAffinity (required and preferred), node taints, and pod tolerations. Each has different semantics: a selector is a hard filter, affinity can be hard or soft, and taints repel every pod that does not tolerate them. The combination you choose depends on the workload-to-hardware mapping you're encoding: "this pod MUST run on A100" vs "this pod PREFERS A100 but L40S is fine" vs "this hardware is reserved for these pods only."
This lab walks through all four on a real multi-pool cluster (simulated Tesla-K80, A100-80GB, and L40S nodes). You'll author selectors that target one product line, nodeAffinity expressions that match a set, and tolerations that opt into reserved hardware, then finish with a Triage Day where three pods are broken at three placement layers: a selector typo, a missing toleration, and an over-strict required affinity. Recognizing each in kubectl describe pod events is the platform-engineering muscle the NCA-AIIO exam tests under AI Infrastructure & Operations.
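The "PREFERS A100 but L40S is fine" mapping could be encoded roughly as follows. This is a sketch, not the lab's own manifest: the pod name, image, weight, and the label values NVIDIA-A100 and NVIDIA-L40S are assumptions to be checked against your nodes' actual labels.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: infer-prefer-a100          # hypothetical pod name
spec:
  affinity:
    nodeAffinity:
      # Hard filter: the node must carry one of the listed GPU products.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values: ["NVIDIA-A100", "NVIDIA-L40S"]   # assumed label values
      # Soft preference: among nodes that pass the filter, favor A100.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80               # 1-100; higher weights count for more in scoring
          preference:
            matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values: ["NVIDIA-A100"]
  containers:
    - name: main
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
```

Dropping the preferred block turns this into plain set membership ("A100 OR L40S"), and narrowing the required values to a single product makes it behave like a nodeSelector.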
Frequently asked questions
When should I use nodeSelector vs nodeAffinity?
nodeSelector is the simplest case: one or a few key=value matches, all of which must hold. If your only condition is "node has label nvidia.com/gpu.product=NVIDIA-A100", nodeSelector is one line and reads cleanly. nodeAffinity is the richer form: it supports the operators In, NotIn, Exists, DoesNotExist, Gt, and Lt, lets you OR multiple terms together, and crucially has both a requiredDuringSchedulingIgnoredDuringExecution mode (a hard match, like nodeSelector) and a preferredDuringSchedulingIgnoredDuringExecution mode (a soft preference, with weights). Use nodeAffinity when you need OR semantics ("A100 OR H100"), set membership, or the soft-preference fallback ("prefer A100, but L40S is acceptable"). Stick with nodeSelector for the trivial hard match.
What's the difference between taints and nodeSelector? They both block scheduling.
nodeSelector is on the pod and says "this pod requires nodes with these labels"; pods opt IN to specific nodes. Taints are on the node and say "this node repels pods that don't have a matching toleration"; pods opt IN by tolerating, but the default is to be excluded. The asymmetry matters in production: if your A100 nodes are expensive, you want the default to be "no random pod lands here", and that's taints. If workloads need to discriminate between hardware but the hardware is shared freely, that's nodeSelector. Most production fleets use BOTH: the taint on A100 nodes keeps every non-tolerating pod off them, and a nodeSelector on the tolerating pods keeps them from landing anywhere else, because a toleration permits scheduling onto a tainted node but does not attract the pod to it.
What does requiredDuringSchedulingIgnoredDuringExecution actually do?
It is a hard requirement evaluated at scheduling time: if no node satisfies the expression, the pod stays Pending (with FailedScheduling events). The IgnoredDuringExecution half means a pod that is already running is not evicted if the node's labels change later. The mirror is preferredDuringSchedulingIgnoredDuringExecution, which is soft: the scheduler tries to honor the preference but falls back to any otherwise-eligible node if nothing matches. The required version is what most NCA-AIIO content tests; the preferred version is the production-friendly "prefer this hardware but don't block on it." Note: there is no requiredDuringSchedulingRequiredDuringExecution mode (it would mean live eviction on label change); Kubernetes doesn't ship that.
Why does my pod stay Pending with node(s) had untolerated taint?
The node carries a taint that your pod does not tolerate. Inspect the node's taints (kubectl get node <name> -o jsonpath='{.spec.taints}'), then add a matching toleration to your pod spec. A toleration is matched against the taint's key, value, and effect. With operator Equal (the default), key, value, and effect must all match. With operator Exists, no value is given and any value for that key matches, which is useful for opting in to every taint with a given key. Two special cases widen the match: an empty effect matches all effects for the key, and an empty key with operator Exists tolerates ALL taints (use sparingly). Production GPU node pools commonly carry the taint nvidia.com/gpu=present:NoSchedule to repel non-GPU workloads; only pods with tolerations: [{key: nvidia.com/gpu, operator: Exists, effect: NoSchedule}] land there.
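Putting the reserved-pool pattern together, a pod that both tolerates the GPU taint and targets A100 nodes might look like the sketch below. It assumes the nodes were tainted with nvidia.com/gpu=present:NoSchedule as described above; the pod name, image, and the label value NVIDIA-A100 are hypothetical and should be checked against your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: research-job               # hypothetical pod name
spec:
  # Steers the pod to A100 nodes; the toleration alone would not do this.
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100   # assumed label value; verify on your nodes
  # Lets the pod pass the taint nvidia.com/gpu=present:NoSchedule.
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists             # matches any value for this key, so no value field
      effect: NoSchedule
  containers:
    - name: main
      image: busybox:1.36          # placeholder image
      command: ["sleep", "3600"]
```

Removing the tolerations block should reproduce the untolerated-taint Pending state on a tainted pool, while removing the nodeSelector leaves the pod free to land on any untainted node instead.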