Kubernetes Resource Requests & Limits — Who Gets What, and Who Survives

Master the most consequential six lines in any Kubernetes manifest: requests, limits, and how they decide scheduling, throttling, eviction, and survival under pressure. Includes the CFS throttling controversy and what 2026 production teams actually do with CPU limits.

40 min · 7 steps · 2 domains · Intermediate · nca-aiio · ncp-aii · ncp-aio

What you'll learn

  1. Admission: the rules before scheduling
     Imagine you and four teammates share a single Kubernetes cluster with, say, 16 CPU cores and 64 GiB of RAM total, plus two GPUs. Everyone deploys whatever they want. Whose pods get the resources? Whose get killed when memory runs out?
  2. Requests: what the scheduler reserves
     You've now seen that LimitRange and ResourceQuota gate-keep at the namespace boundary. Once a pod passes those, the scheduler takes over. And the scheduler looks at exactly one thing: requests.
  3. Limits: and the CPU controversy
     Requests are about the *future*: what will be reserved for you. Limits are about the *present*: what the kubelet (the agent on the node) will let your container actually consume right now.
  4. QoS classes: who survives node pressure
     Kubernetes classifies every pod into one of three QoS classes (Quality of Service) the moment it's admitted. The classification is purely a function of how requests and limits are set:
  5. GPU resources: the rules that surprise everyone
     Now we zoom into nvidia.com/gpu. It's not measured in millicores or mebibytes; it's a count of whole devices. And it has rules that surprise everyone the first time:
  6. Putting it all together: a production-shape GPU pod
     You've now seen every concept the lab covers. Combine them into a single production-shape pod manifest, the kind you'd actually submit for a real GPU workload.
  7. Triage: fix what's broken in your cluster
     You've now learned the mental model. Time to use it. Three pods have just been deployed into your cluster, and each one is broken in a different way that the previous six steps taught you to recognize. Your job: find what's wrong with each, and fix it so all three are Running.

Prerequisites

  • Completed the welcome smoke-test lab
  • Comfortable reading and editing YAML
  • Familiar with kubectl get, apply, describe, logs

Exam domains covered

  • AI Infrastructure & Operations
  • GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level AI/ML lab gives you real-world reps across:

Kubernetes · Resource Requests · Resource Limits · QoS Class · ResourceQuota · LimitRange · CFS Throttling · OOMKilled · GPU Scheduling

What you'll learn

Resource requests and limits are the most consequential six lines in any Kubernetes manifest, but most engineers learn them by copy-pasting examples until something works. This lab teaches the why behind each line, the precise difference between request and limit, why CPU and memory enforcement are fundamentally different, and what actually happens during a node pressure event. You'll watch the kernel OOM-kill a container that exceeds its memory limit, watch CFS bandwidth control throttle a CPU-bound process even when the node is idle, and see Kubernetes reject pods at admission for exceeding the namespace ResourceQuota.

The lab also covers what most courses skip: the real-world controversy around CPU limits. Modern production teams routinely omit CPU limits on latency-sensitive workloads because the CFS bandwidth control mechanism causes throttling even on otherwise-idle nodes. You'll see the throttling happen, then learn the mental model — requests-only for latency-critical services, requests-and-limits for batch and untrusted workloads, always-set memory limits because OOM has no equivalent of throttling. By the end, you'll be able to read any pod manifest and immediately spot what failure modes its author has and hasn't defended against.

Frequently asked questions

What's the difference between a resource request and a resource limit?

A request is what the scheduler uses; it's a reservation. The scheduler compares the sum of the requests already placed on each node against that node's allocatable resources to decide where a new pod can go. Before that, the API server's ResourceQuota admission check adds the pod's requests to the namespace total to decide whether to admit the pod at all. A limit is what the kubelet and Linux kernel enforce at runtime via cgroups. Exceeding a CPU limit causes throttling (the process is paused until the next enforcement period); exceeding a memory limit causes an OOM kill (the process is terminated). Requests determine placement; limits determine in-the-moment behavior.
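The placement rule above can be sketched in a few lines of Python. This is a simplified illustration, not the scheduler's actual code: the function names `parse_quantity` and `fits_on_node` are hypothetical, only a subset of quantity suffixes is handled, and the real scheduler applies many other filters (taints, affinity, and so on). The node size reuses the 16-core, 64 GiB example from step 1.

```python
from decimal import Decimal

# Subset of the suffixes Kubernetes accepts in resource quantities.
_SUFFIXES = {
    "Ki": Decimal(2) ** 10, "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30, "Ti": Decimal(2) ** 40,
    "k": Decimal(10) ** 3, "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9, "T": Decimal(10) ** 12,
    "m": Decimal(1) / 1000,  # milli, as in 500m CPU
}


def parse_quantity(text):
    """Convert a quantity such as '500m', '2', or '4Gi' to a plain number."""
    # Check two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Decimal(text)


def fits_on_node(allocatable, scheduled_requests, new_request):
    """True if new_request fits in what existing requests leave free.

    Only requests are consulted. Actual usage and limits play no part.
    """
    for resource, wanted in new_request.items():
        used = sum(parse_quantity(r.get(resource, "0")) for r in scheduled_requests)
        free = parse_quantity(allocatable.get(resource, "0")) - used
        if parse_quantity(wanted) > free:
            return False
    return True


node = {"cpu": "16", "memory": "64Gi"}
running = [
    {"cpu": "6", "memory": "24Gi"},
    {"cpu": "8", "memory": "16Gi"},
]
print(fits_on_node(node, running, {"cpu": "1500m", "memory": "8Gi"}))  # True
print(fits_on_node(node, running, {"cpu": "2500m", "memory": "1Gi"}))  # False
```

The second pod is rejected because only 2 cores of requests remain unreserved, even if the pods already on the node are sitting idle.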

Should I always set CPU limits on my pods?

No, and this is one of the most contested topics in production Kubernetes operation. CPU limits are enforced via Linux CFS bandwidth control, which can throttle a process even when the node has idle CPU available. For latency-sensitive workloads (HTTP APIs, gateways, real-time services), production teams routinely set CPU requests but omit CPU limits — the request gives the scheduler what it needs, while letting the pod use any spare CPU when available. Memory limits, on the other hand, should always be set, because the alternative is uncontrolled memory growth that triggers node-wide eviction. The lab demonstrates the throttling mechanism so you can make the call yourself.
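To see why throttling can happen on an idle node, it helps to work through the arithmetic. The sketch below is a simplified model, not kernel code: it assumes the default 100 ms enforcement period, a quota equal to the CPU limit times that period, perfectly parallel work that starts at a period boundary, and enough idle cores for every thread. The function name `wall_clock_ms` is hypothetical.

```python
def wall_clock_ms(cpu_work_ms, limit_cores, threads=1, period_ms=100.0):
    """Estimate wall-clock time for a burst of CPU work under a CPU limit.

    Returns (elapsed_ms, throttled_periods). In each period the container
    may consume at most limit_cores * period_ms of CPU time across all
    threads; once that is spent it waits for the next period, no matter
    how many cores are idle.
    """
    if limit_cores <= 0 or threads < 1:
        raise ValueError("limit_cores must be > 0 and threads >= 1")
    quota_ms = limit_cores * period_ms
    remaining = cpu_work_ms
    elapsed = 0.0
    throttled_periods = 0
    while True:
        # CPU time usable in this period: capped by quota and by thread count.
        budget = min(quota_ms, threads * period_ms)
        if remaining <= budget:
            return elapsed + remaining / threads, throttled_periods
        remaining -= budget
        elapsed += period_ms
        if quota_ms < threads * period_ms:
            throttled_periods += 1  # quota ran out before the period ended


# 200 ms of single-threaded work with a 500m limit (50 ms quota per period).
print(wall_clock_ms(200, 0.5))  # (350.0, 3)

# 400 ms of work on 4 threads with a 2-core limit (200 ms quota per period).
print(wall_clock_ms(400, 2, threads=4))  # (150.0, 1)
```

In the first case a request that needs 200 ms of CPU takes 350 ms of wall-clock time, which is the latency penalty that leads teams to omit CPU limits on latency-sensitive services.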

Why is nvidia.com/gpu integer-only when CPU and memory are fractional?

GPUs are an extended resource in Kubernetes' resource model, and extended resources must be integers: the API server rejects fractional values. The reason isn't laziness: a CUDA context can't be safely split at the K8s API layer. If you want fractional GPU sharing, you need an explicit operator-level mechanism: MIG (hardware partitioning), time-slicing (software multiplexing without isolation), or MPS (NVIDIA's Multi-Process Service). Each is configured at the GPU operator before pods are scheduled, and the resulting partitions or replicas are then advertised to the scheduler as whole-number counts, in some configurations under separate resource names. We'll cover each in dedicated later labs.
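A small pre-flight check makes the whole-device rule concrete. This is an illustrative sketch, not the API server's validation code; `check_gpu_resources` is a hypothetical helper. Besides the integer rule, it also checks two related rules from the Kubernetes GPU scheduling documentation: a GPU request needs a matching limit, and when both are present they must be equal.

```python
from decimal import Decimal, InvalidOperation

GPU = "nvidia.com/gpu"


def check_gpu_resources(requests, limits):
    """Return a list of problems with a container's GPU requests and limits."""
    problems = []
    for section, values in (("requests", requests), ("limits", limits)):
        raw = values.get(GPU)
        if raw is None:
            continue
        try:
            count = Decimal(str(raw))
            whole = count.is_finite() and count == count.to_integral_value()
        except InvalidOperation:
            whole = False  # e.g. '500m', which is not a plain integer
        if not whole:
            problems.append(f"{section}: {raw!r} must be a whole number of GPUs")
    if GPU in requests and GPU not in limits:
        problems.append("requests: GPU request set without a GPU limit")
    if GPU in requests and GPU in limits and str(requests[GPU]) != str(limits[GPU]):
        problems.append("GPU request and limit must be equal")
    return problems


print(check_gpu_resources({}, {GPU: "1"}))    # []
print(check_gpu_resources({}, {GPU: "0.5"}))  # one problem: fractional value
print(check_gpu_resources({GPU: "1"}, {}))    # one problem: request without limit
```

Setting only the limit is the common pattern, since Kubernetes then uses the limit as the request.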

What QoS class should production workloads aim for?

Guaranteed for almost anything important. Under node pressure the kubelet evicts roughly in QoS order: BestEffort pods and Burstable pods whose usage exceeds their requests go first (the noisiest neighbors first), while Guaranteed pods and Burstable pods running within their requests go last. A Guaranteed pod is rarely evicted; it typically takes system daemons outgrowing their reserved resources to force that. Achieving Guaranteed requires every container in the pod to have CPU and memory requests and limits set, with each request equal to its corresponding limit. For GPU jobs and stateful workloads this is non-negotiable: the cost of being evicted mid-training is much worse than the cost of CPU throttling.
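The classification rule can be written down as a short function. This is a sketch of the documented rules rather than the kubelet's implementation; `qos_class` is a hypothetical name, containers are represented as plain dicts, and quantities are compared as literal strings, so "1" and "1000m" would be treated as different here even though Kubernetes considers them equal.

```python
def qos_class(containers):
    """Classify a pod from its containers' CPU and memory settings.

    containers: list of dicts with optional 'requests' and 'limits' dicts.
    Only cpu and memory count toward the QoS class.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pod = [{"requests": {"cpu": "500m", "memory": "1Gi"},
        "limits": {"cpu": "500m", "memory": "1Gi"}}]
print(qos_class(pod))                                # Guaranteed
print(qos_class([{"requests": {"cpu": "500m"}}]))    # Burstable
print(qos_class([{}]))                               # BestEffort
```

Note that dropping the CPU limit from a latency-sensitive pod, as discussed in the CPU limits question, moves it from Guaranteed to Burstable; that trade-off is part of the decision.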