GPU Container Lifecycle: Build, Test, Ship, Rollback

Walk through the full lifecycle of a production GPU container — multi-stage Dockerfile, self-hosted GPU CI, a fail-fast smoke test, and a Kubernetes Deployment with readiness probes gated on real GPU compute. The pipeline that stops bad images before users see a 500.

40 min · 4 steps · 2 domains · Intermediate · NCP-AIO · NCA-AIIO · NCA-GENL · NCP-AII

What you'll learn

  1. The production Dockerfile
  2. The CI workflow
  3. The GPU smoke test
  4. Kubernetes rollout + rollback

Prerequisites

  • Comfortable with Docker multi-stage builds and Dockerfile directives
  • Basic Kubernetes (Deployments, probes, nodeSelector)
  • Familiarity with CI/CD concepts (GitHub Actions or equivalent)

Exam domains covered

GPU Infrastructure & Operations · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Docker · Kubernetes · CI/CD · GitHub Actions · nvidia-smi · HEALTHCHECK · Rollout · Rollback

What you'll build in this GPU container lifecycle lab

Shipping a GPU container to production is where most AI teams first discover that docker build + docker push is not a deploy pipeline. This lab gives you the full pattern — multi-stage Dockerfile, self-hosted GPU CI, a GPU-aware smoke test, Kubernetes Deployment with probes gated on real compute, and a rollout/rollback playbook — so the next time a driver mismatch, a cuDNN drift, or a silent VRAM leak tries to reach users, it hits a wall instead. You'll leave with a runnable Dockerfile on nvidia/cuda:*-runtime-ubuntu22.04, a GitHub Actions workflow targeting a self-hosted, gpu runner, a three-check smoke test with distinct exit codes, a Deployment YAML with maxUnavailable: 0 / maxSurge: 1 and both probes, and a concrete mental model of why four layers of essentially-the-same-test is the feature, not the bug. ~40 minutes on a real NVIDIA GPU pod we hand you; no local Docker, no kubectl context juggling.

The technical backbone is defense in depth on the same question — can this container actually use a GPU right now? — asked from four places. CI runs the smoke test on a self-hosted GPU runner (GitHub-hosted runners have no GPUs, so docker run --gpus all silently no-ops and your pipeline lies to you). Dockerfile HEALTHCHECK invokes python3 /app/smoke_test.py — NVML init, torch.cuda.is_available(), tensor round-trip — not curl /healthz, because HTTP 200 proves Python is running, not that CUDA is. The readiness probe reuses the same script to gate traffic on a pod that booted but can't allocate VRAM. The liveness probe reruns it continuously to catch the slow-motion failures — Xid fatal errors, thermal throttling, memory fragmentation, cuDNN version drift, kernel hangs — that CI on a cold GPU in a short run will never see. maxUnavailable: 0 + maxSurge: 1 gives you zero-downtime rolling updates on GPU-constrained clusters without deadlocking.
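A minimal sketch of such a three-check smoke test — the exit-code mapping (1/2/3) and tensor size are illustrative assumptions, not the lab's canonical script:

```python
#!/usr/bin/env python3
"""GPU smoke test with a distinct exit code per failed check.

Illustrative sketch: the 1/2/3 exit-code scheme and the 64x64
tensor shape are assumptions, not the lab's exact script.
"""
import sys


def nvml_ok() -> bool:
    """Check 1: can we talk to the NVIDIA driver via NVML?"""
    try:
        import pynvml
        pynvml.nvmlInit()
        pynvml.nvmlShutdown()
        return True
    except Exception:
        return False


def cuda_ok() -> bool:
    """Check 2: does PyTorch see a usable CUDA device?"""
    try:
        import torch
        return torch.cuda.is_available()
    except Exception:
        return False


def roundtrip_ok() -> bool:
    """Check 3: allocate on the GPU and copy back intact."""
    try:
        import torch
        x = torch.randn(64, 64)
        y = x.cuda().cpu()  # host -> device -> host round-trip
        return bool(torch.equal(x, y))
    except Exception:
        return False


def main() -> int:
    if not nvml_ok():
        return 1  # driver/NVML layer broken
    if not cuda_ok():
        return 2  # CUDA runtime not usable from PyTorch
    if not roundtrip_ok():
        return 3  # device visible but compute/copy failed
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Because each failure mode gets its own exit code, a red health check tells you which layer broke without shelling into the container.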

Prereqs: Docker multi-stage builds, basic Kubernetes (Deployments, probes, nodeSelector, resources.limits), and CI/CD concepts (GitHub Actions or equivalent). Preinstalled on the lab pod: Docker, NVIDIA Container Toolkit, kubectl, PyTorch, and CUDA. Grading checks the artifacts the way a reviewer would: the Dockerfile must have ≥2 FROM stages, a non-root USER, and a GPU-aware HEALTHCHECK; the workflow must declare build + test jobs on a GPU runner with a push stage gated on main; the smoke test must exit 0 on healthy and non-zero when CUDA_VISIBLE_DEVICES=''; the Deployment must declare both probes, nvidia.com/gpu, and the rolling-update knobs. The reflection step asks you to instrument past the four layers — DCGM's DCGM_FI_DEV_XID_ERRORS and SM_ACTIVE fields, a model-shaped forward-pass probe — which is how you graduate from 'ships clean' to 'stays clean at 3am'.
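A Dockerfile skeleton that would satisfy those checks might look like this — the image tags, paths, and server command are placeholders, not the graded solution:

```dockerfile
# --- Stage 1: install Python deps in a fatter devel image ---
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends python3-pip
COPY requirements.txt .
RUN pip3 install --no-cache-dir --prefix=/install -r requirements.txt

# --- Stage 2: slim runtime image ---
FROM nvidia/cuda:12.2.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --create-home appuser
COPY --from=builder /install /usr/local
COPY smoke_test.py server.py /app/
USER appuser

# GPU-aware health check -- not curl /healthz
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python3 /app/smoke_test.py

CMD ["python3", "/app/server.py"]
```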

Frequently asked questions

Why target runs-on: [self-hosted, gpu] instead of a GitHub-hosted runner?

GitHub-hosted runners don't have NVIDIA GPUs — your docker run --gpus all test step would either skip or silently pass. You either register a self-hosted runner on a GPU host, use a managed GPU CI provider (BuildJet, Actuated, Namespace), or rely on GitHub's large-runner GPU tier where available. The lab shows the self-hosted, gpu pattern because it's the most portable and because it teaches you to separate the 'where does my image build' question from the 'where does my image test' question — in production those often run on very different hardware.
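A sketch of that split in workflow form — job names, image tags, and registry details are illustrative, not the graded solution:

```yaml
# .github/workflows/gpu-ci.yml -- illustrative sketch
name: gpu-ci
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest            # building the image needs no GPU
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:${{ github.sha }} .

  test:
    needs: build
    runs-on: [self-hosted, gpu]       # smoke test must see a real GPU
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t myapp:${{ github.sha }} .
      - run: >
          docker run --rm --gpus all
          myapp:${{ github.sha }} python3 /app/smoke_test.py

  push:
    needs: test
    if: github.ref == 'refs/heads/main'   # promotion gated on main
    runs-on: [self-hosted, gpu]
    steps:
      - run: docker push myapp:${{ github.sha }}
```

Note the build/test split: the build job runs on a cheap hosted runner, and only the GPU-dependent test job pays for self-hosted hardware.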

Should the HEALTHCHECK in the Dockerfile call curl http://localhost:8000/healthz?

No — that's the mistake this lab is specifically designed to correct. A successful HTTP 200 from your server process proves Python is running; it proves nothing about the GPU. Dockerfile HEALTHCHECK should invoke a GPU-aware probe — python3 /app/smoke_test.py that calls torch.cuda.is_available(), initializes NVML, and allocates a small tensor — so Docker marks the container unhealthy when the driver, the toolkit, or the card itself fail. Your Kubernetes readiness probe reuses the exact same script.

Why distinguish readiness from liveness when both call the same smoke test?

Readiness answers 'should this pod receive traffic right now?' and failure simply pulls the pod out of the Service endpoint list — no restart. Liveness answers 'is this pod permanently wedged?' and failure triggers a kubelet restart. Using the same script is fine; using the same thresholds is a bug. Readiness should fail fast (a few seconds) during startup while the model warms up; liveness should fail slow (tens of seconds, with a tolerant failureThreshold) so transient Xid errors or a briefly-stuck kernel don't thrash your pods into a CrashLoopBackOff.
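Same script, different knobs — one way to express that in the pod spec, with illustrative starting values:

```yaml
readinessProbe:               # fail fast: pull the pod out of rotation quickly
  exec:
    command: ["python3", "/app/smoke_test.py"]
  periodSeconds: 5
  failureThreshold: 1
livenessProbe:                # fail slow: only restart on sustained failure
  exec:
    command: ["python3", "/app/smoke_test.py"]
  initialDelaySeconds: 60     # let the model warm up before judging
  periodSeconds: 30
  failureThreshold: 5         # tolerate transient Xid errors / stuck kernels
```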

What can go wrong at runtime that CI and HEALTHCHECK won't catch?

Plenty. Xid fatal errors after hours of load. Memory fragmentation that only surfaces when a larger-than-usual batch arrives. Thermal throttling that turns into a kernel timeout. cuDNN version drift between warm-up and real traffic. Driver hangs after a specific sequence of CUDA API calls. Model shard OOMs under production-shaped inputs. CI runs briefly on a cold GPU; real traffic is long, hot, and adversarial. That's exactly why the livenessProbe reruns the smoke test continuously and why you'd graduate to a richer probe that exercises a real model forward pass.
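Graduating to a model-shaped probe can be as simple as timing one real forward pass — a sketch, where the model, input shape, and latency budget are assumptions standing in for your serving setup:

```python
"""Richer liveness probe: exercise a real forward pass on the GPU.

Sketch only -- the Linear model, batch shape, and 5s budget are
assumptions; in practice you load (or share) the serving model.
"""
import sys
import time


def forward_pass_ok(max_seconds: float = 5.0) -> bool:
    try:
        import torch
        # Stand-in for the real serving model.
        model = torch.nn.Linear(512, 512).cuda().eval()
        x = torch.randn(8, 512, device="cuda")  # production-shaped batch
        start = time.monotonic()
        with torch.no_grad():
            y = model(x)
        torch.cuda.synchronize()                # flush async kernels
        elapsed = time.monotonic() - start
        # Catch NaN poisoning and slow kernels, not just hard crashes.
        return elapsed < max_seconds and bool(torch.isfinite(y).all())
    except Exception:
        return False


if __name__ == "__main__":
    sys.exit(0 if forward_pass_ok() else 1)
```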

Why use maxUnavailable: 0, maxSurge: 1 for the rolling update?

GPU pods are expensive and cluster GPU capacity is usually the binding constraint. maxUnavailable: 0 guarantees the old pod keeps serving traffic until the new one passes readiness, so you never drop below full capacity. maxSurge: 1 says you're willing to temporarily run one extra replica during the rollout — critical because without surge, with maxUnavailable: 0, the rollout would deadlock waiting for a pod that can't start until another pod dies. The combination gives you zero-downtime deploys on constrained GPU nodes.
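In Deployment terms, those two knobs live under spec.strategy — replica count, labels, and the image tag below are placeholders:

```yaml
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below full serving capacity
      maxSurge: 1         # allow one extra GPU pod during the rollout
  template:
    spec:
      containers:
        - name: inference
          image: myapp:abc123          # placeholder tag
          resources:
            limits:
              nvidia.com/gpu: 1
```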

How does grading work for the Kubernetes step?

The validator parses your deployment_yaml string and checks for the required fields: apiVersion, kind: Deployment, strategy.rollingUpdate with both maxUnavailable and maxSurge, nvidia.com/gpu in resource limits, and both readinessProbe and livenessProbe. It also inspects your rollback_commands list for kubectl rollout history and undo, and verifies your image_promotion_flow defines at least three stages (dev → staging → prod) each with env, tag_pattern, and gate keys. Nothing is applied to a live cluster — the lab grades the artifacts, not a running rollout.
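A minimal sketch of that style of static check — plain substring matching stands in for real YAML parsing here, and the field names are the ones the grading description lists, not the validator's actual implementation:

```python
"""Static artifact checks in the spirit of the lab's validator.

Sketch with assumed helper names; substring matching keeps the
example stdlib-only instead of depending on a YAML parser.
"""

REQUIRED_DEPLOYMENT_FIELDS = [
    "apiVersion",
    "kind: Deployment",
    "maxUnavailable",
    "maxSurge",
    "nvidia.com/gpu",
    "readinessProbe",
    "livenessProbe",
]


def missing_fields(deployment_yaml: str) -> list[str]:
    """Return the required fields absent from the manifest text."""
    return [f for f in REQUIRED_DEPLOYMENT_FIELDS if f not in deployment_yaml]


def rollback_commands_ok(commands: list[str]) -> bool:
    """Expect both a history inspection and an undo."""
    joined = " ".join(commands)
    return ("kubectl rollout history" in joined
            and "kubectl rollout undo" in joined)
```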