Reproducible Training: The Flags, The Cost, The Artifacts
GPU sandbox · jupyter
Beta

Measure the non-determinism noise floor in default PyTorch, flip every determinism flag until same-seed runs match bit-for-bit, quantify the perf cost, and capture a content-addressable training config that makes a run reproducible forever.

40 min · 4 steps · 2 domains · Intermediate · NCP-GENL · NCP-AIO · NCP-ADS · NCA-GENL · NCA-GENM

What you'll learn

  1. The noise floor you're fighting
  2. Turn determinism on and re-verify
  3. Measure the cost, catalog what breaks
  4. Content-addressable training runs

Prerequisites

  • Comfortable with PyTorch training loops
  • Basic familiarity with CUDA / cuDNN configuration
  • Understanding of hashing and content addressing

Exam domains covered

MLOps & Training Workflows · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Reproducibility · Determinism · PyTorch · cuDNN · cuBLAS · Seeding · MLOps · Config Hashing

What you'll build in this reproducibility lab

Every ML engineer has been here: same seed, same code, two runs, two different final losses — and the security team wants to know why. Bit-for-bit reproducibility is a stack problem, not a seed problem, and this lab walks you through the whole stack. You'll walk away with a measured non-determinism noise floor on default PyTorch (the baseline everyone pretends doesn't exist), a flags-flipped configuration that produces byte-identical same-seed runs, a quantified determinism_cost_pct for your workload, a catalog of real non-det PyTorch ops (scatter_add, index_add, bincount, grid_sample), and a content-addressable training_config hash plus a multi-category reproducibility checklist you can hand to CI. About 40 minutes on a live NVIDIA GPU pod — PyTorch, a writable CUBLAS_WORKSPACE_CONFIG, and a clean environment are ready.

The substance is every knob you need and why each one matters. torch.use_deterministic_algorithms(True) is the umbrella switch — it'll raise if you haven't also set CUBLAS_WORKSPACE_CONFIG=:4096:8 because cuBLAS can't honor the guarantee without a fixed-size deterministic workspace. torch.backends.cudnn.deterministic = True stops cuDNN from hunting among kernel variants run-to-run; cudnn.benchmark = False stops the autotuner from making different choices based on warmup timings. Python's PYTHONHASHSEED matters because dict iteration order leaks into reduction order in some code paths. The deeper insight: same-seed runs drift by default because atomicAdd float reductions are order-dependent (FP addition isn't associative) and cuDNN picks among multiple kernel implementations by autotuning — seed controls your RNG, not the GPU's reduction order.
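
Put together, those knobs fit in one setup function. This is a sketch, assuming CUDA ≥ 10.2 (where the CUBLAS_WORKSPACE_CONFIG requirement kicks in) and a recent PyTorch; the lab's own step may structure it differently:

```python
import os

# cuBLAS reads this at first initialization, so set it before any CUDA work
# (safest: before importing torch at all).
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG and pin the backends to deterministic kernels."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                    # CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True  # no run-to-run kernel variants
    torch.backends.cudnn.benchmark = False     # no timing-based autotuning
    torch.use_deterministic_algorithms(True)   # raise on any non-det op
```

One caveat: PYTHONHASHSEED only takes effect if it is in the environment before the interpreter starts (e.g. `PYTHONHASHSEED=0 python train.py`); setting it from inside a running process does nothing for dict hashing.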

The cost story is more interesting than 'determinism is slow.' On small models with common ops, flipping the flags often costs single-digit percent or is slightly faster (because cudnn.benchmark was spending time autotuning and you just skipped that). On convnets with heavy cuDNN coverage, 10-30% is typical. On ops without a deterministic GPU path — scatter_add, index_put(accumulate=True), ctc_loss — it can be multiples, or PyTorch will raise and force you to redesign. The bigger catch: even with every flag on, a CUDA driver bump, a cuDNN kernel ship, or an A100→H100 migration will change your numbers without touching your code. That's why the content-addressable config hash is not enough on its own — you have to pin the container image digest, cuDNN version, and dataset checksum alongside the seed. Turn determinism on for debugging, CI, and security-sensitive models; turn it off for production training and lean on pinned environments + detailed logging.
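
Putting a number on that cost is a matter of timing the same step under both configurations. A minimal harness sketch (the lab's own grader and its determinism_cost_pct artifact may be structured differently):

```python
import time

import torch
import torch.nn as nn


def mean_step_ms(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                 iters: int = 30) -> float:
    """Average wall-clock milliseconds per forward+backward+optimizer step."""
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(5):  # warmup: lets cudnn.benchmark autotune if it's enabled
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # GPU timings are meaningless without a sync
    t0 = time.perf_counter()
    for _ in range(iters):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000 / iters

# Measure t_default with the flags off and t_det with them on, then:
#   determinism_cost_pct = 100 * (t_det - t_default) / t_default
```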

Prereqs: PyTorch training-loop comfort, basic cuDNN/cuBLAS configuration, content-addressable hashing familiarity. Preinstalled: PyTorch, writable CUBLAS_WORKSPACE_CONFIG, JupyterLab. Grading is quantitative: the noise-floor step asserts the loss gap and weight L2 both exceed 1e-6, the determinism step requires same-seed runs match to < 1e-6 and weights to < 1e-5 with every flag set in determinism_settings, the cost step bounds determinism_cost_pct to [-50%, +300%] and cross-checks it against your measured timings, and the config step hashes the same config twice (must match), hashes a perturbed version (must differ), and requires a checklist spanning at least three categories (code, data, env, random state, hardware).
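
The hashing side of that grader can be as simple as canonical JSON through SHA-256. A sketch — the field names below (seed, dataset_sha256, image_digest) are illustrative, not the lab's schema:

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    """Content-address a training config: canonical JSON -> SHA-256 hex digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


cfg = {
    "seed": 42,
    "lr": 3e-4,
    "dataset_sha256": "0f343b0931126a20f133d67c2b018a3b",  # illustrative value
    "image_digest": "sha256:deadbeef",                     # pin the container too
    "cudnn_version": 8902,
}

assert config_hash(cfg) == config_hash(dict(cfg))            # same content, same hash
assert config_hash({**cfg, "seed": 43}) != config_hash(cfg)  # any perturbation moves it
```

Sorting the keys and fixing the separators matter: without a canonical serialization, two semantically identical configs can hash differently.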

Frequently asked questions

Why do two same-seed runs drift in default PyTorch?

Because 'same seed' only controls the pseudo-random sources you hand it — the RNG for weight initialization, dropout masks, data shuffling. It doesn't control the GPU reduction order. Ops like scatter_add and atomicAdd-based kernels sum floats in whichever order the warps finish, and floating-point addition isn't associative, so the final value differs run to run. cuDNN also picks among multiple kernel implementations by autotuning, so the same matmul may use a different algorithm on run B.
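
You can see the root cause without a GPU: floating-point addition isn't associative, so any reduction whose order varies run to run varies its result. A pure-Python illustration:

```python
import random

# Grouping changes the rounded result even for three numbers:
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)  # 0.6000000000000001 vs 0.6

# At GPU scale: the same 100k floats summed in two orders.
rng = random.Random(0)
vals = [rng.uniform(-1.0, 1.0) * 10.0 ** rng.randint(-8, 8) for _ in range(100_000)]
shuffled = vals[:]
rng.shuffle(shuffled)
print(sum(vals), sum(shuffled))  # same multiset of numbers, usually different sums
```

An atomicAdd-based GPU kernel is effectively the shuffled case, with the shuffle decided by warp scheduling.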

What does CUBLAS_WORKSPACE_CONFIG actually do?

It tells cuBLAS to allocate fixed-size workspace buffers instead of whatever happens to be available, which forces it to pick deterministic kernels for GEMM. Set it to :4096:8 (eight 4 MiB workspaces, roughly 24 MiB of extra GPU memory) or :16:8 (eight 16 KiB workspaces — smaller footprint, but it may limit performance) depending on your memory budget. PyTorch's torch.use_deterministic_algorithms(True) will raise if you enable determinism without setting this env var, because without it cuBLAS can't honor the guarantee.

How much does determinism usually cost in wall time?

Depends heavily on the workload. On a small model with simple ops it's often in the single-digit percent range or even slightly faster (because cudnn.benchmark was spending time autotuning and you just disabled that). On convnets with big cuDNN coverage it can be 10-30% slower. On ops where the deterministic path is fundamentally worse — scatter reductions, sparse ops, some interpolations — it can be multiples. The lab records determinism_cost_pct from your own measurement rather than guessing.

Which PyTorch ops have non-deterministic default paths?

The reliable list includes scatter_add, index_add, index_put with accumulate=True, bincount, grid_sample, interpolate (some modes), torch.Tensor.put_ with accumulate=True, torch.nn.functional.ctc_loss, and any op whose GPU implementation reduces with atomicAdd. torch.use_deterministic_algorithms(True) will either pick a slower deterministic path or raise on these — which one depends on the op and PyTorch version.
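
If you'd rather inventory the offenders than crash on the first one, recent PyTorch (1.11+) has a warn-only mode: each non-deterministic op emits a UserWarning instead of raising, which you can collect. A sketch:

```python
import warnings

import torch

# Warn instead of raising, so one representative step surfaces every offender.
torch.use_deterministic_algorithms(True, warn_only=True)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # ... run a representative forward/backward pass here ...
    pass

offenders = {str(w.message) for w in caught}  # dedup the op names for your catalog
```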

Why isn't a content-addressable config hash enough on its own?

Because the hash only covers what you put in it. Things it can't catch: the CUDA driver version shipped in your container, the cuDNN release that updated a kernel yesterday, a silent dataset mirror redirect, a pip-resolved dependency whose version moved because your lockfile was loose, and hardware SKU swaps (an H100 TensorCore path isn't bit-identical to an A100's). The reflection step asks you to name this failure mode and argue for pinning container image digests, dataset checksums, and a full lockfile alongside the seed.

Should I run with determinism on for all production training?

Usually no. Determinism is expensive, sometimes unavailable for ops you need, and unnecessary once you're at the scale where a single run takes days and you're going to log everything anyway. Turn it on for the debugging and CI paths: reproducing a specific gradient explosion, auditing a security-sensitive model, regression-testing a framework upgrade. Turn it off for production training and lean on pinned environments + detailed logging for post-hoc traceability.