CUDA Programming Fundamentals
Write four real CUDA C++ kernels and run them from PyTorch: vector add, 2D matrix add, tiled matmul with shared memory, and a custom autograd op.
What you'll learn
1. Your first CUDA kernel — vector add
2. 2D thread grids — matrix add
3. Tiled matmul with shared memory
4. Custom autograd op — wire a kernel into PyTorch
Prerequisites
- Comfortable with PyTorch tensors and CUDA basics
- Basic C/C++ — pointers, arrays, function signatures
- Understanding of matrix multiplication
What you'll build in this CUDA lab
Writing CUDA kernels is the skill that separates AI engineers who consume PyTorch from the ones who extend it — and in 2026, with every inference optimisation (FlashAttention-3, PagedAttention, fused MoE gating, custom quantisation kernels) shipping as bespoke CUDA, it's the highest-leverage low-level skill an ML engineer can own. In about 45 minutes on a real NVIDIA GPU we provision, you'll walk away with four working kernels you wrote from scratch — vector add, 2D matrix add, a shared-memory tiled matmul, and a custom ReLU with a backward pass that drops into torch.autograd like any native op — plus a concrete mental model of the grid-block-warp-thread hierarchy and where your handwritten code still loses to cuBLAS.
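In sketch form, the Step 1 kernel looks like this (the launch configuration shown is illustrative; the graded version targets 1024 elements):

```cuda
#include <cuda_runtime.h>

// One thread per element: each thread computes its global index and
// adds a single pair of values.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard: last block may overshoot
}

// Launch: enough 256-thread blocks to cover n elements, e.g. for n = 1024:
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```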
The substance is the CUDA you're usually shielded from: __global__ kernels, thread indexing with blockIdx.x * blockDim.x + threadIdx.x, the <<<blocks, threads>>> launch syntax, 2D dim3 grids, and — for the tiled matmul — declaring __shared__ float sA[TILE][TILE], staging tiles, and synchronising with __syncthreads(). The matmul step benchmarks your kernel against torch.matmul so you see the 5-10x gap with cuBLAS yourself; that's the concrete answer to "when is it worth writing a kernel?" (custom fused ops, unusual shapes, ops vendor libs don't cover) versus "when should I just call cuBLAS / cuDNN / FlashAttention?" The autograd step uses torch.utils.cpp_extension.load_inline — the same JIT path FlashAttention, xformers, and most of torchao ship on — to compile CUDA source and bind a forward + backward kernel that loss.backward() flows through cleanly.
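The tiling pattern described above can be sketched as follows — a simplified version that assumes square N×N matrices with N a multiple of TILE (the lab's kernel handles the general case):

```cuda
#define TILE 16

// Each block computes one TILE×TILE output tile, staging input tiles
// through shared memory so each global-memory element is loaded once
// per tile rather than once per multiply.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each input tile.
        sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully staged before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    C[row * N + col] = acc;
}
```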
Prerequisites: comfort with PyTorch tensors plus enough C/C++ to read pointer arithmetic and function signatures. Prior CUDA experience isn't assumed — launch syntax and thread indexing are introduced from scratch at the start. The sandbox is a real NVIDIA GPU pod with nvcc, the CUDA toolkit, PyTorch, and the load_inline compile cache preinstalled, so iteration is measured in seconds per kernel edit rather than minutes of build setup. The lab also feeds the exam prep track for NVIDIA's NCP-AII, NCP-ADS, and NCP-GenL certifications, where low-level GPU programming shows up.
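The Step 4 wiring — compile forward and backward kernels with load_inline, then bind them through torch.autograd.Function — can be sketched like this. The kernel and names here are illustrative stand-ins, not the lab's exact sources, and running it requires a CUDA GPU with nvcc on the PATH:

```python
import torch
from torch.utils.cpp_extension import load_inline

# Toy CUDA source: a ReLU forward kernel plus a backward kernel that
# multiplies grad_output by the ReLU mask.
cuda_src = r"""
__global__ void relu_fwd(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] > 0.f ? x[i] : 0.f;
}
__global__ void relu_bwd(const float* x, const float* g, float* gx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) gx[i] = x[i] > 0.f ? g[i] : 0.f;  // grad_output * mask
}
torch::Tensor relu_forward(torch::Tensor x) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    relu_fwd<<<(n + 255) / 256, 256>>>(x.data_ptr<float>(), y.data_ptr<float>(), n);
    return y;
}
torch::Tensor relu_backward(torch::Tensor x, torch::Tensor g) {
    auto gx = torch::empty_like(x);
    int n = x.numel();
    relu_bwd<<<(n + 255) / 256, 256>>>(x.data_ptr<float>(), g.data_ptr<float>(),
                                       gx.data_ptr<float>(), n);
    return gx;
}
"""
cpp_src = ("torch::Tensor relu_forward(torch::Tensor x);\n"
           "torch::Tensor relu_backward(torch::Tensor x, torch::Tensor g);")

# JIT-compile with nvcc; the module is cached by source hash.
ext = load_inline(name="toy_relu", cpp_sources=cpp_src, cuda_sources=cuda_src,
                  functions=["relu_forward", "relu_backward"])

class ToyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return ext.relu_forward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return ext.relu_backward(x, grad_output.contiguous())

# loss.backward() now flows through ToyReLU.apply like any native op.
```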
Frequently asked questions
Do I need prior CUDA experience to take this lab?
No — prior CUDA experience isn't assumed. The lab introduces __global__, thread indexing (blockIdx.x * blockDim.x + threadIdx.x), and the <<<blocks, threads>>> launch syntax from scratch. You do need basic C/C++ fluency — reading pointer arithmetic, function signatures, and loops — and comfort with PyTorch tensors. If you can read a short .cu file and roughly follow what it does, you're ready.
Why write a matmul kernel when cuBLAS already exists?
Step 3 has you benchmark your tiled kernel against torch.matmul, and you'll typically see a 5–10× gap. That concrete comparison teaches when writing a kernel pays off (custom fused ops, unusual tensor shapes, operations vendor libraries don't cover) versus when you should lean on cuBLAS / cuDNN / FlashAttention instead.
Does the custom autograd op in Step 4 actually work with loss.backward()?
Yes. You subclass torch.autograd.Function, implement forward so it calls your CUDA ReLU kernel, and backward so it calls a second kernel multiplying grad_output by the ReLU mask. After the step, gradients flow through your kernel cleanly and match the reference torch.nn.functional.relu. This is the exact pattern real libraries use — FlashAttention, xformers, and most of torchao ship custom CUDA ops bound this way.
What is torch.utils.cpp_extension.load_inline and why use it instead of a setup.py build?
load_inline takes a string of CUDA source, JIT-compiles it with nvcc, caches the compiled module by source hash, and returns a Python-callable binding — no setup.py, no build system, no .cu files to manage. It's the fastest way to iterate on kernels during research and prototyping, which is why the PyTorch team added it. For shipping production ops you'd graduate to a proper extension build, but the kernel code itself is identical.
What do I need installed locally?
Nothing — the lab runs in a provisioned sandbox: a real NVIDIA GPU pod with nvcc, the CUDA toolkit, PyTorch, and the load_inline compile cache preinstalled.
How is each step auto-graded?
Step 1 checks that your vec_add output matches a + b on 1024 elements and that you measured CPU vs GPU timing. Step 2 verifies mat_add on a 256×384 matrix. Step 3 confirms your tiled matmul matches torch.matmul within floating-point tolerance and that the benchmark cell ran. Step 4 checks that your PreporatoReLU.apply produces correct forward values and correct gradients through .backward().