CUDA Programming Fundamentals
Write four real CUDA C++ kernels and run them from PyTorch: vector add, 2D matrix add, tiled matmul with shared memory, and a custom autograd op.
What you'll learn
1. Your first CUDA kernel — vector add
2. 2D thread grids — matrix add
3. Tiled matmul with shared memory
4. Custom autograd op — wire a kernel into PyTorch
Prerequisites
- Comfortable with PyTorch tensors and CUDA basics
- Basic C/C++ — pointers, arrays, function signatures
- Understanding of matrix multiplication
What you'll build in this CUDA lab
Writing CUDA kernels is the skill that separates AI engineers who consume PyTorch from the ones who extend it — and in 2026, with every inference optimisation (FlashAttention-3, PagedAttention, fused MoE gating, custom quantisation kernels) shipping as bespoke CUDA, it's the highest-leverage low-level skill an ML engineer can own. In about 45 minutes on a real NVIDIA GPU we provision, you'll walk away with four working kernels you wrote from scratch — vector add, 2D matrix add, a shared-memory tiled matmul, and a custom ReLU with a backward pass that drops into torch.autograd like any native op — plus a concrete mental model of the grid-block-warp-thread hierarchy and where your handwritten code still loses to cuBLAS.
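In sketch form, the Step 1 kernel looks like this (the launch configuration shown is illustrative; the graded version targets 1024 elements):

```cuda
#include <cuda_runtime.h>

// One thread per element: each thread computes its global index and
// adds a single pair of values.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard: last block may overshoot
}

// Launch: enough 256-thread blocks to cover n elements, e.g. for n = 1024:
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```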
The substance is the CUDA you're usually shielded from: __global__ kernels, thread indexing with blockIdx.x * blockDim.x + threadIdx.x, the <<<blocks, threads>>> launch syntax, 2D dim3 grids, and — for the tiled matmul — declaring __shared__ float sA[TILE][TILE], staging tiles, and synchronising with __syncthreads(). The matmul step benchmarks your kernel against torch.matmul so you see the 5-10x gap with cuBLAS yourself; that's the concrete answer to "when is it worth writing a kernel?" (custom fused ops, unusual shapes, ops vendor libs don't cover) versus "when should I just call cuBLAS / cuDNN / FlashAttention?" The autograd step uses torch.utils.cpp_extension.load_inline — the same JIT path FlashAttention, xformers, and most of torchao ship on — to compile CUDA source and bind a forward + backward kernel that loss.backward() flows through cleanly.
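The tiling pattern described above can be sketched as follows — a simplified version that assumes square N×N matrices with N a multiple of TILE (the lab's kernel handles the general case):

```cuda
#define TILE 16

// Each block computes one TILE×TILE output tile, staging input tiles
// through shared memory so each global-memory element is loaded once
// per tile rather than once per multiply.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each input tile.
        sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully staged before anyone reads it

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load
    }
    C[row * N + col] = acc;
}
```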
Prerequisites: comfort with PyTorch tensors plus enough C/C++ to read pointer arithmetic and function signatures. Prior CUDA experience isn't assumed — launch syntax and thread indexing are introduced from scratch at the start. The sandbox is a real NVIDIA GPU pod with nvcc, the CUDA toolkit, PyTorch, and the load_inline compile cache preinstalled, so iteration is measured in seconds per kernel edit rather than minutes of build setup. The lab also feeds the exam prep track for NVIDIA's NCP-AII, NCP-ADS, and NCP-GenL certifications, where low-level GPU programming shows up.
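The Step 4 wiring — compile forward and backward kernels with load_inline, then bind them through torch.autograd.Function — can be sketched like this. The kernel and names here are illustrative stand-ins, not the lab's exact sources, and running it requires a CUDA GPU with nvcc on the PATH:

```python
import torch
from torch.utils.cpp_extension import load_inline

# Toy CUDA source: a ReLU forward kernel plus a backward kernel that
# multiplies grad_output by the ReLU mask.
cuda_src = r"""
__global__ void relu_fwd(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] > 0.f ? x[i] : 0.f;
}
__global__ void relu_bwd(const float* x, const float* g, float* gx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) gx[i] = x[i] > 0.f ? g[i] : 0.f;  // grad_output * mask
}
torch::Tensor relu_forward(torch::Tensor x) {
    auto y = torch::empty_like(x);
    int n = x.numel();
    relu_fwd<<<(n + 255) / 256, 256>>>(x.data_ptr<float>(), y.data_ptr<float>(), n);
    return y;
}
torch::Tensor relu_backward(torch::Tensor x, torch::Tensor g) {
    auto gx = torch::empty_like(x);
    int n = x.numel();
    relu_bwd<<<(n + 255) / 256, 256>>>(x.data_ptr<float>(), g.data_ptr<float>(),
                                       gx.data_ptr<float>(), n);
    return gx;
}
"""
cpp_src = ("torch::Tensor relu_forward(torch::Tensor x);\n"
           "torch::Tensor relu_backward(torch::Tensor x, torch::Tensor g);")

# JIT-compile with nvcc; the module is cached by source hash.
ext = load_inline(name="toy_relu", cpp_sources=cpp_src, cuda_sources=cuda_src,
                  functions=["relu_forward", "relu_backward"])

class ToyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return ext.relu_forward(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return ext.relu_backward(x, grad_output.contiguous())

# loss.backward() now flows through ToyReLU.apply like any native op.
```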
Frequently asked questions
Do I need prior CUDA experience to take this lab?
No — prior CUDA experience isn't assumed. The lab introduces __global__, thread indexing (blockIdx.x * blockDim.x + threadIdx.x), and the <<<blocks, threads>>> launch syntax from scratch. You do need basic C/C++ fluency — reading pointer arithmetic, function signatures, and loops — and comfort with PyTorch tensors. If you can read a short .cu file and roughly follow what it does, you're ready.
Why write a matmul kernel when cuBLAS already exists?
Step 3 has you benchmark your tiled kernel against torch.matmul, and you'll typically see a 5–10× gap. That concrete comparison teaches when writing a kernel pays off (custom fused ops, unusual tensor shapes, operations vendor libraries don't cover) versus when you should lean on cuBLAS / cuDNN / FlashAttention instead.
Does the custom autograd op in Step 4 actually work with loss.backward()?
Yes. You subclass torch.autograd.Function, implement forward so it calls your CUDA ReLU kernel, and backward so it calls a second kernel multiplying grad_output by the ReLU mask. After the step, gradients flow through your kernel cleanly and match the reference torch.nn.functional.relu. This is the exact pattern real libraries use — FlashAttention, xformers, and most of torchao ship custom CUDA ops bound this way.
What is torch.utils.cpp_extension.load_inline and why use it instead of a setup.py build?
load_inline takes a string of CUDA source, JIT-compiles it with nvcc, caches the compiled module by source hash, and returns a Python-callable binding — no setup.py, no build system, no .cu files to manage. It's the fastest way to iterate on kernels during research and prototyping, which is why the PyTorch team added it. For shipping production ops you'd graduate to a proper extension build, but the kernel code itself is identical.
What do I need installed locally?
Nothing — the lab runs in a provisioned sandbox: a real NVIDIA GPU pod with nvcc, the CUDA toolkit, PyTorch, and the load_inline compile cache preinstalled.
How is each step auto-graded?
Step 1 checks that your vec_add output matches a + b on 1024 elements and that you measured CPU vs GPU timing. Step 2 verifies mat_add on a 256×384 matrix. Step 3 confirms your tiled matmul matches torch.matmul within floating-point tolerance and that the benchmark cell ran. Step 4 checks that your PreporatoReLU.apply produces correct forward values and correct gradients through .backward().