Profile PyTorch Training with the Built-in Profiler
Instrument a training loop with torch.profiler, read the op-level table, inspect the Chrome/Perfetto timeline, and decide when to reach for Nsight Systems instead.
What you'll learn
1. Profile a training step, extract top ops
2. The canonical key_averages().table() summary
3. Memory profiling: which op allocates the most VRAM?
4. Export a Chrome trace for visual analysis
Prerequisites
- Comfortable writing a PyTorch training loop (forward, loss, backward, optimizer step)
- Basic understanding of CPU vs GPU execution and why `torch.cuda.synchronize()` matters
- Familiarity with nn.Module / CNNs (the demo uses a small conv net)
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across op-level profiling, per-op memory accounting, and Chrome/Perfetto trace analysis.
What you'll profile in this PyTorch Profiler lab
PyTorch Profiler is the five-lines-of-code tool that tells you which op is eating your training budget — aten::addmm, cudnn::conv2d, or the Optimizer.step you forgot was expensive. It's the first profiler every PyTorch engineer should learn and the one you'll reach for 80% of the time. You'll walk away with four concrete artifacts from a real training loop: a top_ops ranking by CUDA time, the canonical key_averages().table() string you paste into GitHub issues, a per-op memory allocation breakdown in MiB, and a Chrome trace JSON you open in Perfetto to see CUDA streams and CPU threads on a unified timeline. Plus the judgment to know when to graduate from this tool to Nsight Systems (for system-level view) or Nsight Compute (for SM-level kernel internals). About 35 minutes on a live NVIDIA GPU pod — PyTorch, CUDA, and the profiler extensions are ready; Perfetto runs in your browser at ui.perfetto.dev.
The technical substance is torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) plus three critical switches: profile_memory=True (lights up per-op device_memory_usage, which surfaces activation-storage hot spots that weight checkpointing or lower-precision recompute can actually fix), record_shapes=True (tensor dims so you can tell which conv layer is the culprit when the same op name shows up ten times), and export_chrome_trace (writes Chrome Trace Event Format JSON). The non-negotiable: warmup outside the profiled region and torch.cuda.synchronize() before exiting the context. Without it, async kernel launches + lazy cuDNN autotuning + allocator warmup contaminate every number. The trap engineers fall into: aggregated tables hide scheduler gaps, implicit syncs, and H2D serialization — those pathologies only show up on the Perfetto timeline. Tables answer 'which op is hot,' timelines answer 'why is the GPU idle between ops.'
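The pattern described above can be sketched end to end. The toy conv net, tensor sizes, and iteration counts below are illustrative assumptions, not the lab's actual code, and the sketch falls back to CPU-only profiling when no GPU is present:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model and data (assumptions for illustration; the lab uses its own net)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 16 * 16, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(4, 3, 16, 16, device=device)
y = torch.randint(0, 10, (4,), device=device)

def train_step():
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Warmup OUTSIDE the profiled region: lazy kernel compilation, cuDNN
# autotuning, and allocator warmup must not contaminate the numbers
for _ in range(3):
    train_step()

activities = [ProfilerActivity.CPU]
if device == "cuda":
    torch.cuda.synchronize()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities,
             profile_memory=True,       # per-op device memory usage
             record_shapes=True) as prof:  # tensor dims per op
    for _ in range(2):
        train_step()
    if device == "cuda":
        torch.cuda.synchronize()  # drain async kernels before exiting

prof.export_chrome_trace("trace.json")  # open at ui.perfetto.dev
sort_key = "cuda_time_total" if device == "cuda" else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```

The sync inside the `with` block is the non-negotiable part: without it, kernels launched during the last step are attributed outside the profiled region or not at all.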
The hand-off rule: when a single aten::addmm is hot and you need warp-level instruction mix, Tensor Core utilization, SM occupancy, or L2 hit rate, you drop into Nsight Compute. For cross-process, CUDA-graph, full-CUDA-API visibility, you use Nsight Systems. PyTorch Profiler lives squarely in between: model-level diagnosis, memory accounting, and a good-enough timeline for 80% of what you'll debug. Knowing which tool to reach for is half the skill.
Prereqs: a PyTorch training loop (forward/loss/backward/optimizer), nn.Module familiarity, a mental model of why torch.cuda.synchronize() matters. Preinstalled: PyTorch, CUDA, profiler extensions, Perfetto link. Grading validates real artifacts: top_ops has ≥5 entries with recognizable torch/autograd markers (aten::, cudnn::, Optimizer), table_str contains the canonical columns, mem_ops shows at least one op above 1 MiB, and your Chrome trace is valid JSON with >50 events.
Frequently asked questions
Do I need TensorBoard to use the PyTorch Profiler?
No. The torch.profiler.profile context manager plus key_averages().table() gives you the full op-level summary directly in your notebook, and export_chrome_trace writes a JSON file that opens in any Chromium browser at chrome://tracing or in Perfetto. TensorBoard adds a nicer web UI on top of the same data via the torch-tb-profiler plugin, but it's optional. This lab uses the native API and Perfetto so you can read a trace without any extra install.
Why do I need warmup and torch.cuda.synchronize() before timing?
Because CUDA kernel launches are asynchronous: model(x) returns immediately while the GPU is still crunching. Without synchronize() your timer captures the launch, not the work. Worse, the first few iterations hit lazy kernel compilation, cuDNN autotuner probing, and allocator warmup; those costs contaminate the profile. The Step 1 hint makes both explicit: run a few training steps outside the profiler, then enter the with profile(...) block, then sync right before exit.
What does profile_memory=True actually record?
Per-op device memory allocation. You read evt.device_memory_usage (or cuda_memory_usage on older torch) for each aggregated event and convert to MiB — the top entries are usually activation storage for convolution/matmul ops, not weights. That's how you find the ops to target with activation checkpointing or a lower-precision recompute. record_shapes=True adds tensor dimensions so you can see which layer width is the culprit when the same op name appears multiple times.
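A sketch of that extraction. The helper name is made up, the demo workload is an assumption, and on a CPU-only machine the device-memory column will read as zero (the pattern is what matters):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def mem_ops_from(prof, n=5):
    """Per-op device memory in MiB, largest first."""
    rows = []
    for evt in prof.key_averages():
        # bytes allocated on the device by this op; older torch
        # spells the attribute cuda_memory_usage
        nbytes = getattr(evt, "device_memory_usage", None)
        if nbytes is None:
            nbytes = getattr(evt, "cuda_memory_usage", 0)
        rows.append({"name": evt.key, "cuda_mem_mib": nbytes / 2**20})
    rows.sort(key=lambda r: r["cuda_mem_mib"], reverse=True)
    return rows[:n]

# Demo: a 512x512 float32 matmul output is a fresh 1 MiB allocation
device = "cuda" if torch.cuda.is_available() else "cpu"
acts = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=acts, profile_memory=True) as prof:
    a = torch.randn(512, 512, device=device)
    b = a @ a

for row in mem_ops_from(prof):
    print(f"{row['name']:<30} {row['cuda_mem_mib']:8.2f} MiB")
```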
How do I read a Chrome trace in Perfetto?
Open ui.perfetto.dev in your browser, drag your exported JSON onto the page, and you'll see CUDA streams, CPU threads, and NVTX markers on a unified timeline. The important shapes: long solid CUDA-stream blocks (kernels running), gaps in the CUDA stream with CPU-thread activity (launch-bound), and cudaMemcpy regions that don't overlap with kernels (H2D stall). Perfetto is the same engine Chrome DevTools uses — if you've read a flame chart before, it'll feel familiar.
When does PyTorch Profiler stop being enough?
When a single aten::addmm is hot and you need to know why — warp-level instruction mix, Tensor Core utilization, SM occupancy, L2 cache hit rate — you drop into Nsight Compute. For a system-level view across multiple processes, CUDA graphs, and the full CUDA API, you use Nsight Systems. PyTorch Profiler lives in between: model-level diagnosis and memory accounting. Great first tool, rarely the last.
What does the lab grade on each step?
Step 1 checks that top_ops has ≥5 entries, each with name, cpu_time_us, cuda_time_us, and that the top op name contains a real torch/autograd marker (aten::, cudnn::, Optimizer, gemm, etc.). Step 2 verifies table_str is >200 chars and contains the canonical columns. Step 3 validates mem_ops has ≥3 ops with real cuda_mem_mib values and at least one above 1 MiB. Step 4 confirms the trace file exists, is valid JSON, and has >50 trace events.
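The Step 4 check can be approximated in a few lines. The filename and helper name are assumptions, and the demo writes a synthetic trace so the sketch runs anywhere:

```python
import json

def validate_trace(path, min_events=50):
    with open(path) as f:
        data = json.load(f)  # raises ValueError if the file is not valid JSON
    # a Chrome Trace Event Format file is either a bare event list or an
    # object with a "traceEvents" key
    events = data["traceEvents"] if isinstance(data, dict) else data
    return len(events) > min_events

# Synthetic trace for the demo: 60 complete ("X") events
events = [{"ph": "X", "name": f"op{i}", "ts": i * 10, "dur": 5,
           "pid": 0, "tid": 0} for i in range(60)]
with open("trace.json", "w") as f:
    json.dump({"traceEvents": events}, f)

print(validate_trace("trace.json"))  # → True: 60 events clears the >50 bar
```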