Profile PyTorch Training with the Built-in Profiler
Instrument a training loop with torch.profiler, read the op-level table, inspect the Chrome/Perfetto timeline, and decide when to reach for Nsight Systems instead.
What you'll learn
1. Profile a training step, extract top ops
2. The canonical key_averages().table() summary
3. Memory profiling: which op allocates the most VRAM?
4. Export a Chrome trace for visual analysis
Prerequisites
- Comfortable writing a PyTorch training loop (forward, loss, backward, optimizer step)
- Basic understanding of CPU vs GPU execution and why `torch.cuda.synchronize()` matters
- Familiarity with nn.Module / CNNs (the demo uses a small conv net)
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across op-level profiling, per-op memory accounting, and Chrome/Perfetto trace analysis.
What you'll profile in this PyTorch Profiler lab
PyTorch Profiler is the five-lines-of-code tool that tells you which op is eating your training budget — aten::addmm, cudnn::conv2d, or the Optimizer.step you forgot was expensive. It's the first profiler every PyTorch engineer should learn and the one you'll reach for 80% of the time. You'll walk away with four concrete artifacts from a real training loop: a top_ops ranking by CUDA time, the canonical key_averages().table() string you paste into GitHub issues, a per-op memory allocation breakdown in MiB, and a Chrome trace JSON you open in Perfetto to see CUDA streams and CPU threads on a unified timeline. Plus the judgment to know when to graduate from this tool to Nsight Systems (for system-level view) or Nsight Compute (for SM-level kernel internals). About 35 minutes on a live NVIDIA GPU pod — PyTorch, CUDA, and the profiler extensions are ready; Perfetto runs in your browser at ui.perfetto.dev.
The technical substance is torch.profiler.profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) plus three critical switches: profile_memory=True (lights up per-op device_memory_usage, which surfaces activation-storage hot spots that weight checkpointing or lower-precision recompute can actually fix), record_shapes=True (tensor dims so you can tell which conv layer is the culprit when the same op name shows up ten times), and export_chrome_trace (writes Chrome Trace Event Format JSON). The non-negotiable: warmup outside the profiled region and torch.cuda.synchronize() before exiting the context. Without it, async kernel launches + lazy cuDNN autotuning + allocator warmup contaminate every number. The trap engineers fall into: aggregated tables hide scheduler gaps, implicit syncs, and H2D serialization — those pathologies only show up on the Perfetto timeline. Tables answer 'which op is hot,' timelines answer 'why is the GPU idle between ops.'
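The pattern described above can be sketched end to end. The toy conv net, tensor sizes, and iteration counts below are illustrative assumptions, not the lab's actual code, and the sketch falls back to CPU-only profiling when no GPU is present:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Toy model and data (assumptions for illustration; the lab uses its own net)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 16 * 16, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(4, 3, 16, 16, device=device)
y = torch.randint(0, 10, (4,), device=device)

def train_step():
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Warmup OUTSIDE the profiled region: lazy kernel compilation, cuDNN
# autotuning, and allocator warmup must not contaminate the numbers
for _ in range(3):
    train_step()

activities = [ProfilerActivity.CPU]
if device == "cuda":
    torch.cuda.synchronize()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities,
             profile_memory=True,       # per-op device memory usage
             record_shapes=True) as prof:  # tensor dims per op
    for _ in range(2):
        train_step()
    if device == "cuda":
        torch.cuda.synchronize()  # drain async kernels before exiting

prof.export_chrome_trace("trace.json")  # open at ui.perfetto.dev
sort_key = "cuda_time_total" if device == "cuda" else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))
```

The sync inside the `with` block is the non-negotiable part: without it, kernels launched during the last step are attributed outside the profiled region or not at all.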
The hand-off rule: when a single aten::addmm is hot and you need warp-level instruction mix, Tensor Core utilization, SM occupancy, or L2 hit rate, you drop into Nsight Compute. For cross-process, CUDA-graph, full-CUDA-API visibility, you use Nsight Systems. PyTorch Profiler lives squarely in between: model-level diagnosis, memory accounting, and a good-enough timeline for 80% of what you'll debug. Knowing which tool to reach for is half the skill.
Prereqs: a PyTorch training loop (forward/loss/backward/optimizer), nn.Module familiarity, a mental model of why torch.cuda.synchronize() matters. Preinstalled: PyTorch, CUDA, profiler extensions, Perfetto link. Grading validates real artifacts: top_ops has ≥5 entries with recognizable torch/autograd markers (aten::, cudnn::, Optimizer), table_str contains the canonical columns, mem_ops shows at least one op above 1 MiB, and your Chrome trace is valid JSON with >50 events.
Frequently asked questions
Do I need TensorBoard to use the PyTorch Profiler?
No. The torch.profiler.profile context manager plus key_averages().table() gives you the full op-level summary directly in your notebook, and export_chrome_trace writes a JSON file that opens in any Chromium browser at chrome://tracing or in Perfetto. TensorBoard adds a nicer web UI on top of the same data via the torch-tb-profiler plugin, but it's optional. This lab uses the native API and Perfetto so you can read a trace without any extra install.
Why do I need warmup and torch.cuda.synchronize() before timing?
Because CUDA kernel launches are asynchronous: model(x) returns immediately while the GPU is still crunching. Without synchronize() your timer captures the launch, not the work. Worse, the first few iterations hit lazy kernel compilation, cuDNN autotuner probing, and allocator warmup; those costs contaminate the profile. The Step 1 hint makes both explicit: run a few training steps outside the profiler, then enter the with profile(...) block, then sync right before exit.
What does profile_memory=True actually record?
Per-op device memory allocation. You read evt.device_memory_usage (or cuda_memory_usage on older torch) for each aggregated event and convert to MiB — the top entries are usually activation storage for convolution/matmul ops, not weights. That's how you find the ops to target with activation checkpointing or a lower-precision recompute. record_shapes=True adds tensor dimensions so you can see which layer width is the culprit when the same op name appears multiple times.
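A sketch of that extraction. The helper name is made up, the demo workload is an assumption, and on a CPU-only machine the device-memory column will read as zero (the pattern is what matters):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def mem_ops_from(prof, n=5):
    """Per-op device memory in MiB, largest first."""
    rows = []
    for evt in prof.key_averages():
        # bytes allocated on the device by this op; older torch
        # spells the attribute cuda_memory_usage
        nbytes = getattr(evt, "device_memory_usage", None)
        if nbytes is None:
            nbytes = getattr(evt, "cuda_memory_usage", 0)
        rows.append({"name": evt.key, "cuda_mem_mib": nbytes / 2**20})
    rows.sort(key=lambda r: r["cuda_mem_mib"], reverse=True)
    return rows[:n]

# Demo: a 512x512 float32 matmul output is a fresh 1 MiB allocation
device = "cuda" if torch.cuda.is_available() else "cpu"
acts = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with profile(activities=acts, profile_memory=True) as prof:
    a = torch.randn(512, 512, device=device)
    b = a @ a

for row in mem_ops_from(prof):
    print(f"{row['name']:<30} {row['cuda_mem_mib']:8.2f} MiB")
```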
How do I read a Chrome trace in Perfetto?
Open ui.perfetto.dev in your browser, drag your exported JSON onto the page, and you'll see CUDA streams, CPU threads, and NVTX markers on a unified timeline. The important shapes: long solid CUDA-stream blocks (kernels running), gaps in the CUDA stream with CPU-thread activity (launch-bound), and cudaMemcpy regions that don't overlap with kernels (H2D stall). Perfetto is the same engine Chrome DevTools uses — if you've read a flame chart before, it'll feel familiar.
When does PyTorch Profiler stop being enough?
When a single aten::addmm is hot and you need to know why — warp-level instruction mix, Tensor Core utilization, SM occupancy, L2 cache hit rate — you drop into Nsight Compute. For a system-level view across multiple processes, CUDA graphs, and the full CUDA API, you use Nsight Systems. PyTorch Profiler lives in between: model-level diagnosis and memory accounting. Great first tool, rarely the last.
What does the lab grade on each step?
Step 1 checks that top_ops has ≥5 entries, each with name, cpu_time_us, cuda_time_us, and that the top op name contains a real torch/autograd marker (aten::, cudnn::, Optimizer, gemm, etc.). Step 2 verifies table_str is >200 chars and contains the canonical columns. Step 3 validates mem_ops has ≥3 ops with real cuda_mem_mib values and at least one above 1 MiB. Step 4 confirms the trace file exists, is valid JSON, and has >50 trace events.
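The Step 4 check can be approximated in a few lines. The filename and helper name are assumptions, and the demo writes a synthetic trace so the sketch runs anywhere:

```python
import json

def validate_trace(path, min_events=50):
    with open(path) as f:
        data = json.load(f)  # raises ValueError if the file is not valid JSON
    # a Chrome Trace Event Format file is either a bare event list or an
    # object with a "traceEvents" key
    events = data["traceEvents"] if isinstance(data, dict) else data
    return len(events) > min_events

# Synthetic trace for the demo: 60 complete ("X") events
events = [{"ph": "X", "name": f"op{i}", "ts": i * 10, "dur": 5,
           "pid": 0, "tid": 0} for i in range(60)]
with open("trace.json", "w") as f:
    json.dump({"traceEvents": events}, f)

print(validate_trace("trace.json"))  # → True: 60 events clears the >50 bar
```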