Nsight Systems Profiling: Finding the Bottleneck That Costs You 40% of Your GPU
GPU sandbox · jupyter


Run the full profile-then-fix loop with NVIDIA Nsight Systems — instrument a training loop with NVTX ranges, capture a .nsys-rep, parse the NVTX summary to pinpoint the bottleneck, then apply a targeted fix and measure the speedup.

35 min · 4 steps · 2 domains · Intermediate · nca-aiio · ncp-aio · ncp-genl · ncp-ads · nca-genl · nca-genm · ncp-aii

What you'll learn

  1. Instrument a training workload with NVTX ranges
  2. Capture an Nsight profile with nsys profile
  3. Find the bottleneck by parsing the NVTX range report
  4. Apply a profile-driven fix and measure the gain

Prerequisites

  • PyTorch training loop basics
  • Comfortable with subprocess / shell invocation from Python
  • Understanding of CPU vs GPU work and pipeline stalls

Exam domains covered

GPU Acceleration & Distributed Training · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Nsight Systems · nsys · NVTX · Profiling · PyTorch · Bottleneck · CUDA · Performance

What you'll capture in this Nsight Systems lab

Most PyTorch training jobs leave 30-50% of GPU time on the floor, and the reason is almost never the model — it's a launch gap, a missed overlap, or a CPU dispatcher that can't keep up. Nsight Systems is how senior engineers find those gaps in minutes instead of weeks. You'll run the full profile-then-fix loop end to end: an instrumented training script with NVTX phase markers, a real .nsys-rep capture, a parsed NVTX summary that names the bottleneck, a targeted fix, and a measured speedup against the original. By the end you'll have an nsys workflow you can paste into a CI job (capture → parse → regression-test) and the mental model that separates CPU-launch-bound from compute-bound from memory-bandwidth-bound workloads. About 35 minutes on a real NVIDIA GPU pod we provision — nsys, the CUDA toolkit, and PyTorch are preinstalled, and you can download the .nsys-rep to open in the Nsight Systems desktop UI.

The technical core is the capture command and what it actually traces: nsys profile --trace=cuda,nvtx,osrt --sample=cpu --stats=true --output=profile.nsys-rep python workload.py. cuda catches cudaLaunchKernel / cudaMemcpyAsync and per-stream kernel execution; nvtx picks up the torch.cuda.nvtx.range_push('forward') markers you injected (and the ones PyTorch emits internally for torch ops); osrt traces OS runtime thread activity so you can see CPU workers blocking on pthread_cond_wait. NVTX is the piece that turns an Nsight trace from anonymous sm80_xmma_gemm_... kernel names into a readable timeline of data_load, forward, backward, optimizer. The headline insight: Nsight shows you the gaps between kernels — stretches where the GPU is idle because the CPU hasn't dispatched the next launch, or because batch N+1 wasn't ready when batch N finished. That's the view PyTorch Profiler structurally cannot give you.
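The instrumentation side can be sketched in a few lines. This is a minimal example, not the lab's exact script: the phase names (data_load, forward, backward, optimizer) are the ones used above, while the nvtx_range helper is our own convenience wrapper around torch.cuda.nvtx that no-ops when PyTorch isn't installed, so the structure is visible even without a GPU.

```python
from contextlib import contextmanager

try:
    from torch.cuda import nvtx  # real NVTX bindings when PyTorch is installed
except ImportError:
    nvtx = None  # lets the sketch run (unlabeled) without torch


@contextmanager
def nvtx_range(name):
    """Wrap a training phase in a named NVTX range so nsys can label it."""
    if nvtx is not None:
        nvtx.range_push(name)
    try:
        yield
    finally:
        if nvtx is not None:
            nvtx.range_pop()


def train_step():
    # Each phase appears as a named NVTX range on the Nsight timeline.
    with nvtx_range("data_load"):
        pass  # fetch the batch / copy host-to-device here
    with nvtx_range("forward"):
        pass  # loss = model(batch)
    with nvtx_range("backward"):
        pass  # loss.backward()
    with nvtx_range("optimizer"):
        pass  # optimizer.step(); optimizer.zero_grad()
```

The try/finally matters: an exception inside a phase must still pop the range, or every later NVTX label on that thread is mis-nested.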

The second insight most engineers miss: the lab's fix (pre-allocating data on GPU) is a teaching device, not a production pattern. Real datasets don't fit in VRAM. The right production fix is async DataLoader with num_workers>0, pin_memory=True, non_blocking=True transfers, or a DALI pipeline — so batch N+1 loads while batch N computes. The reflection forces you to describe what overlap actually looks like on the Nsight timeline: data_load on a CPU worker thread running concurrently with forward/backward kernels on the CUDA stream, not serially before them. That's how you verify from a profile whether the fix worked.

Prereqs: PyTorch training-loop basics, subprocess.run comfort in Python, a working mental model of CPU-GPU pipeline stalls. Preinstalled: nsys CLI, CUDA toolkit, PyTorch, JupyterLab. Grading checks real artifacts: the .nsys-rep must be >0.5 MB (anything smaller means profiling didn't actually run), cuda_api_summary must contain real CUDA API names like cudaLaunchKernel or cudaMemcpy, top_kernels must have data_load ranked in the top 2, and your post-fix step time must be at least 1.1× faster than the baseline — proving the profile-driven diagnosis was correct.
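The two numeric gates are simple enough to sketch directly. The thresholds below are the ones quoted above; the file path and the example step times are illustrative, not values from the lab.

```python
import os


def report_is_real(path, min_mb=0.5):
    """A .nsys-rep under ~0.5 MB usually means profiling never actually ran."""
    return os.path.exists(path) and os.path.getsize(path) > min_mb * 1024 * 1024


def speedup(before_avg_step_ms, after_avg_step_ms, threshold=1.1):
    """Post-fix step time must be at least `threshold`x faster than baseline."""
    ratio = before_avg_step_ms / after_avg_step_ms
    return ratio, ratio >= threshold


# Illustrative numbers for a data-load-bound run:
ratio, passed = speedup(before_avg_step_ms=190.0, after_avg_step_ms=52.0)
```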

Frequently asked questions

What's the difference between Nsight Systems and PyTorch Profiler?

PyTorch Profiler operates at the torch-op level — aten::conv2d, Optimizer.step — and its main outputs are the key_averages() table and a Chrome trace of torch ops. Nsight Systems operates a layer below that: it traces the CUDA API (cudaLaunchKernel, cudaMemcpy), CUDA stream activity, OS threads, and your NVTX ranges, on a unified timeline. Use PyTorch Profiler to answer 'which op is expensive'; use Nsight Systems to answer 'why is my GPU idle between ops'.

Why NVTX ranges instead of just reading the raw kernel names?

Raw kernel names are anonymous — sm80_xmma_gemm_f16f16_f16f32_... tells you it's a GEMM but not which layer in your model. NVTX markers (torch.cuda.nvtx.range_push('forward')) inject human-readable labels into the CUDA stream, so nsys stats --report=nvtx_sum groups kernels by phase and you immediately see data_load: 140 ms, forward: 22 ms, backward: 31 ms. PyTorch also emits NVTX markers internally for torch ops when the profiler is active, which is why --trace=cuda,nvtx,osrt is the canonical trace set.
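Pulling the bottleneck out of that CSV is a few lines of stdlib Python. One caveat: nsys column headers vary somewhat across versions; this sketch assumes a 'Range' name column and a 'Total Time (ns)' column, and the sample rows below are invented to match the timings quoted above.

```python
import csv
import io


def top_nvtx_ranges(csv_text, n=3):
    """Rank NVTX ranges by total time from
    `nsys stats --report=nvtx_sum --format=csv` output."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Strip thousands separators some nsys versions emit before parsing.
    total = lambda r: int(r["Total Time (ns)"].replace(",", ""))
    rows.sort(key=total, reverse=True)
    return [(r["Range"], total(r)) for r in rows[:n]]


# Invented sample matching a data-load-bound run (140 ms / 31 ms / 22 ms):
sample = """Time (%),Total Time (ns),Range
70.1,140000000,data_load
15.5,31000000,backward
11.0,22000000,forward
3.4,7000000,optimizer
"""
```

With this, the Step 3 grading check ("data_load in the top 2") is a one-line assertion over `top_nvtx_ranges(sample, 2)`.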

Does the lab's fix (pre-allocating data on GPU) work in production?

Absolutely not — it's a teaching device. Pre-loading the full dataset into VRAM is fine on synthetic toy data but fails instantly on ImageNet or any real corpus that doesn't fit in GPU memory. The reflection question at the end asks you to design the real production fix: async DataLoader with num_workers>0, pin_memory=True, non_blocking=True transfers, or a DALI pipeline — so batch N+1 loads while batch N computes. The point of the lab's fix is to prove, with a clean before/after measurement, that the bottleneck was what the profile said it was.

How do I tell from an Nsight timeline whether my CPU-GPU pipeline is overlapped?

On the timeline view, you should see data_load activity on a CPU worker thread running concurrently with forward/backward kernels on the CUDA stream — not before them. If the CUDA stream has a gap that's exactly aligned with a data_load NVTX range on the CPU side, your pipeline is serial. The NVTX ranges are what make this visually obvious; without them, you're squinting at anonymous CPU-thread samples and trying to correlate by wall-clock.

Can I use nsys without opening the desktop Nsight Systems UI?

Yes — that's exactly what Steps 2 and 3 do. nsys profile --stats=true prints a CUDA API summary and kernel summary directly to stdout, and nsys stats --report=nvtx_sum --format=csv emits parseable CSV. You can build the whole automated-profile pipeline (capture, parse, surface the bottleneck, regression-test) from a Python script, which is what a real CI job would do. The GUI adds the pretty interactive timeline, but all the data is in the .nsys-rep file.
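A CI-style driver for that pipeline can be sketched as below. The flags mirror the capture command shown earlier; the script name and output path are illustrative, and we only build and run the argument lists here — actually executing them requires a pod with nsys and a GPU.

```python
import subprocess


def capture_cmd(script="workload.py", out="profile.nsys-rep"):
    """Build the nsys capture command from earlier in this page."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx,osrt",
        "--sample=cpu",
        "--stats=true",
        f"--output={out}",
        "python", script,
    ]


def stats_cmd(report="profile.nsys-rep"):
    """Build the command that emits the NVTX summary as parseable CSV."""
    return ["nsys", "stats", "--report=nvtx_sum", "--format=csv", report]


def run(cmd):
    # check=True turns a failed capture into a loud CI failure.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Chain `run(capture_cmd())`, then feed `run(stats_cmd())` into a CSV parser, and you have the capture → parse → regression-test loop with no GUI involved.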

What exactly does each step check?

Step 1 verifies nsys_version contains NVIDIA/Nsight branding and that your workload script exists with NVTX markers and a CUDA import. Step 2 requires the .nsys-rep report exists and is >0.5 MB and that the CUDA API summary mentions cudaLaunchKernel/cudaMemcpy. Step 3 parses top_kernels, checks at least 3 of the canonical NVTX ranges are present, and asserts data_load ranks in the top 2. Step 4 computes before_avg_step_ms / after_avg_step_ms and requires the speedup to exceed 1.1×.