Nsight Systems Profiling: Finding the Bottleneck That Costs You 40% of Your GPU
Run the full profile-then-fix loop with NVIDIA Nsight Systems — instrument a training loop with NVTX ranges, capture a .nsys-rep, parse the NVTX summary to pinpoint the bottleneck, then apply a targeted fix and measure the speedup.
What you'll learn
- Instrument a training workload with NVTX ranges
- Capture an Nsight profile with nsys profile
- Find the bottleneck by parsing the NVTX range report
- Apply a profile-driven fix and measure the gain
Prerequisites
- PyTorch training loop basics
- Comfortable with subprocess / shell invocation from Python
- Understanding of CPU vs GPU work and pipeline stalls
What you'll capture in this Nsight Systems lab
Most PyTorch training jobs leave 30-50% of GPU time on the floor, and the reason is almost never the model — it's a launch gap, a missed overlap, or a CPU dispatcher that can't keep up. Nsight Systems is how senior engineers find those gaps in minutes instead of weeks. You'll run the full profile-then-fix loop end to end: an instrumented training script with NVTX phase markers, a real .nsys-rep capture, a parsed NVTX summary that names the bottleneck, a targeted fix, and a measured speedup against the original. By the end you'll have an nsys workflow you can paste into a CI job (capture → parse → regression-test) and the mental model that separates CPU-launch-bound from compute-bound from memory-bandwidth-bound workloads. Expect about 35 minutes on a real NVIDIA GPU pod we provision — nsys, the CUDA toolkit, and PyTorch are preinstalled, and you can download the .nsys-rep to open in the Nsight Systems desktop UI.
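The parse step of that capture → parse → regression-test loop can be sketched in a few lines of Python. This is a sketch, not the lab's grader: it assumes the nvtx_sum CSV has columns named "Range" and "Total Time (ns)", which may differ across nsys versions, and the `sample` string is a toy stand-in for real `nsys stats --report=nvtx_sum --format=csv` output.

```python
import csv
import io

def top_ranges(csv_text: str) -> list[tuple[str, int]]:
    """Rank NVTX ranges by total time from nvtx_sum CSV output."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Column names are assumptions based on one nsys version's report layout.
    ranked = [(r["Range"], int(r["Total Time (ns)"])) for r in rows]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Toy CSV mimicking the report's shape (the real report has more columns):
sample = """Range,Total Time (ns)
forward,22000000
data_load,140000000
backward,31000000
"""
# top_ranges(sample)[0] names the bottleneck phase — here, data_load.
```

A regression test in CI would simply assert that data_load no longer ranks first after the fix lands.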
The technical core is the capture command and what it actually traces: nsys profile --trace=cuda,nvtx,osrt --sample=cpu --stats=true --output=profile.nsys-rep python workload.py. cuda catches cudaLaunchKernel / cudaMemcpyAsync and per-stream kernel execution; nvtx picks up the torch.cuda.nvtx.range_push('forward') markers you injected (and the ones PyTorch emits internally for torch ops); osrt traces OS runtime thread activity so you can see CPU workers blocking on pthread_cond_wait. NVTX is the piece that turns an Nsight trace from anonymous sm80_xmma_gemm_... kernel names into a readable timeline of data_load, forward, backward, optimizer. The headline insight: Nsight shows you the gaps between kernels — stretches where the GPU is idle because the CPU hasn't dispatched the next launch, or because batch N+1 wasn't ready when batch N finished. That's the view PyTorch Profiler structurally cannot give you.
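The capture side can be sketched as an argv list built in Python, mirroring the flags above — a sketch assuming `nsys` is on PATH and `workload.py` is your instrumented script:

```python
def nsys_capture_cmd(script: str, out: str = "profile.nsys-rep") -> list[str]:
    """Build the capture invocation described above as an argv list."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx,osrt",  # CUDA API + NVTX ranges + OS runtime threads
        "--sample=cpu",            # sample CPU stacks for host-side hotspots
        "--stats=true",            # print API/kernel summaries after capture
        f"--output={out}",
        "python", script,
    ]

# Inside workload.py, phases are labeled with NVTX ranges, e.g.:
#   torch.cuda.nvtx.range_push("forward")
#   out = model(batch)
#   torch.cuda.nvtx.range_pop()
#
# To run the capture: subprocess.run(nsys_capture_cmd("workload.py"), check=True)
```

Building the command as a list (rather than a shell string) is what lets the same function feed both an interactive run and a CI job.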
The second insight most engineers miss: the lab's fix (pre-allocating data on GPU) is a teaching device, not a production pattern. Real datasets don't fit in VRAM. The right production fix is async DataLoader with num_workers>0, pin_memory=True, non_blocking=True transfers, or a DALI pipeline — so batch N+1 loads while batch N computes. The reflection forces you to describe what overlap actually looks like on the Nsight timeline: data_load on a CPU worker thread running concurrently with forward/backward kernels on the CUDA stream, not serially before them. That's how you verify from a profile whether the fix worked.
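The production pattern above can be sketched as a DataLoader configuration — the argument names are standard torch.utils.data.DataLoader parameters, but the values are illustrative, not tuned:

```python
# Illustrative production-style loader settings (values are assumptions):
loader_kwargs = dict(
    batch_size=256,
    num_workers=4,            # CPU workers prefetch batch N+1 while the GPU runs batch N
    pin_memory=True,          # page-locked host memory enables async H2D copies
    persistent_workers=True,  # keep workers alive across epochs
)

# Per-batch transfer on the training side:
#   x = x.to("cuda", non_blocking=True)  # overlaps the copy with compute when pinned
```

The pin_memory / non_blocking pair is what actually buys the overlap: without pinned host buffers, the "async" copy silently falls back to a synchronous one.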
Prereqs: PyTorch training-loop basics, subprocess.run comfort in Python, a working mental model of CPU-GPU pipeline stalls. Preinstalled: nsys CLI, CUDA toolkit, PyTorch, JupyterLab. Grading checks real artifacts: the .nsys-rep must be >0.5 MB (anything smaller means profiling didn't actually run), cuda_api_summary must contain real CUDA API names like cudaLaunchKernel or cudaMemcpy, top_kernels must have data_load ranked in the top 2, and your post-fix step time must be at least 1.1× faster than the baseline — proving the profile-driven diagnosis was correct.
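The speedup gate above reduces to one ratio; a minimal sketch of how a grader (or your own CI check) might compute it:

```python
def passes_speedup_gate(before_avg_step_ms: float, after_avg_step_ms: float,
                        threshold: float = 1.1) -> bool:
    """The lab's criterion: post-fix steps must be at least 1.1x faster."""
    return before_avg_step_ms / after_avg_step_ms >= threshold

# e.g. a 20 ms baseline step that drops to 15 ms is a 1.33x speedup — pass;
# dropping only to 19.5 ms (1.03x) would fail the gate.
```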
Frequently asked questions
What's the difference between Nsight Systems and PyTorch Profiler?
PyTorch Profiler operates at the torch-op level — aten::conv2d, Optimizer.step — and its main outputs are the key_averages() table and a Chrome trace of torch ops. Nsight Systems operates a layer below that: it traces the CUDA API (cudaLaunchKernel, cudaMemcpy), CUDA stream activity, OS threads, and your NVTX ranges, on a unified timeline. Use PyTorch Profiler to answer 'which op is expensive'; use Nsight Systems to answer 'why is my GPU idle between ops'.
Why NVTX ranges instead of just reading the raw kernel names?
A raw kernel name like sm80_xmma_gemm_f16f16_f16f32_... tells you it's a GEMM but not which layer in your model. NVTX markers (torch.cuda.nvtx.range_push('forward')) inject human-readable labels into the CUDA stream, so nsys stats --report=nvtx_sum groups kernels by phase and you immediately see data_load: 140 ms, forward: 22 ms, backward: 31 ms. PyTorch also emits NVTX markers internally for torch ops when the profiler is active, which is why --trace=cuda,nvtx,osrt is the canonical trace set.
Does the lab's fix (pre-allocating data on GPU) work in production?
No — real datasets don't fit in VRAM. The production fix is an async DataLoader with num_workers>0, pin_memory=True, non_blocking=True transfers, or a DALI pipeline — so batch N+1 loads while batch N computes. The point of the lab's fix is to prove, with a clean before/after measurement, that the bottleneck was what the profile said it was.
How do I tell from an Nsight timeline whether my CPU-GPU pipeline is overlapped?
Look for data_load activity on a CPU worker thread running concurrently with forward/backward kernels on the CUDA stream — not before them. If the CUDA stream has a gap that's exactly aligned with a data_load NVTX range on the CPU side, your pipeline is serial. The NVTX ranges are what make this visually obvious; without them, you're squinting at anonymous CPU-thread samples and trying to correlate by wall-clock.
Can I use nsys without opening the desktop Nsight Systems UI?
Yes — nsys profile --stats=true prints a CUDA API summary and kernel summary directly to stdout, and nsys stats --report=nvtx_sum --format=csv emits parseable CSV. You can build the whole automated-profile pipeline (capture, parse, surface the bottleneck, regression-test) from a Python script, which is what a real CI job would do. The GUI adds the pretty interactive timeline, but all the data is in the .nsys-rep file.
What exactly does each step check?
Step 1 verifies that nsys_version contains NVIDIA/Nsight branding and that your workload script exists with NVTX markers and a CUDA import. Step 2 requires that the .nsys-rep report exists, is >0.5 MB, and that the CUDA API summary mentions cudaLaunchKernel/cudaMemcpy. Step 3 parses top_kernels, checks that at least 3 of the canonical NVTX ranges are present, and asserts that data_load ranks in the top 2. Step 4 computes before_avg_step_ms / after_avg_step_ms and requires the speedup to exceed 1.1×.