Nsight Systems Profiling: Finding the Bottleneck That Costs You 40% of Your GPU
GPU sandbox · jupyter


Run the full profile-then-fix loop with NVIDIA Nsight Systems — instrument a training loop with NVTX ranges, capture a .nsys-rep, parse the NVTX summary to pinpoint the bottleneck, then apply a targeted fix and measure the speedup.

35 min · 4 steps · 2 domains · Intermediate · nca-aiio · ncp-aio · ncp-genl · ncp-ads · nca-genl · nca-genm · ncp-aii

What you'll learn

  1. Instrument a training workload with NVTX ranges
  2. Capture an Nsight profile with nsys profile
  3. Find the bottleneck by parsing the NVTX range report
  4. Apply a profile-driven fix and measure the gain

Prerequisites

  • PyTorch training loop basics
  • Comfortable with subprocess / shell invocation from Python
  • Understanding of CPU vs GPU work and pipeline stalls

Exam domains covered

GPU Acceleration & Distributed Training · Model Deployment & Inference Optimization

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Nsight Systems · nsys · NVTX · Profiling · PyTorch · Bottleneck · CUDA · Performance

What you'll capture in this Nsight Systems lab

Most PyTorch training jobs leave 30-50% of GPU time on the floor, and the reason is almost never the model — it's a launch gap, a missed overlap, or a CPU dispatcher that can't keep up. Nsight Systems is how senior engineers find those gaps in minutes instead of weeks. You'll run the full profile-then-fix loop end to end: an instrumented training script with NVTX phase markers, a real .nsys-rep capture, a parsed NVTX summary that names the bottleneck, a targeted fix, and a measured speedup against the original. By the end you'll have an nsys workflow you can paste into a CI job (capture → parse → regression-test) and the mental model that separates CPU-launch-bound from compute-bound from memory-bandwidth-bound workloads. About 35 minutes on a real NVIDIA GPU pod we provision — nsys, the CUDA toolkit, and PyTorch are preinstalled, and you can download the .nsys-rep to open in the Nsight Systems desktop UI.

The technical core is the capture command and what it actually traces: nsys profile --trace=cuda,nvtx,osrt --sample=cpu --stats=true --output=profile.nsys-rep python workload.py. cuda catches cudaLaunchKernel / cudaMemcpyAsync and per-stream kernel execution; nvtx picks up the torch.cuda.nvtx.range_push('forward') markers you injected (and the ones PyTorch emits internally for torch ops); osrt traces OS runtime thread activity so you can see CPU workers blocking on pthread_cond_wait. NVTX is the piece that turns an Nsight trace from anonymous sm80_xmma_gemm_... kernel names into a readable timeline of data_load, forward, backward, optimizer. The headline insight: Nsight shows you the gaps between kernels — stretches where the GPU is idle because the CPU hasn't dispatched the next launch, or because batch N+1 wasn't ready when batch N finished. That's the view PyTorch Profiler structurally cannot give you.
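The instrumentation side can be sketched in a few lines. This is a minimal example, not the lab's exact script: the phase names (data_load, forward, backward, optimizer) are the ones used above, while the nvtx_range helper is our own convenience wrapper around torch.cuda.nvtx that no-ops when PyTorch isn't installed, so the structure is visible even without a GPU.

```python
from contextlib import contextmanager

try:
    from torch.cuda import nvtx  # real NVTX bindings when PyTorch is installed
except ImportError:
    nvtx = None  # lets the sketch run (unlabeled) without torch


@contextmanager
def nvtx_range(name):
    """Wrap a training phase in a named NVTX range so nsys can label it."""
    if nvtx is not None:
        nvtx.range_push(name)
    try:
        yield
    finally:
        if nvtx is not None:
            nvtx.range_pop()


def train_step():
    # Each phase appears as a named NVTX range on the Nsight timeline.
    with nvtx_range("data_load"):
        pass  # fetch the batch / copy host-to-device here
    with nvtx_range("forward"):
        pass  # loss = model(batch)
    with nvtx_range("backward"):
        pass  # loss.backward()
    with nvtx_range("optimizer"):
        pass  # optimizer.step(); optimizer.zero_grad()
```

The try/finally matters: an exception inside a phase must still pop the range, or every later NVTX label on that thread is mis-nested.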

The second insight most engineers miss: the lab's fix (pre-allocating data on GPU) is a teaching device, not a production pattern. Real datasets don't fit in VRAM. The right production fix is async DataLoader with num_workers>0, pin_memory=True, non_blocking=True transfers, or a DALI pipeline — so batch N+1 loads while batch N computes. The reflection forces you to describe what overlap actually looks like on the Nsight timeline: data_load on a CPU worker thread running concurrently with forward/backward kernels on the CUDA stream, not serially before them. That's how you verify from a profile whether the fix worked.

Prereqs: PyTorch training-loop basics, subprocess.run comfort in Python, a working mental model of CPU-GPU pipeline stalls. Preinstalled: nsys CLI, CUDA toolkit, PyTorch, JupyterLab. Grading checks real artifacts: the .nsys-rep must be >0.5 MB (anything smaller means profiling didn't actually run), cuda_api_summary must contain real CUDA API names like cudaLaunchKernel or cudaMemcpy, top_kernels must have data_load ranked in the top 2, and your post-fix step time must be at least 1.1× faster than the baseline — proving the profile-driven diagnosis was correct.
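The two numeric gates are simple enough to sketch directly. The thresholds below are the ones quoted above; the file path and the example step times are illustrative, not values from the lab.

```python
import os


def report_is_real(path, min_mb=0.5):
    """A .nsys-rep under ~0.5 MB usually means profiling never actually ran."""
    return os.path.exists(path) and os.path.getsize(path) > min_mb * 1024 * 1024


def speedup(before_avg_step_ms, after_avg_step_ms, threshold=1.1):
    """Post-fix step time must be at least `threshold`x faster than baseline."""
    ratio = before_avg_step_ms / after_avg_step_ms
    return ratio, ratio >= threshold


# Illustrative numbers for a data-load-bound run:
ratio, passed = speedup(before_avg_step_ms=190.0, after_avg_step_ms=52.0)
```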

Frequently asked questions

What's the difference between Nsight Systems and PyTorch Profiler?

PyTorch Profiler operates at the torch-op level — aten::conv2d, Optimizer.step — and its main outputs are the key_averages() table and a Chrome trace of torch ops. Nsight Systems operates a layer below that: it traces the CUDA API (cudaLaunchKernel, cudaMemcpy), CUDA stream activity, OS threads, and your NVTX ranges, on a unified timeline. Use PyTorch Profiler to answer 'which op is expensive'; use Nsight Systems to answer 'why is my GPU idle between ops'.

Why NVTX ranges instead of just reading the raw kernel names?

Raw kernel names are anonymous — sm80_xmma_gemm_f16f16_f16f32_... tells you it's a GEMM but not which layer in your model. NVTX markers (torch.cuda.nvtx.range_push('forward')) inject human-readable labels into the CUDA stream, so nsys stats --report=nvtx_sum groups kernels by phase and you immediately see data_load: 140 ms, forward: 22 ms, backward: 31 ms. PyTorch also emits NVTX markers internally for torch ops when the profiler is active, which is why --trace=cuda,nvtx,osrt is the canonical trace set.
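Pulling the bottleneck out of that CSV is a few lines of stdlib Python. One caveat: nsys column headers vary somewhat across versions; this sketch assumes a 'Range' name column and a 'Total Time (ns)' column, and the sample rows below are invented to match the timings quoted above.

```python
import csv
import io


def top_nvtx_ranges(csv_text, n=3):
    """Rank NVTX ranges by total time from
    `nsys stats --report=nvtx_sum --format=csv` output."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Strip thousands separators some nsys versions emit before parsing.
    total = lambda r: int(r["Total Time (ns)"].replace(",", ""))
    rows.sort(key=total, reverse=True)
    return [(r["Range"], total(r)) for r in rows[:n]]


# Invented sample matching a data-load-bound run (140 ms / 31 ms / 22 ms):
sample = """Time (%),Total Time (ns),Range
70.1,140000000,data_load
15.5,31000000,backward
11.0,22000000,forward
3.4,7000000,optimizer
"""
```

With this, the Step 3 grading check ("data_load in the top 2") is a one-line assertion over `top_nvtx_ranges(sample, 2)`.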

Does the lab's fix (pre-allocating data on GPU) work in production?

Absolutely not — it's a teaching device. Pre-loading the full dataset into VRAM is fine on synthetic toy data but fails instantly on ImageNet or any real corpus that doesn't fit in GPU memory. The reflection question at the end asks you to design the real production fix: async DataLoader with num_workers>0, pin_memory=True, non_blocking=True transfers, or a DALI pipeline — so batch N+1 loads while batch N computes. The point of the lab's fix is to prove, with a clean before/after measurement, that the bottleneck was what the profile said it was.

How do I tell from an Nsight timeline whether my CPU-GPU pipeline is overlapped?

On the timeline view, you should see data_load activity on a CPU worker thread running concurrently with forward/backward kernels on the CUDA stream — not before them. If the CUDA stream has a gap that's exactly aligned with a data_load NVTX range on the CPU side, your pipeline is serial. The NVTX ranges are what make this visually obvious; without them, you're squinting at anonymous CPU-thread samples and trying to correlate by wall-clock.

Can I use nsys without opening the desktop Nsight Systems UI?

Yes — that's exactly what Steps 2 and 3 do. nsys profile --stats=true prints a CUDA API summary and kernel summary directly to stdout, and nsys stats --report=nvtx_sum --format=csv emits parseable CSV. You can build the whole automated-profile pipeline (capture, parse, surface the bottleneck, regression-test) from a Python script, which is what a real CI job would do. The GUI adds the pretty interactive timeline, but all the data is in the .nsys-rep file.
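A CI-style driver for that pipeline can be sketched as below. The flags mirror the capture command shown earlier; the script name and output path are illustrative, and we only build and run the argument lists here — actually executing them requires a pod with nsys and a GPU.

```python
import subprocess


def capture_cmd(script="workload.py", out="profile.nsys-rep"):
    """Build the nsys capture command from earlier in this page."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx,osrt",
        "--sample=cpu",
        "--stats=true",
        f"--output={out}",
        "python", script,
    ]


def stats_cmd(report="profile.nsys-rep"):
    """Build the command that emits the NVTX summary as parseable CSV."""
    return ["nsys", "stats", "--report=nvtx_sum", "--format=csv", report]


def run(cmd):
    # check=True turns a failed capture into a loud CI failure.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Chain `run(capture_cmd())`, then feed `run(stats_cmd())` into a CSV parser, and you have the capture → parse → regression-test loop with no GUI involved.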

What exactly does each step check?

Step 1 verifies nsys_version contains NVIDIA/Nsight branding and that your workload script exists with NVTX markers and a CUDA import. Step 2 requires the .nsys-rep report exists and is >0.5 MB and that the CUDA API summary mentions cudaLaunchKernel/cudaMemcpy. Step 3 parses top_kernels, checks at least 3 of the canonical NVTX ranges are present, and asserts data_load ranks in the top 2. Step 4 computes before_avg_step_ms / after_avg_step_ms and requires the speedup to exceed 1.1×.