NVIDIA DALI: GPU-Accelerated Data Pipelines
GPU sandbox · jupyter
Beta


Move image decoding, resizing, and augmentation from CPU to GPU with NVIDIA DALI, and benchmark it against a standard PyTorch DataLoader. The input-pipeline fix that unlocks real multi-GPU throughput.

30 min · 4 steps · 2 domains · Intermediate · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Baseline: PyTorch DataLoader + torchvision transforms
  2. Build a DALI pipeline (GPU decoding + augmentation)
  3. Benchmark DALI against the PyTorch baseline
  4. When is DALI worth the engineering cost?

Prerequisites

  • Comfortable with PyTorch DataLoader and torchvision transforms
  • Basic understanding of JPEG decoding and image augmentation
  • CUDA-capable GPU with a working PyTorch installation

Exam domains covered

GPU Acceleration & Distributed Training · GPU Infrastructure & Operations

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

DALI · nvJPEG · Data Pipeline · Throughput · PyTorch · Augmentation · Benchmarking

What you'll build in this NVIDIA DALI lab

Input-pipeline starvation is the silent killer of GPU utilization — the moment you scale to 4+ GPUs, the same CPU-side JPEG decode that looked fine on one card collapses into 40-70% idle SMs while your DataLoader workers max out CPU cores. DALI is the fix engineers reach for when nvidia-smi says the GPU is bored and the CPU is on fire. You'll walk away with a working GPU-native preprocessing pipeline built on nvJPEG + CUDA augmentations, an honest before/after throughput delta (typically 1.3-3× on a single card, 3-10× once multiple GPUs share one host), a DALIGenericIterator you can drop into an existing PyTorch training loop, and — critically — the judgment to know when DALI is overkill versus a 10× win. About 30 minutes on a live NVIDIA GPU pod we hand you; PyTorch, DALI, and nvJPEG are already installed.

The substance is the pipeline wiring: @pipeline_def(batch_size=32, num_threads=4, device_id=0) with fn.readers.file → fn.decoders.image(device='mixed') → fn.resize → fn.random_resized_crop → fn.crop_mirror_normalize(output_layout='CHW'). device='mixed' is the magic word — it routes JPEG decode to nvJPEG's dedicated hardware decoders and keeps every downstream op on the GPU, so there's no H2D copy of raw pixels and no CPU-side PIL bottleneck. The trap most teams hit: DALI pays pipeline-prefill cost on the first batch (CUDA kernel warmup, async prefetcher kickoff), so if you skip warmup your benchmark makes DALI look slower than torchvision. The other trap: DALI only helps data-pipeline-bound workloads — vision training on A100s wins big, but BERT fine-tuning (tokenization is cheap, there's no preprocessing) sees zero benefit. You'll measure both and build a mental model for where the advantage actually transfers.
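The wiring above can be sketched as a minimal DALI pipeline. This is illustrative, not the lab's exact solution: the dataset path, crop size, ImageNet normalization stats, and the reader name "Reader" are assumptions.

```python
# Minimal sketch of a GPU-decode pipeline; paths and stats are placeholders.
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def training_pipeline(dataset_dir):
    # Reader runs on CPU and emits encoded JPEG bytes plus integer labels.
    jpegs, labels = fn.readers.file(
        file_root=dataset_dir, random_shuffle=True, name="Reader"
    )
    # device="mixed": host reads the bytes, nvJPEG decodes on the GPU,
    # so the decoded pixels are born in device memory (no H2D copy of raw RGB).
    images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
    # Downstream ops stay on the GPU because their input is already there.
    images = fn.random_resized_crop(images, size=[224, 224])
    images = fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="CHW",
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        mirror=fn.random.coin_flip(),   # random horizontal flip, fused into the op
    )
    return images, labels

pipe = training_pipeline("dataset_dir")
pipe.build()
```

Note that `random_resized_crop` already subsumes a separate resize for training-style augmentation; a plain `fn.resize` would be the eval-path equivalent.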

Prereqs: PyTorch DataLoader and torchvision transforms, a rough sense of JPEG decoding and augmentation pipelines, a CUDA-capable GPU. Preinstalled: PyTorch, NVIDIA DALI, nvJPEG bindings, JupyterLab. The grader enforces real behavior rather than code-shape matching — the dataset must have ≥100 JPEGs under dataset_dir, the first DALI batch must arrive on CUDA as a 4D tensor with batch size ≥16, dali_throughput must exceed pytorch_throughput by at least 1.1×, and you have to produce substantive where_dali_wins / where_dali_loses lists that show you understand the modality dependency.

Frequently asked questions

Is DALI worth it if I'm not training on ImageNet?

Depends on whether your pipeline is CPU-bound. If nvidia-smi shows sustained sub-70% GPU utilization while your DataLoader workers are pegged, you're starved and DALI helps. If you're fine-tuning BERT where the 'preprocessing' is tokenization — almost no CPU work — DALI gives you nothing. The Step 4 reflection asks for this analysis explicitly: define the acceptance metric (images/sec at fixed batch size, or wall-clock epoch time) before you integrate, so you can tell afterward whether it paid off.

What's nvJPEG and why does it matter?

nvJPEG is NVIDIA's GPU-accelerated JPEG decoder — it reads compressed bytes and produces decoded RGB tensors directly in GPU memory, bypassing libjpeg-turbo on the CPU. Dedicated hardware decoders on newer GPUs make this dramatically faster than a CPU decode + H2D copy, and it's what fn.decoders.image with device='mixed' triggers in a DALI pipeline. The Step 2 pipeline uses it implicitly; you're not wiring it by hand, but it's the piece doing the heavy lifting.

Why does the PyTorch DataLoader look slow in this lab?

Because the Step 1 baseline intentionally uses the simplest possible configuration — one worker, synchronous CPU transforms — so the comparison is clean. A fully-tuned PyTorch DataLoader with num_workers=8, pin_memory=True, non_blocking=True, prefetch_factor=2 closes most of the gap for many workloads. The point of the lab isn't 'DALI > DataLoader' unconditionally, it's measuring the delta on a controlled workload and then thinking carefully about where the advantage transfers.
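For reference, the tuned configuration that paragraph describes looks roughly like this. It is a configuration sketch, not the lab's baseline; `dataset` stands in for your existing torchvision-style Dataset.

```python
# A tuned CPU-side baseline: parallel workers + pinned memory + prefetch.
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # parallel CPU decode/augment processes
    pin_memory=True,          # page-locked host buffers for faster H2D copies
    prefetch_factor=2,        # batches prefetched per worker
    persistent_workers=True,  # keep workers alive across epochs
)

# In the training loop, pair pin_memory with async copies:
#   images = images.to("cuda", non_blocking=True)
```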

What's the warmup step about in Step 3?

DALI pays pipeline-prefill latency on the very first batch — building the execution graph, warming the CUDA kernels for each op, kicking off the async prefetcher. If you time starting from iteration zero, that prefill amortizes badly over a tiny test set and DALI looks slower than it is. One full warmup iteration, then time the real epochs — exactly the same discipline you'd apply to any async GPU benchmark, including the PyTorch baseline.
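The warmup discipline is pipeline-agnostic, so it can be shown with plain Python. The helper names (`benchmark`, `fake_loader`) are hypothetical; `loader_factory` stands in for either the DALI iterator or the PyTorch DataLoader, and both get identical treatment.

```python
# Warmup-then-time discipline: consume N batches before starting the clock
# so one-time prefill cost doesn't pollute the steady-state measurement.
import time

def benchmark(loader_factory, batch_size, warmup_iters=1):
    it = iter(loader_factory())
    for _ in range(warmup_iters):       # absorb prefill / kernel warmup
        next(it)
    n_batches = 0
    start = time.perf_counter()
    for _ in it:                        # time only steady-state iterations
        n_batches += 1
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed  # images/sec

# Dummy loader: 10 "batches", the first artificially slow (simulated prefill).
def fake_loader():
    for i in range(10):
        time.sleep(0.2 if i == 0 else 0.005)
        yield object()

print(f"{benchmark(fake_loader, batch_size=32):.0f} images/sec")
```

On a real GPU you would also call `torch.cuda.synchronize()` before reading the clock, since both DALI ops and CUDA-side transforms execute asynchronously.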

Can I mix DALI with my existing PyTorch training loop?

Yes, and that's the whole point of DALIGenericIterator (or DALIClassificationIterator for classification, or the PyTorch Lightning adapters). It yields dicts/lists of CUDA tensors that drop directly into your existing forward pass — you replace the DataLoader, nothing else. For multi-GPU you'd pass num_shards and shard_id to the DALI file reader, matching your DDP world size and rank, so each process reads its own shard of the dataset.
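The swap can be sketched as follows. This assumes `pipe` is a built pipeline whose file reader was given `name="Reader"`, and that `model`, `optimizer`, and `loss_fn` are your existing training objects; none of that is wired up here.

```python
# Sketch: replacing a DataLoader with DALIGenericIterator in an existing loop.
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

train_iter = DALIGenericIterator(
    pipe,
    output_map=["data", "label"],            # names for the pipeline's two outputs
    reader_name="Reader",                    # lets DALI size epochs from the reader
    last_batch_policy=LastBatchPolicy.PARTIAL,
)

for batch in train_iter:                     # one dict per pipeline (per GPU)
    images = batch[0]["data"]                # already a CUDA tensor, CHW, normalized
    labels = batch[0]["label"].squeeze(-1).long()
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```

The forward/backward lines are untouched relative to a DataLoader loop; only the iteration source and the batch unpacking change.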

What does the grader actually check?

Step 1 walks dataset_dir recursively for ≥100 JPEG/JPG files and verifies pytorch_throughput > 0. Step 2 confirms the DALI iterator was built and first_batch is a 4D CUDA tensor with batch size ≥16. Step 3 enforces dali_throughput > pytorch_throughput * 1.1, a modest floor that confirms both pipelines ran on the same data and DALI delivered a measurable win. Step 4 requires where_dali_wins with ≥3 scenarios and where_dali_loses with ≥2 anti-cases as strings.