Hands-on labs for LLMs, RAG & agents. Real GPUs.
Fine-tune LLMs with LoRA, ship RAG pipelines on NVIDIA NIM, build agentic systems, and profile CUDA kernels — on real GPU sandboxes and hosted environments. Auto-graded against live output. No simulators.
Start here
Build a ReAct Agent with NVIDIA NIM
Build a complete reasoning + acting agent from scratch using LangChain, LangGraph, and NeMo Agent Toolkit — the three pillars of the NCP-AAI exam.
Model Context Protocol (MCP): Build a Tool Server
Build a Model Context Protocol server that exposes your company's tools and data — then connect a LangChain agent to it. Learn how MCP decouples tools from agents, when to use MCP vs Anthropic Skills vs native @tool, and why MCP is the emerging standard for AI tool interop.
Build a RAG Pipeline with NVIDIA NIM
Build a complete Retrieval-Augmented Generation pipeline — from document chunking to vector search to an agent that answers questions from your knowledge base.
Agentic AI
Agent-to-Agent (A2A) Communication
Build two independent agents that talk to each other via the A2A protocol — each owned by a different team, running in its own process, discovered through a standardized AgentCard. Learn how A2A differs from multi-agent orchestration and when each architecture fits.
Agent Memory & Persistence
Build a sales intelligence assistant that remembers — short-term conversation state with LangGraph checkpointer, long-term facts in Milvus, and reflection loops that auto-extract knowledge. Learn the memory architecture every production agent needs.
Agent Patterns: ReAct vs Tool Calling vs Plan-and-Execute
Build the same SaaS customer support agent three different ways — ReAct, direct tool calling, and plan-and-execute — then compare them on speed, reasoning quality, and reliability to learn when to use each pattern in production.
Model Context Protocol (MCP): Build a Tool Server
Build a Model Context Protocol server that exposes your company's tools and data — then connect a LangChain agent to it. Learn how MCP decouples tools from agents, when to use MCP vs Anthropic Skills vs native @tool, and why MCP is the emerging standard for AI tool interop.
Multi-Agent Orchestration with LangGraph
Build a supervisor agent that routes queries to specialist agents — a core architecture pattern tested on the NCP-AAI exam.
Build a RAG Pipeline with NVIDIA NIM
Build a complete Retrieval-Augmented Generation pipeline — from document chunking to vector search to an agent that answers questions from your knowledge base.
Build a ReAct Agent with NVIDIA NIM
Build a complete reasoning + acting agent from scratch using LangChain, LangGraph, and NeMo Agent Toolkit — the three pillars of the NCP-AAI exam.
Safety & Guardrails for AI Agents
Build a guarded IT support agent that blocks jailbreaks, refuses off-topic questions, and safely handles IT queries — using keyword checks, LLM-based validation, and NeMo Guardrails.
Evaluate an Agent with LLM-as-Judge
Build an eval harness that scores agent responses automatically — correctness via a reference-based judge, plus an accuracy metric and A/B comparison. Same pattern used by NeMo Evaluator for production agent evaluation.
Model Routing & Cost Cascade with NIM
Save 60–80% on inference by cascading queries through cheap → mid → expensive NIM models. Measure real costs via NIM's usage.cost field and compare against an always-large baseline.
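The cascade idea can be sketched in a few lines: try the cheap model first and escalate only when a confidence check fails. A minimal sketch — model names, per-request prices, and the hedging-phrase heuristic are illustrative placeholders, not NIM's actual catalog or API:

```python
# Cost cascade sketch: cheap -> mid -> expensive, escalating on low confidence.
# Tier names and prices are made up for illustration.
TIERS = [
    ("small-8b",   0.0002),   # hypothetical $ per request
    ("mid-70b",    0.0020),
    ("large-405b", 0.0120),
]

def confident(answer: str) -> bool:
    """Toy heuristic: escalate on empty output or hedging phrases."""
    hedges = ("i'm not sure", "i don't know", "cannot determine")
    return bool(answer) and not any(h in answer.lower() for h in hedges)

def cascade(query, call_model):
    """call_model(model_name, query) -> answer string. Returns (answer, model, cost)."""
    spent = 0.0
    for name, price in TIERS:
        spent += price
        answer = call_model(name, query)
        if confident(answer):
            return answer, name, spent
    return answer, name, spent  # last tier's answer, even if still unsure

# Fake backend: only the large model "knows" the hard answer.
def fake_call(model, query):
    return "42" if model == "large-405b" or "easy" in query else "I'm not sure"

print(cascade("easy question", fake_call))   # answered by small-8b
print(cascade("hard question", fake_call))   # escalates to large-405b
```

The real lab replaces `fake_call` with NIM requests and replaces the toy heuristic with measured cost and quality signals.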
Structured Output & Function Calling with NIM
Get reliable machine-parseable data out of an LLM. Compare prompt-only JSON extraction against the function-calling API, chain two tools, and measure the reliability gap on a real extraction task.
Visual Q&A with NVIDIA VLMs
Send images to a Vision-Language Model via NIM, answer questions about them, extract structured fields from a receipt-style image, and compare two VLMs on the same task — all through the OpenAI-compatible chat endpoint.
Multimodal RAG with NeMo Retriever
Build an image-query RAG system: embed a catalog with NeMo Retriever, translate an uploaded image into a retrieval query via a VLM, and ground the VLM's final answer in the retrieved passages.
LLM serving & inference
Deploy & Serve LLMs in Production
Go from slow single-request inference to production-ready LLM serving with vLLM. Benchmark throughput, tune settings, and learn when to use vLLM vs Triton vs TGI.
Deploy & Serve LLMs in Production (Jupyter)
Go from slow single-request inference to production-ready LLM serving with vLLM. Benchmark throughput, tune settings, and learn when to use vLLM vs Triton vs TGI.
Inference Serving Patterns: Dynamic Batching, Throughput, and the Triton Mental Model
Build a mini-Triton inference server in ~30 lines of Python: a dynamic batcher with max_batch_size and max_queue_delay knobs, load-tested against a naive baseline, swept for the throughput-latency tradeoff, and bridged to a real Triton config.pbtxt.
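The core mechanism is small enough to show: a batch launches when either `max_batch_size` is reached or `max_queue_delay` expires. A toy sketch in that spirit — the knob names mirror Triton's `config.pbtxt` fields, but everything else is a simplified stand-in:

```python
# Toy dynamic batcher: requests queue up, and a batch fires when either
# max_batch_size is hit or max_queue_delay expires (Triton's two core knobs).
import queue
import threading
import time

class DynamicBatcher:
    def __init__(self, model_fn, max_batch_size=8, max_queue_delay=0.005):
        self.model_fn = model_fn            # batched model: list[in] -> list[out]
        self.max_batch_size = max_batch_size
        self.max_queue_delay = max_queue_delay
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def infer(self, x):
        done = threading.Event()
        slot = {"done": done}
        self.q.put((x, slot))
        done.wait()
        return slot["result"]

    def _loop(self):
        while True:
            x, slot = self.q.get()          # block until the first request arrives
            batch, slots = [x], [slot]
            deadline = time.monotonic() + self.max_queue_delay
            while len(batch) < self.max_batch_size:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break                   # delay expired: launch a partial batch
                try:
                    x, slot = self.q.get(timeout=timeout)
                except queue.Empty:
                    break
                batch.append(x)
                slots.append(slot)
            for out, s in zip(self.model_fn(batch), slots):
                s["result"] = out
                s["done"].set()

batcher = DynamicBatcher(lambda xs: [v * 2 for v in xs], max_batch_size=4)
print(batcher.infer(21))   # -> 42
```

Raising `max_queue_delay` trades latency for larger (more GPU-efficient) batches, which is exactly the throughput-latency sweep the lab runs.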
Batch Size & Precision Sweep: Finding Your Sweet Spot
Sweep batch sizes and numerical precisions (fp32, fp16, bf16) on a real model to find the throughput/VRAM knee, then ship a production recommendation with SKU-aware precision picks and an accuracy gate.
vLLM Production Serving: PagedAttention, Continuous Batching, Prefix Caching
Stand up vLLM and measure the three features that make it the de facto inference server: PagedAttention's KV-cache capacity, continuous batching throughput, and prefix caching speedups. Then write the production spec — server args, Kubernetes deployment, monitoring, autoscaling.
Fine-tuning & alignment
Fine-Tune an LLM with LoRA and QLoRA
Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — one of the most in-demand AI skills.
Fine-Tune an LLM with LoRA and QLoRA (Jupyter)
Fine-tune Meta Llama 3 8B on a custom instruction dataset using LoRA and QLoRA. Learn parameter-efficient fine-tuning from data preparation through evaluation — one of the most in-demand AI skills.
Quantize & Optimize LLMs with bitsandbytes
Load a model in fp16, INT8, and NF4, then benchmark the three precisions on VRAM, latency, and output quality. See where quantization wins and where it costs you.
RLHF & DPO Alignment
Run real Direct Preference Optimization on a small language model with TRL's DPOTrainer. Capture a baseline, build a preference dataset, train, and measurably shift the model's behavior in four steps.
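The objective DPOTrainer optimizes is compact enough to compute by hand: the loss rewards the policy for preferring the chosen completion more than the reference model does. A sketch with made-up log-probabilities — in TRL these come from the policy and reference models:

```python
# DPO loss from per-sequence log-probs:
#   L = -log sigmoid(beta * ((log pi(y_w) - log pi_ref(y_w))
#                          - (log pi(y_l) - log pi_ref(y_l))))
# where y_w / y_l are the chosen / rejected completions.
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    chosen_margin = policy_chosen - ref_chosen        # policy's preference shift
    rejected_margin = policy_rejected - ref_rejected  # on each completion vs ref
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1 / (1 + math.exp(-logits)))     # -log sigmoid(logits)

# Policy shifted toward the chosen answer -> loss below log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
# Policy drifted toward the rejected answer -> loss above log(2).
print(dpo_loss(-14.0, -10.0, -12.0, -12.0))
```

At initialization the policy equals the reference, both margins are zero, and the loss sits at exactly log(2) — the baseline the lab's training run pushes down from.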
Stable Diffusion + LoRA
Load Stable Diffusion, attach LoRA adapters to the U-Net's attention layers, run a tiny overfit training loop, and generate with the adapted weights to prove that a few million trainable parameters actually move pixels.
RAG & retrieval
Advanced RAG: Hybrid Search + Cross-Encoder Reranking
Build a production-shape retrieval stack — dense bi-encoder plus from-scratch BM25, fused with Reciprocal Rank Fusion, then re-ordered by a BAAI cross-encoder. The exact architecture behind modern enterprise RAG.
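Reciprocal Rank Fusion itself is a one-liner per document: each doc scores the sum over result lists of 1 / (k + rank), with k = 60 as the conventional constant from the original RRF paper. A sketch with illustrative doc IDs:

```python
# Reciprocal Rank Fusion: merge ranked lists from dense and BM25 retrieval.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists (best first) -> fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # bi-encoder order
bm25  = ["d1", "d9", "d3"]   # lexical order
print(rrf([dense, bm25]))    # d1 and d3 rise: both retrievers agree on them
```

Because RRF uses only ranks, not raw scores, it needs no score normalization between the dense and lexical retrievers — the reason it's the default fusion step in hybrid search.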
Retrieval-Augmented Generation (RAG) Pipeline with Local Models
Build an end-to-end RAG pipeline on a single GPU: BGE embeddings, L2-normalized vector retrieval by dot product, and a local generator that answers with and without retrieved context so you can see exactly what retrieval changes.
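The retrieval step rests on one identity: for L2-normalized vectors, dot product equals cosine similarity, so normalizing once at index time makes scoring a plain dot product. A sketch with toy vectors standing in for real BGE embeddings:

```python
# L2-normalized retrieval: normalize at index time, score by dot product.
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, index, k=2):
    q = l2_normalize(query)
    scored = [(dot(q, doc_vec), doc_id) for doc_id, doc_vec in index]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)[:k]]

# Toy 3-d "embeddings"; doc IDs are illustrative.
index = [(doc_id, l2_normalize(vec)) for doc_id, vec in [
    ("gpu-faq",   [0.9, 0.1, 0.0]),
    ("hr-policy", [0.0, 0.2, 0.9]),
    ("cuda-tips", [0.8, 0.3, 0.1]),
]]
print(top_k([1.0, 0.2, 0.0], index))   # the two GPU-related docs rank first
```

The lab swaps the toy vectors for 1024-d BGE embeddings and the list scan for a batched matrix multiply, but the scoring math is identical.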
Training & pretraining
Continued Pre-Training: Adapt a Pretrained LM to a New Domain
Take GPT-2 and domain-adapt it to Python code in 150 steps, measuring both the gain on code and the cost in catastrophic forgetting on English. The exact recipe behind Code Llama, BloombergGPT, and every domain-specialized LLM of the last three years.
Train a Small Language Model from Scratch
Train a real GPT-style language model from zero on TinyStories: tokenize, wire up the optimizer and LR schedule, run the training loop with validation perplexity, and generate coherent text from your own weights. End-to-end pretraining in minutes on one GPU.
Transformer Architecture Deep Dive
Build every piece of a decoder-only transformer by hand — scaled dot-product attention, multi-head attention, the full block with residuals and LayerNorm, then assemble a tiny GPT and train it. No shortcuts, no pre-built attention modules.
CUDA & kernel optimization
CUDA Programming Fundamentals
Write four real CUDA C++ kernels and run them from PyTorch: vector add, 2D matrix add, tiled matmul with shared memory, and a custom autograd op.
GPU Sharing: Streams, MPS, MIG, and the Real Cost of Contention
Measure four ways to share a single GPU — CUDA streams, multi-process time-slicing, MPS, and MIG — and write the production artifacts (start scripts, k8s device-plugin ConfigMaps, MIG geometries) that turn 15%-utilized fleets into 80%-utilized ones.
Nsight Systems Profiling: Finding the Bottleneck That Costs You 40% of Your GPU
Run the full profile-then-fix loop with NVIDIA Nsight Systems — instrument a training loop with NVTX ranges, capture a .nsys-rep, parse the NVTX summary to pinpoint the bottleneck, then apply a targeted fix and measure the speedup.
Profiling & performance
Profile PyTorch Training with the Built-in Profiler
Instrument a training loop with torch.profiler, read the op-level table, inspect the Chrome/Perfetto timeline, and decide when to reach for Nsight Systems instead.
GPU Cost & Efficiency Audit
Build a four-stage cost-audit pipeline — measure, classify, price, recommend — that turns raw NVML samples into dollar-denominated waste and specific remediation actions. The skeleton behind every enterprise GPU cost product.
Multimodal
Vision-Language Models: Captioning and Visual QA
Load Qwen2-VL, caption a real image, run a battery of visual question-answering prompts, and dissect the architecture — vision encoder, projector, language model — to see exactly how pixels become tokens the LLM can reason over.
Data & pipelines
NVIDIA DALI: GPU-Accelerated Data Pipelines
Move image decoding, resizing, and augmentation from CPU to GPU with NVIDIA DALI, and benchmark it against a standard PyTorch DataLoader. The input-pipeline fix that unlocks real multi-GPU throughput.
Data Preparation for LLM Training
Build a real pretraining/instruction data pipeline: load a raw corpus, apply quality filters, deduplicate, train a BPE tokenizer, and batch-validate on GPU. This is the unglamorous work that actually decides how good your model will be.
Synthetic Data Generation for Model Training
Build a Self-Instruct style synthetic dataset end-to-end: seed instructions, LLM-driven generation, robust parsing, quality filtering, and dedup + diversity scoring. The same pipeline that produced Alpaca, WizardLM, and most modern instruction-tuning corpora.
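The dedup stage can be sketched with token-set Jaccard similarity — the Self-Instruct paper filters on ROUGE-L, and Jaccard is a simpler stand-in for the same idea; the threshold and example instructions here are illustrative:

```python
# Near-duplicate filtering of generated instructions by token-set overlap.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def dedup(instructions, threshold=0.7):
    kept = []
    for cand in instructions:
        # Keep a candidate only if it isn't too similar to anything kept so far.
        if all(jaccard(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept

generated = [
    "Write a poem about the ocean",
    "Write a poem about the ocean waves",   # near-duplicate, dropped
    "Summarize this meeting transcript",
]
print(dedup(generated))
```

Greedy filtering against the kept set is the same shape Self-Instruct uses; the lab layers quality filters and diversity scoring on top.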
GPU infrastructure
GPU Container Lifecycle: Build, Test, Ship, Rollback
Walk through the full lifecycle of a production GPU container — multi-stage Dockerfile, self-hosted GPU CI, a fail-fast smoke test, and a Kubernetes Deployment with readiness probes gated on real GPU compute. The pipeline that stops bad images before users see a 500.
GPU Health Checks + Auto-Remediation
Build a production-grade GPU watchdog: multi-dimensional NVML health probe, rogue-process detection, auto-remediation that kills the offender and verifies recovery, then wire it up with Prometheus alerts and Kubernetes liveness probes.
MLflow Experiment Tracking: From Single Run to Team Workflow
Wire the four load-bearing pieces of MLflow into a real training loop — tracked runs with params and metrics, a registered model with stage transitions, a multi-run sweep + search, and a production spec (server, k8s Job, tags, autolog).
NVIDIA GPU Operator on k3s: Single-Node Kubernetes for GPU Workloads
Bring up a lightweight single-node Kubernetes cluster with the NVIDIA GPU Operator — k3s install, containerd wiring, Helm values, workload manifests with RBAC and ResourceQuota, plus a full runbook (validation plan, troubleshooting matrix, day-2 ops).
Monitoring & ops
GPU Observability: From nvidia-smi to a Production Monitoring Stack
Go from a raw NVML snapshot to a real monitoring pipeline: capture live GPU telemetry during a workload, diagnose a dataloader bottleneck from the utilization trace, and expose everything as a Prometheus /metrics endpoint.
Accelerated data science
GPU-Accelerated Data Science with RAPIDS
Rewrite a pandas + sklearn data-science pipeline on GPU using cuDF and cuML, benchmark each stage against the CPU baseline, and run an end-to-end filter → feature-engineer → predict pipeline that never leaves the GPU.
More labs
Evaluation & Benchmarking LLMs
Four evaluation lenses in one lab: compute real perplexity, expose BLEU's blindness to paraphrase, run side-by-side model comparisons, and build an LLM-as-judge harness with position-bias detection.
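The perplexity lens reduces to one formula: PPL = exp(−mean log p(token)). A sketch with made-up per-token log-probabilities — in the lab these come from a real model's output distribution over the evaluation text:

```python
# Perplexity from per-token log-probabilities: exp of the mean negative log-prob.
import math

def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

confident_model = [-0.1, -0.2, -0.1, -0.3]   # high probability on each token
uncertain_model = [-2.0, -3.0, -2.5, -2.2]
print(perplexity(confident_model))   # close to 1: model barely surprised
print(perplexity(uncertain_model))   # much higher: text looks unlikely to it
```

A model that assigns probability 1 to every token scores exactly 1; higher perplexity means the evaluation text looked less likely to the model.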
Reproducible Training: The Flags, The Cost, The Artifacts
Measure the non-determinism noise floor in default PyTorch, flip every determinism flag until same-seed runs match bit-for-bit, quantify the perf cost, and capture a content-addressable training config that makes a run reproducible forever.
GPU Environment Smoke Test
Validate the GPU lab environment: terminal, file operations, PyTorch, CUDA, and model loading.
Every lab runs on real AI infrastructure.
No video simulators, no canned outputs. Spin up a real GPU, or hook into our hosted stack — either way, you're graded on the metrics you actually produce.