GPU-Accelerated Data Science with RAPIDS
GPU sandbox · jupyter
Beta

Rewrite a pandas + sklearn data-science pipeline on GPU using cuDF and cuML, benchmark each stage against the CPU baseline, and run an end-to-end filter -> feature-engineer -> predict pipeline that never leaves the GPU.

40 min · 4 steps · 2 domains · Intermediate · ncp-adsnca-genl

What you'll learn

  1. cuDF vs pandas on a realistic groupby
  2. cuML KMeans clustering
  3. cuML RandomForest vs sklearn RandomForest
  4. End-to-end GPU pipeline

Prerequisites

  • Comfortable with pandas DataFrames
  • Familiarity with scikit-learn model training basics
  • Understanding of groupby / feature engineering

Exam domains covered

GPU Acceleration & Distributed Training · Data Engineering & Preparation

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

RAPIDS · cuDF · cuML · pandas · sklearn · KMeans · RandomForest · GPU Data Science

What you'll build in this RAPIDS / cuDF / cuML lab

RAPIDS is how data scientists stop throwing work over the wall to ML engineers: cuDF and cuML keep the pandas + sklearn APIs you already know but run them on the GPU, which means the same feature engineering that ends up feeding a PyTorch model can live on the device the model will train on, with no round-trip. In about 40 minutes on a real NVIDIA GPU we provision, you'll port a pandas groupby and a sklearn RandomForest workflow to cuDF and cuML, measure the speedup yourself at each stage (typically 10-100x on a million-row groupby, narrower on classification), and walk away with a working mental model of when GPU data science actually pays off, plus the specific anti-pattern (.to_pandas() mid-pipeline) that silently kills the win.

The substance is the cost model of GPU data science. You'll build a 1M+-row synthetic e-commerce events table, run the identical groupby-aggregate in pandas and cuDF with the same API surface, then fit cuml.KMeans on features materialised directly from the cuDF DataFrame (no pandas detour), and train a cuml.RandomForestClassifier alongside sklearn on a 200k-row subset so you can compare both wall-clock times and accuracies, which should agree within the expected ~15-point bound (same algorithm, but cuML uses histogram-based GPU split-finding rather than sklearn's exact splits).

The deliberately tricky step is the end-to-end pipeline. A ~3% refund rate means a default-threshold RandomForest often predicts all zeros and still scores ~97%: that's the accuracy-optimal constant predictor given the imbalance, not a broken model, and recognising it is the exact gap between an ML engineer who reads metrics and one who fixes pipelines (threshold tuning, class weights, PR-AUC or recall@k as the primary metric). The grader enforces isinstance(result_df, cudf.DataFrame) at the end specifically to catch the .to_pandas() anti-pattern that shows up in almost every RAPIDS pipeline review: PCIe round-trips dwarf compute, and a single silent conversion can eat a 50x speedup without changing any numbers.
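A minimal sketch of the Step 1 comparison. The schema and helper names (`make_events`, `bench_groupby`) are illustrative, not the lab's actual code, and the cuDF branch runs only where RAPIDS is installed; everywhere else it falls back to the pandas timing alone:

```python
import time

import numpy as np
import pandas as pd

try:
    import cudf  # RAPIDS GPU DataFrame; needs an NVIDIA GPU + CUDA
except ImportError:
    cudf = None

def make_events(n=1_000_000, seed=0):
    """Synthetic e-commerce events table (illustrative schema)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "user_id": rng.integers(0, 50_000, n),
        "category": rng.integers(0, 20, n),
        "price": rng.uniform(1, 500, n).round(2),
    })

def bench_groupby(df):
    """The identical call works on a pandas or a cuDF DataFrame."""
    t0 = time.perf_counter()
    out = df.groupby("category").agg({"price": ["mean", "sum"],
                                      "user_id": "nunique"})
    return out, time.perf_counter() - t0

events = make_events()
cpu_out, cpu_s = bench_groupby(events)
print(f"pandas: {cpu_s:.3f}s over {len(cpu_out)} groups")

if cudf is not None:
    gdf = cudf.from_pandas(events)           # one host->device copy, up front
    _ = bench_groupby(gdf)                   # warm-up: first call pays alloc/JIT cost
    gpu_out, gpu_s = bench_groupby(gdf)
    print(f"cuDF: {gpu_s:.3f}s -> {cpu_s / gpu_s:.1f}x speedup")
```

Note the warm-up call: timing the very first cuDF operation mixes one-off initialisation cost into the measurement and understates the steady-state speedup.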

Prerequisites: comfort with pandas DataFrames, sklearn basics, and groupby-style feature engineering. The sandbox has cuDF, cuML, pandas, sklearn, and cupy preinstalled, with CUDA versions matched to RAPIDS. If you've searched for "pandas to cuDF migration", "cuML vs sklearn RandomForest", "RAPIDS GPU speedup benchmark", "when to use cuDF vs Polars", or "GPU data science pipeline", the lab answers each with real timings and the specific failure modes you'll hit. The same pipeline code scales to tens of millions of rows on a single A100/H100 before you'd need to reach for Dask-cuDF.

Frequently asked questions

How much speedup should I actually expect from cuDF vs pandas?

Workload-dependent, but on a groupby over a million-row table you typically see 10-100x. The gap widens with larger tables and aggregations that benefit from GPU parallelism (joins, wide groupbys, string operations) and narrows on small data where PCIe transfer cost dominates or on operations cuDF hasn't specialized. The lab prints your measured speedup at the end of Step 1 — treat it as a datapoint, not a guarantee, and always benchmark on your own workload before claiming a production win.

Can I swap cuDF for Polars — they're both faster than pandas?

Different tools for different problems. Polars is a CPU-multithreaded DataFrame library with a lazy query planner — brilliant for wrangling data that fits in RAM on a beefy CPU box. cuDF is a GPU DataFrame that compiles a subset of pandas operations onto CUDA — the advantage is decisive when your data lives next to a GPU model and you want zero-copy hand-off to cuML, cuGraph, or PyTorch. If your pipeline ends at a CSV, Polars often wins. If it ends at a GPU model, cuDF keeps the data on the device where it belongs.

Why does the lab care so much about avoiding .to_pandas() mid-pipeline?

Because every round-trip is a PCIe copy — gigabytes moving from GPU VRAM to system RAM and back. It's the single most common way RAPIDS pipelines silently lose their speedup. Step 4's grader checks isinstance(result_df, cudf.DataFrame) specifically because a pipeline that runs filter on GPU → .to_pandas() for feature engineering → back to GPU for prediction will produce correct numbers and terrible throughput. Once the data is on the device, keep it there until the very end.
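A sketch of the pattern Step 4 grades (the function, columns, and threshold here are hypothetical, not the lab's actual pipeline). The import alias lets the same structure run against pandas where cuDF is absent; the point is that nothing inside the pipeline ever calls .to_pandas():

```python
try:
    import cudf as xdf   # on the lab's GPU pod, everything below stays in VRAM
except ImportError:
    import pandas as xdf  # CPU fallback so the structure runs anywhere

def pipeline(events):
    """filter -> feature-engineer -> model-ready features, with no .to_pandas()."""
    recent = events[events["price"] > 10]         # filter on device
    feats = recent.groupby("user_id").agg(        # feature-engineer on device
        n_events=("price", "count"),
        total_spend=("price", "sum"),
    ).reset_index()
    feats["avg_spend"] = feats["total_spend"] / feats["n_events"]
    return feats                                  # hand straight to cuML, still on device

events = xdf.DataFrame({"user_id": [1, 1, 2, 2],
                        "price": [5.0, 20.0, 30.0, 45.0]})
result_df = pipeline(events)
assert isinstance(result_df, xdf.DataFrame)       # the check Step 4's grader makes
print(result_df)
```

If you genuinely need a CPU-side copy (say, for matplotlib), take it once at the very end, after every filter, join, and predict has run on the device.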

Why does cuML RandomForest's accuracy differ slightly from sklearn's?

Same algorithm, different implementation — cuML builds trees with a histogram-based GPU algorithm that's numerically close to but not identical to sklearn's exact split-finding. On clean synthetic data they match within a percent or two; on messy real data they can diverge more. The 15-point bound in Step 3 is a sanity check that you're not comparing apples to oranges (wrong features, wrong train/test split) rather than an expectation of exact parity.
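The shape of that Step 3 comparison, sketched with synthetic data (the hyperparameters are illustrative; the lab's dataset and settings differ). The cuML branch is guarded so the sklearn half runs on any machine:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X = X.astype(np.float32)   # cuML prefers float32 input
y = y.astype(np.int32)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

cpu = RandomForestClassifier(n_estimators=100, max_depth=16, random_state=0)
cpu.fit(Xtr, ytr)
cpu_acc = accuracy_score(yte, cpu.predict(Xte))
print(f"sklearn accuracy: {cpu_acc:.3f}")

try:
    from cuml.ensemble import RandomForestClassifier as cuRF
    gpu = cuRF(n_estimators=100, max_depth=16, n_bins=128)
    gpu.fit(Xtr, ytr)
    gpu_acc = accuracy_score(yte, gpu.predict(Xte))
    # Histogram-based GPU splits: close to sklearn's exact splits, never bit-identical.
    print(f"cuML accuracy: {gpu_acc:.3f}, gap: {abs(cpu_acc - gpu_acc):.3f}")
except ImportError:
    pass  # no cuML/GPU in this environment
```

On clean data like this the gap is typically a point or two; a gap approaching the 15-point bound usually means the two models were fed different features or splits.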

The model predicted all zeros on the refund data — is it broken?

Probably not. With a ~3% positive rate and the default 0.5 threshold, predicting the majority class is the Bayes-optimal constant predictor for maximizing accuracy — the model hasn't failed, the metric has. Fix it at the pipeline level: tune the decision threshold, use class_weight='balanced', oversample the minority class with SMOTE, or switch the primary metric to PR-AUC or recall@k. The reflection step pushes you to name the right fix rather than staring at the accuracy number.
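The effect is easy to reproduce without any model. Below, the `scores` array is a synthetic stand-in for RandomForest refund probabilities (the distributions are invented for illustration): a constant-zero predictor already scores ~97% accuracy, while lowering the decision threshold trades precision for the recall that actually matters:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
y = (rng.random(n) < 0.03).astype(int)   # ~3% refund rate

# Stand-in for model probabilities: positives score higher, but noisily.
scores = np.clip(0.04 + 0.40 * y + rng.normal(0, 0.08, n), 0, 1)

acc_constant = (y == 0).mean()           # the "predict all zeros" baseline
print(f"constant-zero accuracy: {acc_constant:.3f}")

def recall_precision(thresh):
    pred = scores >= thresh
    tp = (pred & (y == 1)).sum()
    recall = tp / max(y.sum(), 1)
    precision = tp / max(pred.sum(), 1)
    return recall, precision

for t in (0.5, 0.2, 0.1):
    r, p = recall_precision(t)
    print(f"threshold {t:.1f}: recall={r:.2f} precision={p:.2f}")
```

At the default 0.5 threshold recall is dismal even though accuracy looks great; at a lower threshold recall jumps while precision drops, which is exactly the trade-off that threshold tuning, class weighting, or a PR-AUC objective makes explicit instead of hiding behind the accuracy number.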

Do I need a specific GPU to run this lab?

No — the lab provisions a real NVIDIA GPU pod on demand with RAPIDS preinstalled against a compatible CUDA version. You only need a browser. For your own machine, RAPIDS documents the supported SKUs and CUDA matrix — most Pascal-or-newer consumer and datacenter cards work, and the RAPIDS install instructions pin the right conda/pip channel for your driver version.