GPU-Accelerated Data Science with RAPIDS
Rewrite a pandas + sklearn data-science pipeline on GPU using cuDF and cuML, benchmark each stage against the CPU baseline, and run an end-to-end filter -> feature-engineer -> predict pipeline that never leaves the GPU.
What you'll learn
- cuDF vs pandas on a realistic groupby
- cuML KMeans clustering
- cuML RandomForest vs sklearn RandomForest
- End-to-end GPU pipeline
Prerequisites
- Comfortable with pandas DataFrames
- Familiarity with scikit-learn model training basics
- Understanding of groupby / feature engineering
Skills & technologies you'll practice
This intermediate-level GPU lab gives you real-world reps across cuDF DataFrame operations, cuML model training, and end-to-end GPU pipeline design.
What you'll build in this RAPIDS / cuDF / cuML lab
RAPIDS is how data scientists stop throwing work over the wall to ML engineers: cuDF and cuML keep the pandas + sklearn API you already know but run it on the GPU, so the same feature engineering that ends up feeding a PyTorch model can live on the device the model will train on, with no round-trip. In about 40 minutes on a real NVIDIA GPU we provision, you'll port a pandas groupby and a sklearn RandomForest workflow to cuDF and cuML, measure the speedup yourself at each stage (typically 10-100x on a million-row groupby, narrower on classification), and walk away with a working mental model of when GPU data science actually pays off, plus the specific anti-pattern (.to_pandas() mid-pipeline) that silently kills the win.
The substance is the cost model of GPU data science. You'll build a 1M+-row synthetic e-commerce events table and run the identical groupby-aggregate in pandas and cuDF with the same API surface. Then you'll fit cuml.KMeans on features materialised directly from the cuDF DataFrame (no pandas detour) and train a cuml.RandomForestClassifier alongside sklearn on a 200k-row subset, so you can compare both wall-clock times and accuracy, which should agree within the expected ~15-point bound (same algorithm, different histogram-based GPU split-finding). The deliberately tricky step is the end-to-end pipeline: with a ~3% refund rate, a default-threshold RF often predicts all zeros and still scores ~97%. That's the best constant predictor given the imbalance, not a broken model, and spotting it is the exact gap between an ML engineer who reads metrics and one who fixes pipelines (threshold tuning, class weights, PR-AUC or recall@k as the primary metric). The grader enforces isinstance(result_df, cudf.DataFrame) at the end specifically to catch the .to_pandas() anti-pattern that shows up in almost every RAPIDS pipeline review: PCIe round-trips dwarf compute, and a single silent conversion can eat a 50x speedup without changing any numbers.
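The pandas-vs-cuDF benchmark described above can be sketched as below. Column names and distributions are illustrative, not the lab's actual dataset; the block falls back to a CPU-only timing when cudf is not installed, so it runs anywhere.

```python
# Sketch: identical groupby-aggregate timed in pandas, then in cuDF.
# Illustrative schema (user_id / price / event), not the lab's actual spec.
import time

import numpy as np
import pandas as pd

N = 1_000_000
rng = np.random.default_rng(0)
pdf = pd.DataFrame({
    "user_id": rng.integers(0, 50_000, N),
    "price": rng.gamma(2.0, 20.0, N),
    "event": rng.choice(["view", "cart", "purchase"], N),
})

def bench(df):
    """Same API surface for pandas and cuDF: dict-style groupby-agg."""
    t0 = time.perf_counter()
    out = df.groupby("user_id").agg({"price": ["sum", "count"]})
    return out, time.perf_counter() - t0

cpu_out, cpu_s = bench(pdf)
print(f"pandas groupby: {cpu_s:.3f}s over {N:,} rows")

try:
    import cudf
    gdf = cudf.from_pandas(pdf)   # one host-to-device copy, then stay on GPU
    gpu_out, gpu_s = bench(gdf)   # identical call, GPU execution
    print(f"cuDF groupby:   {gpu_s:.3f}s  ({cpu_s / gpu_s:.1f}x)")
except ImportError:
    print("cudf not installed; CPU baseline only")
```

Note that the timing deliberately excludes the `from_pandas` copy: in the lab's pipeline the data lands on the GPU once, so per-stage benchmarks measure compute, not transfer.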
Prerequisites: comfort with pandas DataFrames, sklearn basics, and groupby-style feature engineering. The sandbox has cuDF, cuML, pandas, sklearn, and CuPy preinstalled, with CUDA versions matched to the RAPIDS release. Along the way the lab answers the questions people actually search for (pandas-to-cuDF migration, cuML vs sklearn RandomForest, when to use cuDF vs Polars, real GPU speedup numbers) with measured timings and the specific failure modes you'll hit. The same pipeline code scales to tens of millions of rows on a single A100/H100 before you'd reach for Dask-cuDF.
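The filter → feature-engineer → predict shape the grader checks can be sketched as below. All names here are illustrative (not the lab's actual spec), and it is written against pandas so it runs without a GPU; because cuDF mirrors this slice of the pandas API, swapping `pd` for `cudf` runs the same stages on device, and the final `isinstance` assertion is the guard against a mid-pipeline `.to_pandas()`.

```python
# Sketch (hypothetical column names): one DataFrame type from start to finish.
import pandas as pd

def pipeline(events: pd.DataFrame) -> pd.DataFrame:
    # Stage 1: filter — stays in the input frame's library
    purchases = events[events["event"] == "purchase"]
    # Stage 2: feature-engineer — still no library switch
    feats = purchases.groupby("user_id", as_index=False).agg(
        total_spend=("price", "sum"),
        n_purchases=("price", "count"),
    )
    # Stage 3: a model's predict step would consume feats directly
    return feats

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "event": ["view", "purchase", "purchase", "purchase"],
    "price": [0.0, 10.0, 5.0, 7.0],
})
result_df = pipeline(events)
# The grader's invariant: the result never left the original frame type.
# In the lab this is isinstance(result_df, cudf.DataFrame).
assert isinstance(result_df, pd.DataFrame)
print(result_df)
```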
Frequently asked questions
How much speedup should I actually expect from cuDF vs pandas?
Typically 10-100x on a million-row groupby, with a narrower gap on classification workloads. The lab has you benchmark each stage yourself rather than trust headline numbers, so you see where the win actually comes from.
Can I swap cuDF for Polars — they're both faster than pandas?
Why does the lab care so much about avoiding .to_pandas() mid-pipeline?
The grader enforces isinstance(result_df, cudf.DataFrame) specifically because a pipeline that runs filter on GPU → .to_pandas() for feature engineering → back to GPU for prediction will produce correct numbers and terrible throughput. Once the data is on the device, keep it there until the very end.
Why does cuML RandomForest's accuracy differ slightly from sklearn's?
It's the same algorithm, but cuML uses histogram-based split-finding on the GPU rather than sklearn's exact search, so individual trees split differently. Accuracy should still land within the expected bound the lab checks.
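The round-trip anti-pattern can be made concrete with a small sketch. The feature step and column names are hypothetical, and the GPU branch is guarded so the block runs without cudf installed; the point is that each `.to_pandas()` is a device-to-host copy over PCIe and each `cudf.from_pandas` copies back, so a conversion between stages serializes the pipeline on transfers.

```python
# Guarded sketch of the .to_pandas() anti-pattern vs the on-device version.
try:
    import cudf
except ImportError:
    cudf = None  # no GPU stack available; the pattern still reads the same

def engineer(df):
    # Hypothetical feature step: identical code works on pandas and cuDF
    return df.assign(spend_per_event=df["price"] / df["n"])

if cudf is not None:
    gdf = cudf.DataFrame({"price": [10.0, 20.0], "n": [2, 4]})
    # BAD: round-trip through host memory between GPU stages
    slow = cudf.from_pandas(engineer(gdf.to_pandas()))
    # GOOD: stay on device — same API, no PCIe copies
    fast = engineer(gdf)
    assert isinstance(fast, cudf.DataFrame)
```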
The model predicted all zeros on the refund data — is it broken?
No. With a ~3% refund rate, predicting all zeros still scores ~97% accuracy; that's the best constant predictor under the imbalance, not a bug. The fixes are to set class_weight='balanced', oversample the minority class with SMOTE, or switch the primary metric to PR-AUC or recall@k. The reflection step pushes you to name the right fix rather than stare at the accuracy number.
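Why ~97% accuracy can hide a model that predicts nothing is easy to show with a few lines of numpy (synthetic illustrative data, not the lab's dataset): the constant "no refund" predictor is right 97% of the time yet catches zero refunds.

```python
# High accuracy, zero recall: the class-imbalance trap in miniature.
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.03).astype(int)  # ~3% positives ("refunds")
y_hat = np.zeros_like(y)                       # constant "no refund" model

accuracy = (y == y_hat).mean()                 # ~0.97
recall = y_hat[y == 1].mean()                  # exactly 0.0
print(f"accuracy={accuracy:.3f}  recall={recall:.3f}")
```

This is why the fixes target the objective rather than the model: class weights or oversampling change what the fit optimises, and PR-AUC or recall@k change what you judge it by.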