Data Preparation for LLM Training

Build a real pretraining/instruction data pipeline: load a raw corpus, apply quality filters, deduplicate, train a BPE tokenizer, and batch-validate on GPU. This is the unglamorous work that actually decides how good your model will be.

45 min · 5 steps · 3 domains · Intermediate · NCP-GENL · NCA-GENL · NCP-ADS

What you'll learn

  1. Load and explore the raw corpus
  2. Quality filter
  3. Exact-hash deduplication
  4. Train a BPE tokenizer
  5. Format for instruction tuning and validate on GPU

Prerequisites

  • Comfortable with Python lists, dicts, and basic string manipulation
  • Familiarity with Hugging Face datasets and tokenizers
  • Basic understanding of PyTorch tensors and GPU tensors

Exam domains covered

Data Analysis and Visualization · LLM Fundamentals & Architecture · Experimentation

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Data Curation · Tokenizer · BPE · Deduplication · Quality Filtering · Instruction Tuning · Hugging Face

What you'll build in this LLM data-prep lab

Data prep is the unglamorous work that actually determines how good an LLM turns out: the Llama 3 paper spends more pages on filtering, dedup, and source mixing than on architecture, and DBRX's post-mortem attributed most of its quality gain to data-pipeline improvements rather than model changes. In about 45 minutes on a real NVIDIA GPU we provision, you'll build a working pretraining/instruction data pipeline end to end (load, filter, dedup, train a BPE tokenizer, batch-validate on GPU). You'll come out with concrete answers to the questions that actually matter in production: why exact-hash dedup is only a floor, why the tokenizer is the one decision you can never walk back, and which stage in your pipeline is silently dropping the samples you most wanted to keep.

Technically, you'll stream a 1,000+ sample Hugging Face corpus (wikitext, c4, or similar), apply a quality filter on minimum length and alphanumeric ratio (with a deliberately inverted inequality you have to fix, the exact mistake that has shipped to production at real shops), normalize and md5-hash for exact dedup while keeping the original text untouched, then train a real tokenizers.Tokenizer(models.BPE()) with a ByteLevel pre-tokenizer and BpeTrainer(vocab_size=8000) via train_from_iterator. The final step batches prompt/response pairs on CUDA with padding and attention masks, then runs a forward pass through GPT-2 to confirm the pipeline produces tensors a real model can consume without crashing. The pointed lesson is scale: exact-hash dedup leaves roughly 30-50% of effective duplicates in place on Common Crawl-type data, which is why production pipelines layer MinHash + LSH (shingle near-dupes), SimHash, or embedding-based semantic clustering on top. You'll also see why ByteLevel beats Whitespace or SentencePiece BPE for modern LLMs (every byte is representable, so OOV is impossible at inference), and why vocab size is fixed for the life of the model.
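The tokenizer step above can be sketched with the `tokenizers` API the lab names. The tiny in-memory corpus here is illustrative only; the lab trains on the streamed Hugging Face corpus instead:

```python
# Minimal sketch of the tokenizer-training step, assuming the `tokenizers`
# library; the repeated-sentence corpus is a stand-in for the real one.
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

corpus = ["the quick brown fox jumps over the lazy dog"] * 100

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# vocab_size=8000 matches the lab; real LLMs use 32k-128k.
trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator(iter(corpus), trainer=trainer)

ids = tokenizer.encode("the quick brown fox").ids
decoded = tokenizer.decode(ids)
```

With a corpus this small the learned vocab never reaches the 8,000 ceiling; the trainer simply runs out of pairs to merge, which is itself a useful thing to observe.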

Prerequisites: comfort with Python string handling, Hugging Face datasets and tokenizers, and basic PyTorch tensors. The sandbox has the HF libraries, the BPE trainer, and GPT-2 weights preinstalled. The grader is strict where it matters: it re-hashes the dedup output to prove all survivors are unique, confirms tokenization actually compresses versus a whitespace split, and enforces GPU-resident tensors with a realistic vocab dimension before you move on.

Frequently asked questions

Why not just use dataset.filter() and dataset.map() everywhere?

You can, and at production scale you should. The lab implements filter and dedup by hand so you see the rules — 100-char minimum, alphanumeric ratio, normalization before hashing — rather than burying them in a helper. Once you've written the plain-Python version, swapping to Arrow-backed datasets operations is a trivial speed optimization. Getting the rules wrong, on the other hand, silently drops the wrong samples and is invisible until your loss curve misbehaves two days into training.
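As a concrete reference point, the filter rules described above (100-char minimum, alphanumeric ratio) can be sketched in plain Python. The function names and the 0.6 ratio threshold are illustrative assumptions, not the lab's exact values:

```python
# Sketch of the lab's quality-filter rules; MIN_ALNUM_RATIO is an assumed
# threshold, and the deliberate bug in the lab inverts one comparison
# (e.g. `len(text) <= MIN_LEN`), silently keeping junk and dropping good text.
MIN_LEN = 100
MIN_ALNUM_RATIO = 0.6

def alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalnum() for c in chars) / len(chars)

def passes_quality_filter(text: str) -> bool:
    # Correct direction: keep long-enough, mostly-alphanumeric text.
    return len(text) >= MIN_LEN and alnum_ratio(text) >= MIN_ALNUM_RATIO
```

Writing it this way makes the failure mode visible: flip either `>=` to `<=` and the filter still runs without error, it just curates the corpus backwards.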

Why is exact-hash deduplication 'only a floor'?

Because it catches byte-identical strings and nothing else. At web scale, near-duplicates dominate: boilerplate headers and footers, templated product descriptions, mirrored articles with one-word edits, paraphrased Wikipedia. Production pipelines layer on MinHash + LSH (for shingle-based near-dupe detection), SimHash, or semantic dedup via embedding clustering. Exact-hash is still worth running first because it's cheap and removes the obvious case, but on its own it leaves roughly 30-50% of effective duplicates in place on Common Crawl-type corpora.
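The exact-hash floor itself is a few lines of standard-library Python. This sketch follows the pattern the lab describes (normalize only for hashing, keep the original text untouched); the helper names are illustrative:

```python
# Exact-hash dedup: normalize a copy for hashing, keep the original text.
import hashlib
import re

def normalize_for_hash(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants collide.
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_dedup(docs: list[str]) -> list[str]:
    seen, unique = set(), []
    for doc in docs:
        h = hashlib.md5(normalize_for_hash(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(doc)  # first occurrence survives, original casing intact
    return unique

docs = ["Hello  World", "hello world", "a near-duplicate with one edit"]
survivors = exact_dedup(docs)  # the first two collapse after normalization
```

Note what it misses: "a near-duplicate with one edit" would also survive next to an almost-identical sibling, which is exactly the gap MinHash + LSH exists to close.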

Why a ByteLevel pre-tokenizer for BPE?

ByteLevel is what GPT-2 pioneered and what most modern open LLMs still use. It maps every byte to a printable Unicode character before BPE runs, which guarantees the tokenizer never sees an 'unknown' symbol at inference — any input is representable, because every input byte has a mapping. Alternatives (Whitespace pre-tokenizer, SentencePiece BPE, Unigram) handle OOV differently and change downstream behavior. ByteLevel gives you coverage + reasonable compression + compatibility with the GPT-2/Llama tokenizer family, which is why it's the default.
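The "every byte is representable" claim is easy to verify directly: ByteLevel's base alphabet has exactly 256 symbols, one per possible byte value, so the base vocabulary alone covers any input before a single merge is learned. A minimal check, assuming the `tokenizers` library:

```python
# ByteLevel maps each of the 256 possible byte values to a printable
# character, so no input string can fall outside the vocabulary.
from tokenizers import pre_tokenizers

alphabet = pre_tokenizers.ByteLevel.alphabet()
# 256 distinct symbols: the guarantee that OOV is impossible at inference.
```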

Why does tokenizer vocab size matter so much?

Because it's fixed for the life of the model — every later decision (embedding table size, token budget, maximum context cost, per-step compute) is downstream of it. Too small: the tokenizer splits common words into many pieces, inflating your sequence lengths and shrinking effective context. Too large: the embedding and output head balloon, wasting parameters on rare tokens. The 8,000 setting in this lab is small on purpose (fast to train, easy to inspect); real LLMs use 32k-128k depending on language coverage.
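The parameter cost is worth putting numbers on. Embedding table (and an untied output head) each scale as vocab_size × d_model; using GPT-2 small's width of 768 purely for scale:

```python
# Back-of-envelope cost of the vocab-size decision: embedding parameters
# scale linearly with vocab size, and an untied output head doubles it.
def embedding_params(vocab_size: int, d_model: int) -> int:
    return vocab_size * d_model

d_model = 768  # GPT-2 small hidden width, used here only for scale
small = embedding_params(8_000, d_model)    # this lab's vocab: 6,144,000
large = embedding_params(128_000, d_model)  # LLM-scale vocab: 98,304,000
# A 16x jump in embedding parameters, locked in for the life of the model.
```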

Why feed the instruction batch through GPT-2 at the end?

To prove the pipeline actually produces tensors a real model can consume without crashing. Step 5 checks that input_ids and attention_mask have matching shapes, live on CUDA, and that the forward pass returns logits with the right batch/seq/vocab dimensions. This is the smallest-possible end-to-end sanity check: a pipeline that processes ten million samples but fails to produce a valid batch tensor is worse than useless, and catching it at sample size 4 is cheap.
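The shape checks in that sanity pass can be sketched without downloading GPT-2: a randomly initialised embedding-plus-head stand-in (an assumption of this sketch, not the lab's model) exercises the same padding, attention-mask, and logits-dimension assertions, with a CPU fallback so it runs anywhere:

```python
# Shape-level sketch of the step-5 check, with a stand-in for GPT-2.
import torch

VOCAB = 50257  # GPT-2's vocab size, so the logits dim is realistic
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Variable-length token-id sequences, as the tokenizer would emit them.
seqs = [[10, 42, 7], [5, 99], [1, 2, 3, 4]]
max_len = max(len(s) for s in seqs)

input_ids = torch.zeros(len(seqs), max_len, dtype=torch.long)
attention_mask = torch.zeros(len(seqs), max_len, dtype=torch.long)
for i, s in enumerate(seqs):
    input_ids[i, : len(s)] = torch.tensor(s)
    attention_mask[i, : len(s)] = 1  # 1 = real token, 0 = padding

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)

# Stand-in "model": embedding + linear head, enough to validate shapes.
model = torch.nn.Sequential(
    torch.nn.Embedding(VOCAB, 64), torch.nn.Linear(64, VOCAB)
).to(device)
logits = model(input_ids)

assert input_ids.shape == attention_mask.shape
assert logits.shape == (len(seqs), max_len, VOCAB)
```

Swapping the stand-in for real GPT-2 changes only the model line; the assertions, which are the actual point of step 5, stay identical.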

What's genuinely missing from this pipeline compared to Llama 3 / DBRX scale?

Language ID (keep English-only or build language-aware mixes), near-duplicate detection (MinHash/LSH), toxicity and PII scrubbing, source-weighted mixing (code, web, books at tuned ratios), per-source quality classifiers (the Llama 3 paper trained dedicated filters), and proper shuffling across shards. The reflection step pushes you to name which of these you'd build first and why — and the honest answer is usually 'near-duplicate dedup' because it's the biggest silent-quality win at scale.