Synthetic Data Generation for Model Training
GPU sandbox · jupyter
Beta


Build a Self-Instruct style synthetic dataset end-to-end: seed instructions, LLM-driven generation, robust parsing, quality filtering, and dedup + diversity scoring. The same pipeline that produced Alpaca, WizardLM, and most modern instruction-tuning corpora.

40 min · 4 steps · 3 domains · Intermediate · ncp-genl · ncp-ads · nca-genl

What you'll learn

  1. Seed instructions + load the generator
  2. Generate & parse
  3. Quality filter
  4. Dedup + diversity + final dataset

Prerequisites

  • Familiar with Hugging Face transformers and tokenizers
  • Comfortable with Python list/dict data wrangling
  • Basic understanding of instruction tuning

Exam domains covered

Data Analysis and Visualization · Experimentation · LLM Integration and Development

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Synthetic Data · Self-Instruct · Instruction Tuning · Data Quality · Deduplication · TinyLlama · Dataset Curation

What you'll build in this synthetic-data lab

Synthetic data is the only reason frontier fine-tuning is tractable in 2026 — Alpaca, WizardLM, Orca, every domain instruction corpus that isn't scraped off the public web, and most of what powers open-weight instruction tunes came out of a Self-Instruct-style pipeline. In about 40 minutes on a real NVIDIA GPU we provision, you'll implement that pipeline end-to-end: seed examples, LLM-driven generation with sampling, robust parsing of free-form LM output, length + repetition filters, exact-hash dedup, and a diversity score you can watch collapse if your seeds are too homogeneous. You'll walk away with a working mental model of synthetic data as a funnel (not a fountain) and concrete numbers for how many samples die at each stage — parse failures alone typically kill 30-60% on first pass.
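The funnel framing above can be made concrete with a small stage-counting helper. This is an illustrative sketch, not the lab's exact code — the stage names and predicates here are hypothetical:

```python
# A minimal sketch of the "funnel" view: count survivors at each stage.
# Stage names and thresholds here are illustrative, not the lab's exact ones.
def run_funnel(samples, stages):
    """stages: list of (name, keep_fn); returns per-stage counts + survivors."""
    counts = {"generated": len(samples)}
    for name, keep in stages:
        samples = [s for s in samples if keep(s)]
        counts[name] = len(samples)
    return counts, samples

stages = [
    ("parsed", lambda s: "instruction" in s),
    ("length_ok", lambda s: 10 <= len(s.get("instruction", "")) <= 500),
]
counts, kept = run_funnel(
    [{"instruction": "Write a haiku about the sea"}, {"bad": True}], stages
)
```

Printing `counts` after each run is how you get the concrete per-stage attrition numbers the lab asks you to watch.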

The substance is the scaffolding you'll end up rewriting whenever you generate training data for a real fine-tune. You'll build a few-shot prompt from hand-crafted seeds, call model.generate(do_sample=True) with temperature in the 0.7-1.0 range — high enough to get instruction diversity, but also high enough that your regex parser will fail on some outputs — and watch parse-success rates in the 40-70% band that production Self-Instruct pipelines actually hit. You'll apply length gates ([10, 500] chars for instructions, [10, 2000] for outputs), strip degenerate repetition, then normalize and exact-hash dedup with a set. The critical lesson is that exact-hash dedup is a floor: at 50k samples you'll have thousands of templated paraphrases ("Write a haiku about the sea" vs "Compose a haiku about the ocean") that lex-differ but teach the model the same thing, which is why production pipelines layer embedding-based semantic dedup on top. TinyLlama is the generator here because it iterates in seconds — swap in Llama 3 70B, Mixtral, or a frontier API model for production and the pipeline code doesn't change.
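The quality gates described above can be sketched in a few lines. The length thresholds come from the lab text; the repetition heuristic (reject when one token dominates) is an assumption standing in for whatever degenerate-output check you choose:

```python
def length_ok(sample):
    # Gates from the lab text: [10, 500] chars for instructions, [10, 2000] for outputs.
    return (10 <= len(sample["instruction"]) <= 500
            and 10 <= len(sample["output"]) <= 2000)

def not_degenerate(text, max_ratio=0.5):
    # Assumed heuristic: reject if any single token dominates the output.
    tokens = text.lower().split()
    if not tokens:
        return False
    top = max(tokens.count(t) for t in set(tokens))
    return top / len(tokens) <= max_ratio

def quality_filter(samples):
    return [s for s in samples
            if length_ok(s) and not_degenerate(s["output"])]
```

Run these before dedup so you aren't hashing samples you'll throw away anyway.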

Prerequisites: comfort with Hugging Face transformers, Python list/dict wrangling, and the basic shape of instruction tuning. TinyLlama weights and the tokenizer are cached in the sandbox. The grader is deliberately realistic about stochastic generation — if a run produces zero quality-passing samples, it surfaces the diagnostic rather than failing silently, because that outcome is itself a teaching moment about generator choice.

Frequently asked questions

Why use TinyLlama specifically — won't a stronger generator produce better data?

Absolutely it would, and that's one of the reflection lessons. TinyLlama is used here because it fits on a single small GPU and runs fast enough that you can iterate on prompts and filters in seconds. In production you'd swap in Llama 3 70B, Mixtral, or even a frontier API model — the pipeline code doesn't change, only the generator handle does. The point of the lab is the scaffolding, not the generator choice.

Why expect parse-success well below 100%?

Because LLM output is fundamentally free-form and your regex/format parser is strict. A sampling temperature of 0.7-1.0 is what gives you instruction diversity, but it also makes the model occasionally skip delimiters, hallucinate extra fields, or emit markdown where you wanted plain text. Production Self-Instruct pipelines hit 40-70% parse success on first pass and compensate by generating more — tightening the parser below that level drops diversity faster than it drops noise.
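A strict parser like the one this answer describes might look like the following sketch. The `Instruction:`/`Output:` delimiter format is an assumption — the lab's actual template may differ — but the failure mode is the same regardless of format:

```python
import re

# Assumed output format: "Instruction: <text>\nOutput: <text>". Real pipelines
# pick their own delimiters; the point is that sampled text often breaks them.
PATTERN = re.compile(
    r"Instruction:\s*(?P<instruction>.+?)\s*Output:\s*(?P<output>.+)",
    re.DOTALL,
)

def parse(generation):
    m = PATTERN.search(generation)
    return m.groupdict() if m else None

def parse_rate(generations):
    parsed = [p for p in map(parse, generations) if p]
    return parsed, len(parsed) / max(len(generations), 1)

gens = [
    "Instruction: Name three primary colors.\nOutput: Red, yellow, blue.",
    "Sure! Here's a fun task for you...",  # model skipped the delimiters
]
parsed, rate = parse_rate(gens)
```

Logging `rate` per batch is how you notice whether prompt or temperature changes are helping or hurting.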

What's wrong with exact-match deduplication at scale?

It only catches byte-identical strings after normalization. A 50k dataset generated from similar prompts fills up with templated rewrites — 'Write a haiku about the sea', 'Compose a haiku about the ocean', 'Write me a haiku about a sea' — that lexically differ but teach the model the same lesson ten thousand times. Embedding-based dedup (compute BGE vectors, cluster, keep one per cluster) is what catches those, and it's the first upgrade the reflection step asks you to defend.
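The normalize-then-hash floor this answer describes is a few lines of Python. The normalization choices here (lowercase, strip punctuation, collapse whitespace) are one reasonable option, not the only one:

```python
import hashlib
import re

def normalize(text):
    # Assumed normalization: lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def exact_dedup(samples):
    seen, kept = set(), []
    for s in samples:
        h = hashlib.sha256(normalize(s["instruction"]).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(s)
    return kept
```

Note that "Write a haiku about the sea." and "write a haiku   about the sea" collapse to one entry, but "Compose a haiku about the ocean" survives — exactly the paraphrase class that embedding-based dedup exists to catch.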

How should I pick a diversity score — token ratio, edit distance, or verb coverage?

All three are valid for this lab as long as they land in [0, 1]. Unique-token ratio (unique tokens / total tokens across all instructions) is cheapest and catches degenerate repetition well. Average pairwise edit distance catches near-duplicates that token ratio misses. Verb coverage (count of distinct lead verbs — 'write', 'classify', 'summarize', etc.) is what the original Self-Instruct paper used because it approximates task-type diversity. In practice you'd track all three and alert when any one collapses.
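Two of the three metrics can be sketched in a few lines; the reference verb list below is illustrative, not the Self-Instruct paper's actual list:

```python
def unique_token_ratio(instructions):
    # Unique tokens / total tokens across all instructions; in [0, 1].
    tokens = [t for ins in instructions for t in ins.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def verb_coverage(instructions, verbs=("write", "classify", "summarize",
                                       "extract", "translate", "explain")):
    # Fraction of a reference verb list appearing as some instruction's
    # lead word; approximates task-type diversity, in [0, 1].
    leads = {ins.lower().split()[0] for ins in instructions if ins.split()}
    return len(leads & set(verbs)) / len(verbs)
```

A dataset that is all haiku prompts scores high on length filters but collapses both of these, which is why you track them separately.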

Why do seed examples matter so much when the generator produces most of the samples?

Because the few-shot prompt is what defines the task distribution the model generates into. Three seeds that are all 'write a haiku about X' collapse the output to haikus. Three seeds covering classification, extraction, and open-ended writing produce a much broader distribution. This is why the original Self-Instruct used 175 hand-written seeds and Alpaca used 175 as well — the seeds are the implicit task ontology, and they're worth hand-crafting.
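Assembling seeds into the few-shot prompt is mechanical once the seeds are written. This is a hypothetical template — the lab's exact prompt wording will differ — but it shows the structure: varied seed tasks, then a trailing delimiter the model continues from:

```python
# Hypothetical prompt template; the lab's exact template may differ.
def build_fewshot_prompt(seeds, n_new=5):
    blocks = [f"Instruction: {s['instruction']}\nOutput: {s['output']}"
              for s in seeds]
    header = ("You are generating diverse training tasks. "
              f"Continue with {n_new} new, varied instruction/output pairs.\n\n")
    return header + "\n\n".join(blocks) + "\n\nInstruction:"

# Deliberately varied seeds: classification, extraction, open-ended writing.
seeds = [
    {"instruction": "Classify the sentiment of: 'Great battery life.'",
     "output": "Positive"},
    {"instruction": "Extract all dates from: 'Due May 3, shipped May 1.'",
     "output": "May 3; May 1"},
    {"instruction": "Write a two-line poem about autumn.",
     "output": "Leaves drift down in gold,\nthe year folds itself to sleep."},
]
prompt = build_fewshot_prompt(seeds)
```

Swapping all three seeds for haiku prompts and regenerating is the fastest way to watch the diversity score collapse.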

When should I use LLM-as-judge filtering vs regex rules?

Length and format rules catch obvious junk for almost no cost — run them first. LLM-as-judge catches what rules can't see: factual errors, failure to follow the instruction, subtle toxicity, low-effort completions. The cost is real, though — every kept sample means an extra inference call, and the judge has its own biases (favoring verbose answers, penalizing correct-but-brief ones). The right pattern is regex rules as a cheap first pass, LLM-as-judge on the survivors, and human review on a calibration set that tunes both.
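The ordering this answer recommends — cheap rules first, judge on survivors only — can be sketched as follows. The judge here is a stub standing in for a real inference call, and the specific rule thresholds are assumptions:

```python
# Two-stage filter: cheap rules first, (stubbed) LLM judge on survivors only.
def rule_pass(sample):
    # Assumed cheap gates; tune thresholds to your data.
    return (10 <= len(sample["instruction"]) <= 500
            and len(sample["output"]) >= 10)

def llm_judge(sample):
    # Stub standing in for an inference call; a real judge returns a verdict
    # or score from a stronger model.
    return "placeholder" not in sample["output"].lower()

def two_stage_filter(samples):
    survivors = [s for s in samples if rule_pass(s)]   # no inference cost
    return [s for s in survivors if llm_judge(s)]      # one call per survivor
```

The design point is cost asymmetry: every sample the rules reject is a judge call you never pay for.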