Stable Diffusion + LoRA
GPU sandbox · jupyter
Beta

Load Stable Diffusion, attach LoRA adapters to the U-Net's attention layers, run a tiny overfit training loop, and generate with the adapted weights to prove that a few million trainable parameters actually move pixels.

45 min · 4 steps · 2 domains · Intermediate · NCA-GENM · NCP-GENL

What you'll learn

  1. Load SD, inspect components, generate baseline
  2. Attach LoRA adapters to the U-Net
  3. Tiny LoRA training loop (mechanics, not quality)
  4. Generate with the LoRA-adapted U-Net

Prerequisites

  • Comfortable with PyTorch and Hugging Face diffusers
  • Basic understanding of diffusion models (U-Net, VAE, text encoder)
  • Familiarity with PEFT / LoRA concepts at a conceptual level

Exam domains covered

Fine-Tuning & Customization · GPU Acceleration & Distributed Training

Skills & technologies you'll practice

This intermediate-level GPU lab gives you real-world reps across:

Stable Diffusion · LoRA · PEFT · U-Net · Diffusers · Fine-tuning · Text-to-Image · Adapters

What you'll build in this Stable Diffusion LoRA lab

Training a Stable Diffusion LoRA is the fastest way an engineer can actually ship a custom image model in 2026: the entire Civitai ecosystem, every branded style adapter, and every character LoRA you've downloaded off Hugging Face is this exact recipe. In roughly 45 minutes on a real NVIDIA GPU pod we provision, you'll walk away with rank-4 LoRA adapters injected into an SD 1.5 U-Net, a mental model of why style lives in cross-attention (text → image) while identity lives in self-attention, concrete numbers for how many parameters you're actually training (~1-2M of the ~860M U-Net), and a pixel-level diff that proves your adapters moved real pixels, not just that the loss dropped.

Technically, the lab targets the to_q, to_k, to_v, and to_out projections inside the U-Net's down-, mid-, and up-block attention modules using PEFT (LoraConfig + get_peft_model), runs a deliberately tiny overfit loop to make the mechanics legible in ~30 seconds of GPU time, then regenerates with the same prompt and seed so any pixel difference is causally attributable to the adapter weights. The deeper lesson is how to validate LoRA training: a mean-absolute-pixel-diff above 0.5 is a floor. It proves the adapter is wired, the optimizer is stepping, and the pipeline is loading adapted weights at generation time, but it is NOT a quality signal. Real style validation needs held-out prompts, CLIP score, and FID against a reference corpus; skipping that is exactly the trap engineers fall into when they ship a LoRA that looked great on the training prompt and mode-collapses everywhere else. You'll also see how to export the adapter (peft_model.save_pretrained) as a few-MB file that hot-swaps at generation time.
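
The mechanic PEFT wires into each of those projections can be sketched without any framework. This is a minimal, framework-free illustration (nested lists instead of tensors; `matvec` and `lora_forward` are hypothetical helper names, not PEFT APIs): the frozen weight W gets a low-rank update scaled by alpha/r, and because B is zero-initialized the adapter is a no-op until training moves it.

```python
import random

def matvec(W, x):
    """y = W @ x for a nested-list matrix W and vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def lora_forward(W, A, B, x, alpha=4, r=4):
    """LoRA-adapted projection: y = W x + (alpha / r) * B (A x).
    W stays frozen; only A (r x d_in) and B (d_out x r) would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, r = 8, 4
W = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                 # zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B all zeros, the adapted layer reproduces the base layer exactly.
assert lora_forward(W, A, B, x) == matvec(W, x)
```

This zero-init property is why attaching adapters never degrades the base model on step 0: the first generation after wiring is bit-identical to the baseline.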

Prerequisites: PyTorch fluency, rough familiarity with diffusers anatomy (U-Net, VAE, text encoder), and a conceptual grasp of what PEFT adapters do. The sandbox ships Stable Diffusion 1.5 weights, diffusers, peft, and accelerate preinstalled, so there's no download, no CUDA version pinning, and no pip install cascade. If you're Googling "train LoRA on Stable Diffusion", "PEFT diffusion fine-tuning", "LoRA rank vs alpha for diffusion", or "why isn't my LoRA changing the output", this is the lab: those answers are embedded in the grader and the reflection.

Frequently asked questions

Why rank=4 and why only attention layers?

Because SD 1.5's attention modules are where style and composition are shaped, and rank-4 low-rank updates are enough to bias those modules without destroying the base knowledge. Rank-4 on to_q, to_k, to_v, and to_out across the U-Net lands around 1-2M trainable parameters, which is under 0.3% of the ~860M total. Going higher (rank 16-32) is standard for character/identity adapters where you need more capacity; going lower (rank 1-2) is common for subtle style LoRAs. Putting LoRA on the full U-Net, including the conv layers, is an option but costs more parameters for a smaller marginal gain on style tasks.
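
The parameter count is simple arithmetic you can check yourself. The sketch below assumes an illustrative SD 1.5-style layout (block counts and hidden sizes are approximations, not an exact architecture dump): each adapted projection adds an A matrix of shape r × d_in and a B matrix of shape d_out × r.

```python
# Back-of-envelope trainable-parameter count for rank-4 LoRA on the
# attention projections of an SD 1.5-style U-Net. Block counts and
# hidden sizes below are illustrative assumptions.

R = 4            # LoRA rank
TEXT_DIM = 768   # CLIP text-embedding width feeding cross-attention k/v

def proj_params(d_in, d_out, r=R):
    """A LoRA pair adds A (r x d_in) plus B (d_out x r) parameters."""
    return r * d_in + d_out * r

def block_params(d, text_dim=TEXT_DIM):
    """One transformer block = self-attention + cross-attention,
    each with to_q, to_k, to_v, to_out projections."""
    self_attn = 4 * proj_params(d, d)
    cross_attn = (proj_params(d, d)               # to_q
                  + 2 * proj_params(text_dim, d)  # to_k, to_v take text embeddings
                  + proj_params(d, d))            # to_out
    return self_attn + cross_attn

# (hidden size, assumed number of attention blocks at that size)
layout = [(320, 5), (640, 5), (1280, 6)]
total = sum(n * block_params(d) for d, n in layout)

print(f"trainable LoRA params ≈ {total:,}")          # ≈ 797,184
print(f"fraction of 860M U-Net ≈ {total / 860e6:.3%}")
```

Under these assumptions the total lands just under a million parameters, a tiny fraction of a percent of the base U-Net, which is why doubling or quadrupling the rank is cheap.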

Why does the check require mean_abs_diff > 0.5 instead of asserting the image looks 'better'?

Because automatic "is this image better" is an unsolved problem, but "did the adapters actually modify the diffusion trajectory" is a trivially checkable signal. A mean absolute pixel diff of 0 — with the same prompt and same seed — means the adapters are either not wired, not active during generation, or being trained at learning_rate=0. Above 0.5 on a 0-255 scale means the optimizer moved weights, the modified attention was used, and the diffusion trajectory diverged. That's the floor; whether the output is actually good is what the Step 4 reflection is for.
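
The check itself is a one-liner worth internalizing. This sketch uses stand-in nested lists of 0-255 ints instead of real renders (`mean_abs_diff` is a hypothetical helper mirroring what a grader would compute, not a lab API):

```python
# Same prompt + same seed, so any pixel movement is attributable to the
# adapter. A diff of 0 means the adapter never touched the trajectory.

def mean_abs_diff(img_a, img_b):
    flat_a = [p for row in img_a for p in row]
    flat_b = [p for row in img_b for p in row]
    return sum(abs(a - b) for a, b in zip(flat_a, flat_b)) / len(flat_a)

baseline     = [[10, 10], [10, 10]]
adapted_noop = [[10, 10], [10, 10]]   # adapter not wired / lr = 0
adapted_real = [[12,  9], [11, 10]]   # weights actually moved

assert mean_abs_diff(baseline, adapted_noop) == 0.0   # failure mode
assert mean_abs_diff(baseline, adapted_real) > 0.5    # floor is cleared
```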

Why overfit on a single image? That's usually a bug.

In a production style-LoRA training job it absolutely is. Here it's deliberately the fastest way to see the LoRA mechanism move pixels in ~30 seconds of GPU time. A single-image overfit guarantees the gradient is large and consistent, which means even rank-4 adapters drift visibly within 10 steps. The Step 3 reflection and the rubric pointedly note that validating on the training prompt is a trap — it'll always look right because the adapter memorised it — which is the core methodological lesson.
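
Why a single sample drifts so fast can be seen in a toy loop. This is a deliberately minimal analogy, not the lab's training code: one "training image", one trainable scalar standing in for a LoRA weight, and plain gradient descent.

```python
# With one example the gradient is large and consistent every step, so
# even a tiny zero-initialized parameter drifts visibly within 10 steps.

x, y = 2.0, 6.0   # the one "training image" and its target
w = 0.0           # stand-in for a zero-initialized LoRA weight
lr = 0.05

losses = []
for _ in range(10):
    pred = w * x
    loss = (pred - y) ** 2
    grad = 2 * (pred - y) * x   # d(loss)/dw for the squared error
    w -= lr * grad
    losses.append(loss)

assert losses[-1] < losses[0]   # loss collapses on the memorized sample
```

The same dynamic means the training prompt will always render "correctly" after a few steps, which is exactly why it cannot serve as validation.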

What's the difference between a style LoRA and a character / identity LoRA?

Style LoRAs push the whole output distribution toward a look — cyberpunk neon, watercolor, vintage film — and can often be trained at rank 2-8 on cross-attention layers alone. Character / identity LoRAs need to encode specific facial geometry and clothing, which is higher-frequency information, so they typically want rank 16-64, training on self-attention modules too, and often add regularization images to prevent the whole model from collapsing onto that one identity. Dreambooth variants add a class-preservation loss on top of LoRA to keep the broader subject class intact.

Why use the same seed for baseline and adapted generation?

Because Stable Diffusion is deterministic given the initial noise, the scheduler, and the text embeddings. Fixing the seed freezes everything except the U-Net's weights, so any pixel difference between baseline_image and adapted_image is causally attributable to the LoRA adapters. Use a different seed and you can't distinguish "the LoRA changed the trajectory" from "we sampled a different starting noise" — the check would become noise-dominated.
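
The same logic holds for any seeded generator, which makes it easy to demonstrate without a GPU. In this sketch, `generate` is a hypothetical stand-in for the pipeline: seeded noise followed by a weight-dependent transform playing the role of the U-Net.

```python
import random

# Fixing the seed reproduces the exact "initial noise", so two runs can
# differ only if something else -- here, the weights -- changed.

def generate(seed, weight):
    rng = random.Random(seed)                 # frozen initial noise
    noise = [rng.gauss(0, 1) for _ in range(4)]
    return [weight * n for n in noise]        # stand-in for denoising

baseline     = generate(seed=42, weight=1.0)
same_weights = generate(seed=42, weight=1.0)
adapted      = generate(seed=42, weight=1.1)  # only the "U-Net" changed
other_seed   = generate(seed=7,  weight=1.0)

assert baseline == same_weights   # same seed + same weights: identical output
assert baseline != adapted        # difference attributable to the weights
assert baseline != other_seed     # different seed: the comparison is noise
```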

Can I save this LoRA and share it like the ones on Civitai or Hugging Face?

Yes — that's half of why the format exists. After training, peft_model.save_pretrained('./my-lora') writes only the adapter weights plus an adapter_config.json, typically a few MB. Anyone with the matching SD 1.5 base can load them via PeftModel.from_pretrained(...) or the diffusers pipe.load_lora_weights(...) helper. Most community LoRAs are a handful of megabytes and can be hot-swapped at generation time, which is why the ecosystem moved to LoRAs over full Dreambooth finetunes.
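
The "few MB" claim is just parameter count times bytes per parameter. The figures below are assumptions for illustration (an ~800k-parameter rank-4 attention-only adapter, fp16 storage), not measured file sizes:

```python
# Why a LoRA ships as a few megabytes while the base model is gigabytes:
# only the adapter weights are saved, not the 860M-parameter U-Net.

adapter_params = 800_000                 # assumed rank-4 attention-only count
bytes_fp16 = adapter_params * 2          # fp16 = 2 bytes per parameter
print(f"adapter ≈ {bytes_fp16 / 1e6:.1f} MB")        # ≈ 1.6 MB

base_unet_bytes = 860_000_000 * 2
print(f"base U-Net ≈ {base_unet_bytes / 1e9:.1f} GB")  # ≈ 1.7 GB
```

Three orders of magnitude smaller is what makes hot-swapping adapters at generation time practical, and why sharing sites host thousands of them against a handful of base checkpoints.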