AI Foundations

Classifier-Free Guidance (CFG)

By Arpit Tripathi, Founder

Classifier-free guidance steers a diffusion model toward its prompt by combining conditional and unconditional noise predictions at each denoising step. A guidance scale controls the tradeoff between prompt adherence and sample diversity.

What is Classifier-Free Guidance?

Classifier-free guidance (CFG) is a sampling technique that makes a conditional diffusion model follow its prompt more strongly without using a separate classifier. Introduced by Jonathan Ho and Tim Salimans (presented at a NeurIPS 2021 workshop and posted to arXiv in 2022), it replaces the earlier classifier guidance, which required training an extra noise-aware classifier whose gradients pushed samples toward the target class.

Instead, CFG trains a single diffusion model that can run both conditionally (given a prompt) and unconditionally (with the prompt dropped). At each denoising step it computes both noise predictions and extrapolates away from the unconditional prediction toward the conditional one. A guidance scale sets how far to push.

  • No separate classifier is required, unlike earlier classifier guidance.
  • One model produces both conditional and unconditional noise estimates.
  • A guidance scale controls how strongly the prompt is followed.

The guidance formula

During training, the conditioning is randomly dropped a fraction of the time (commonly around 10 to 20 percent), so the same network learns both the conditional noise estimate and an unconditional one. At sampling time, the two estimates are combined.
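
The condition dropout described above can be sketched in a few lines. This is a minimal illustration, not any library's training code: the names drop_condition, build_training_batch and NULL_CONDITION are hypothetical, and a real trainer would typically swap in an empty-prompt embedding rather than None.

```python
import random

NULL_CONDITION = None  # hypothetical stand-in for the "empty prompt" input


def drop_condition(condition, p_uncond, rng):
    """Return the null condition with probability p_uncond, else the condition."""
    if rng.random() < p_uncond:
        return NULL_CONDITION
    return condition


def build_training_batch(prompts, p_uncond=0.1, seed=0):
    """Apply independent condition dropout to every prompt in a batch."""
    rng = random.Random(seed)
    return [drop_condition(p, p_uncond, rng) for p in prompts]


prompts = ["a lighthouse at dawn"] * 10_000
batch = build_training_batch(prompts, p_uncond=0.1)
dropped_fraction = sum(c is NULL_CONDITION for c in batch) / len(batch)
print(f"dropped {dropped_fraction:.1%} of conditions")
```

Because the dropout is applied per example, the same network sees both conditioned and unconditioned inputs within a single training run, with no second model involved.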

The guided noise prediction starts from the unconditional estimate and moves in the direction of the conditional estimate, scaled by the guidance weight. A scale of 0 ignores the prompt entirely, a scale of 1 recovers the plain conditional model, and larger scales exaggerate prompt adherence.

ε̃_θ(z_t, c) = ε_θ(z_t, ∅) + s · ( ε_θ(z_t, c) − ε_θ(z_t, ∅) )
The guided noise estimate: start from the unconditional prediction ε_θ(z_t, ∅) and extrapolate toward the conditional one by guidance scale s. Here s = 1 gives the unguided conditional model; s > 1 strengthens the prompt.
  • One network is trained to be both conditional and unconditional via prompt dropout.
  • Guidance extrapolates from the unconditional toward the conditional prediction.
  • No separate classifier or its gradients are needed at any point.
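
To make the formula concrete, here is a small sketch that applies it elementwise to two toy noise predictions. The function name cfg_combine and the example vectors are illustrative assumptions; real predictions are tensors with the shape of the latent.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Guided estimate: eps_uncond + scale * (eps_cond - eps_uncond), elementwise."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0, -2.0]  # toy unconditional noise prediction
eps_cond = [1.0, 1.0, 0.0]     # toy conditional noise prediction

no_prompt = cfg_combine(eps_uncond, eps_cond, 0.0)  # equals eps_uncond
plain = cfg_combine(eps_uncond, eps_cond, 1.0)      # equals eps_cond
guided = cfg_combine(eps_uncond, eps_cond, 2.0)     # extrapolates past eps_cond
```

The same expression can be rearranged as (1 − s) · ε_θ(z_t, ∅) + s · ε_θ(z_t, c). At s = 7.5 the conditional prediction gets weight 7.5 and the unconditional one gets weight −6.5, which shows why large scales amplify whatever separates the two predictions.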

The guidance scale: adherence vs diversity

The guidance scale (often labeled CFG scale, with typical values around 7 to 8 for text-to-image) is the central knob. Raising it makes outputs match the prompt more closely and look more saturated and contrast-heavy, but reduces variety across samples and can introduce artifacts or oversaturation when pushed too high.

Lowering the scale produces more diverse, sometimes more natural outputs that may drift from the prompt. The optimal value depends on the task and the model: photorealistic models often prefer moderate scales, while some distilled models work best at low scales or with no guidance at all.

  • Higher scale: stronger prompt adherence, lower diversity, risk of oversaturation.
  • Lower scale: more diverse and natural output, weaker prompt match.
  • Typical text-to-image values cluster around 7 to 8, but vary by model.
  • Each guided step costs two forward passes (conditional and unconditional).

Using CFG in Diffusers

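In the Hugging Face Diffusers library, classifier-free guidance is exposed through the guidance_scale argument of a pipeline call. Internally, when guidance_scale is greater than 1, the pipeline batches the unconditional and conditional prompt embeddings together, runs them through the denoiser as a single batch of twice the size, and combines the two predictions with the formula above at every step. That is one batched call, but it still computes two predictions per step.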

```python
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a watercolor painting of a lighthouse at dawn",
    negative_prompt="blurry, low quality",  # unconditional branch
    guidance_scale=7.5,                     # CFG strength
    num_inference_steps=30,
).images[0]
image.save("out.png")
```
Set the classifier-free guidance scale in a Diffusers text-to-image pipeline.
  • guidance_scale=1.0 disables guidance (plain conditional sampling).
  • In many pipelines the negative prompt replaces the empty prompt in the unconditional branch, so guidance pushes samples away from it.
  • Because each step needs two predictions, guidance roughly doubles step cost.
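
The batching idea can be illustrated with a toy denoiser: stacking the unconditional and conditional inputs into one batch of two gives the same guided prediction as two separate calls. Everything here (toy_denoiser, the scalar "conditions", the function names) is a hypothetical stand-in, not the Diffusers implementation.

```python
def toy_denoiser(latents, conds):
    """Hypothetical stand-in for a noise-prediction network.

    Takes a batch (list) of latents and a matching batch of scalar conditions,
    and returns one prediction per batch element.
    """
    return [[0.5 * x + cond for x in latent] for latent, cond in zip(latents, conds)]


def guided_step_two_calls(latent, cond, null_cond, scale):
    """Two separate forward passes: one unconditional, one conditional."""
    eps_uncond = toy_denoiser([latent], [null_cond])[0]
    eps_cond = toy_denoiser([latent], [cond])[0]
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def guided_step_batched(latent, cond, null_cond, scale):
    """One forward pass over a batch of two: [unconditional, conditional]."""
    eps_uncond, eps_cond = toy_denoiser([latent, latent], [null_cond, cond])
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


latent = [0.2, -0.4, 1.0]
two_calls = guided_step_two_calls(latent, cond=1.0, null_cond=0.0, scale=7.5)
batched = guided_step_batched(latent, cond=1.0, null_cond=0.0, scale=7.5)
```

Batching changes how the work is scheduled, not how much there is: the network still processes two inputs per step, so the cost stays close to double that of unguided sampling.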

Why CFG replaced classifier guidance

Classifier guidance required a separate classifier trained on noisy images, adding complexity and a second model to maintain, and its quality was bounded by the classifier. CFG removes that dependency: the diffusion model itself supplies both terms, so guidance works for arbitrary conditioning such as free-form text where training a clean classifier would be impractical.

The cost is that each guided step runs the model twice, once with and once without the condition, roughly doubling inference compute per step. This tradeoff is widely accepted because the quality and controllability gains are large, which is why CFG is standard in modern text-to-image systems including Stable Diffusion.

  • No extra classifier to train or maintain, unlike classifier guidance.
  • Works naturally with free-form text conditioning.
  • Costs a second forward pass per denoising step.
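
The per-step cost can be checked by counting model evaluations in a toy sampling loop. This is a sketch under stated assumptions: CountingDenoiser, sample and the simplistic latent update are hypothetical and do not correspond to a real scheduler. Skipping the unconditional pass when the scale is exactly 1 is a design choice of this sketch.

```python
class CountingDenoiser:
    """Hypothetical toy noise predictor that counts its forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, latent, cond):
        self.calls += 1
        return [0.1 * x + cond for x in latent]


def sample(model, latent, cond, null_cond, scale, num_steps, step_size=0.1):
    """Toy denoising loop; skips the unconditional pass when scale == 1."""
    use_guidance = scale != 1.0
    for _ in range(num_steps):
        eps_cond = model(latent, cond)
        if use_guidance:
            eps_uncond = model(latent, null_cond)
            eps = [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
        else:
            eps = eps_cond
        # Simplistic update for illustration only, not a real scheduler step.
        latent = [x - step_size * e for x, e in zip(latent, eps)]
    return latent


guided_model, plain_model = CountingDenoiser(), CountingDenoiser()
sample(guided_model, [1.0, -1.0], cond=1.0, null_cond=0.0, scale=7.5, num_steps=30)
sample(plain_model, [1.0, -1.0], cond=1.0, null_cond=0.0, scale=1.0, num_steps=30)
print(guided_model.calls, plain_model.calls)  # prints: 60 30
```

With 30 steps, the guided run evaluates the model 60 times and the unguided run 30 times, matching the "second forward pass per step" cost described above.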

Key takeaways

  • CFG steers diffusion samples toward the prompt by combining conditional and unconditional noise predictions.
  • A single model learns both modes by randomly dropping the condition during training.
  • The guidance scale trades prompt adherence (high) against diversity (low); typical values are around 7 to 8.
  • It replaced classifier guidance, removing the need for a separate noise-aware classifier.
  • Each guided denoising step runs the model twice, roughly doubling per-step compute.

Frequently asked questions

What does the guidance scale do?
The guidance scale controls how strongly the sample follows the prompt. A higher scale increases prompt adherence but lowers diversity and can oversaturate images; a lower scale yields more varied, natural output that may drift from the prompt. A scale of 1 means no guidance.

What is a typical guidance scale?
Values around 7 to 8 are common defaults for text-to-image, balancing prompt adherence and image quality. The best value depends on the model and sampler; some distilled models work well at much lower scales or with no guidance at all.

How does classifier-free guidance differ from classifier guidance?
Classifier guidance uses a separate classifier trained on noisy images to push samples toward a class. Classifier-free guidance needs no separate classifier: the diffusion model produces both conditional and unconditional estimates itself, which suits free-form text prompts.

Why does CFG make sampling slower?
Each denoising step requires two forward passes, one conditional and one unconditional, to compute both predictions before combining them. This roughly doubles the compute per step compared to unguided sampling.

What happens if the guidance scale is too high?
Outputs become oversaturated, high-contrast, and less diverse, and can develop artifacts. Very high scales overemphasize the prompt direction at the cost of natural appearance and sample variety.