AI Foundations

Latent Diffusion Models (LDM)

By Arpit Tripathi, Founder

Latent diffusion models run the diffusion process in the compressed latent space of a pretrained autoencoder instead of on raw pixels. This cuts compute and memory substantially while preserving image quality, and it is the design behind Stable Diffusion.

What is a Latent Diffusion Model?

A latent diffusion model (LDM) is a diffusion model that operates in the lower-dimensional latent space of a pretrained autoencoder rather than directly on pixels. Introduced by Rombach and colleagues in High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022), the design separates image generation into two stages: an autoencoder that compresses and reconstructs images, and a diffusion model that learns to generate the compressed latent representation.

Because the diffusion process runs on a much smaller tensor, training and sampling become far cheaper than pixel-space diffusion while perceptual quality is preserved. Stable Diffusion is the best-known LDM, built on this architecture.

  • Generation is split into a frozen autoencoder and a latent-space diffusion model.
  • Operating on latents rather than pixels is what makes LDMs efficient.
  • Stable Diffusion is the canonical implementation of this design.

Why the latent space saves so much compute

Pixel-space diffusion spends most of its compute modeling high-frequency detail that the human eye barely perceives. An LDM offloads that work to a perceptual autoencoder, trained once, which compresses an image by a spatial downsampling factor f. For a 512x512 RGB image at f=8, the latent grid is 64x64: 4,096 spatial positions instead of 262,144, a 64-fold (f²) reduction in what the diffusion model must process at each step.

The original paper studied factors such as f=4, f=8, and f=16, finding f=4 and f=8 to be a sweet spot between compression and reconstruction fidelity. Above this range, the autoencoder loses too much detail; below it, the compute savings shrink. Every denoising step then runs on the small latent grid, which is why a single GPU can sample high-resolution images in seconds.
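The arithmetic behind these factors can be checked with a small sketch. The helper names are hypothetical, and `latent_channels=4` is an assumed value (it matches the autoencoder commonly used with Stable Diffusion; the text above does not fix it).

```python
def latent_grid(height, width, f, latent_channels=4):
    """Return the latent shape (h, w, c) for an image downsampled by factor f.

    latent_channels=4 is an illustrative assumption, not a value from the text.
    """
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (height // f, width // f, latent_channels)


def spatial_reduction(height, width, f):
    """Ratio of pixel positions to latent positions, which equals f squared."""
    h, w, _ = latent_grid(height, width, f)
    return (height * width) // (h * w)


for f in (4, 8, 16):
    h, w, c = latent_grid(512, 512, f)
    ratio = spatial_reduction(512, 512, f)
    print(f"f={f}: latent grid {h}x{w}x{c}, {ratio}x fewer spatial positions")
```

For a 512x512 input this reports 128x128, 64x64, and 32x32 grids, with 16x, 64x, and 256x fewer spatial positions respectively.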

L_LDM = E_{z, c, ε, t} [ || ε − ε_θ(z_t, t, c) ||² ]
The LDM training objective: a U-Net ε_θ predicts the noise ε that was added to a clean latent z to produce the noised latent z_t at timestep t, conditioned on c, instead of predicting noise on pixels.

x ∈ ℝ^{H×W×3} → z = E(x) ∈ ℝ^{(H/f)×(W/f)×c}
The encoder E maps a pixel image to a latent compressed by factor f in each spatial dimension; the decoder D maps a sampled latent back to pixels. In this shape, c is the number of latent channels, not the condition c in the objective.

  • The autoencoder is trained once and frozen, then reused for all diffusion training.
  • Spatial downsampling factor f (4 or 8 in practice) sets the compression level.
  • Diffusion compute scales with the latent grid size, not the pixel resolution.
  • A perceptual plus patch-adversarial loss keeps reconstructions sharp.
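
The training objective above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard DDPM forward process z_t = sqrt(ᾱ_t)·z + sqrt(1 − ᾱ_t)·ε, a toy noise schedule, a flattened latent, and a hypothetical `zero_denoiser` standing in for the U-Net.

```python
import math
import random


def add_noise(z0, eps, alpha_bar_t):
    """Forward process: z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]


def squared_l2(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


def ldm_loss_sample(z0, predict_noise, alpha_bars, cond, rng):
    """One Monte Carlo sample of the objective for a single flattened latent."""
    t = rng.randrange(len(alpha_bars))        # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in z0]   # Gaussian noise
    z_t = add_noise(z0, eps, alpha_bars[t])   # noised latent
    eps_pred = predict_noise(z_t, t, cond)    # stand-in for the U-Net call
    return squared_l2(eps, eps_pred)


def zero_denoiser(z_t, t, cond):
    """Hypothetical placeholder model that always predicts zero noise."""
    return [0.0] * len(z_t)


rng = random.Random(0)
alpha_bars = [0.99, 0.9, 0.5, 0.1]   # assumed toy schedule, decreasing in t
z0 = [0.3, -1.2, 0.8, 0.0]           # toy latent, as if produced by E(x)
loss = ldm_loss_sample(z0, zero_denoiser, alpha_bars, cond=None, rng=rng)
print(loss)
```

In real training the expectation is approximated by averaging this quantity over minibatches, and the gradient flows only into the denoiser because the autoencoder is frozen.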

Architecture: autoencoder, U-Net, and conditioning

An LDM has three components. First, an encoder E maps an image into a latent and a decoder D maps it back, trained with a combination of perceptual and adversarial losses plus a mild regularizer (KL or vector-quantization) to keep the latent space well behaved. Second, a time-conditioned U-Net denoiser learns to reverse the diffusion process in latent space. Third, a conditioning mechanism injects guidance such as text.

Conditioning is handled by cross-attention layers inside the U-Net. The condition (for example a text prompt encoded by a language model) becomes the keys and values, while the latent features provide the queries. This lets the same architecture accept text, semantic maps, bounding boxes, or other modalities as the conditioning signal.
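
A minimal single-head sketch of this routing, in plain Python, may make the roles concrete. The function names, the identity projection matrices, and the tiny inputs are all illustrative assumptions; real U-Nets use learned projections, multiple heads, and tensor libraries.

```python
import math


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_feats, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latent positions query the condition tokens."""
    queries = [matvec(w_q, x) for x in latent_feats]   # from latent features
    keys = [matvec(w_k, c) for c in cond_tokens]       # from the condition
    values = [matvec(w_v, c) for c in cond_tokens]     # from the condition
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)                         # one weight per token
        weights.append(attn)
        outputs.append([
            sum(a * v[i] for a, v in zip(attn, values))
            for i in range(len(values[0]))
        ])
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]                    # placeholder projections
latent_feats = [[1.0, 0.0], [0.0, 1.0]]                # 2 latent positions
cond_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # 3 condition tokens
out, attn = cross_attention(latent_feats, cond_tokens, identity, identity, identity)
print(len(out), len(attn[0]))
```

The output has one vector per latent position, and each attention row has one weight per condition token, so the number of condition tokens can change without changing the latent grid.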

  • Stage 1: a frozen perceptual autoencoder defining the latent space.
  • Stage 2: a U-Net trained to denoise latents over diffusion timesteps.
  • Cross-attention layers route text or other conditions into the U-Net.
  • At inference, sample a latent then decode once with D to get pixels.
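
The inference path described above (many denoising steps on the latent, then one decode) can be sketched as follows. The text does not say which sampler is used; this sketch swaps in a deterministic DDIM update (eta = 0) as one common choice. The schedule, the `oracle_denoiser`, and the identity `decode` are hypothetical stand-ins for a trained U-Net and decoder D.

```python
import math
import random


def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) from timestep t to the previous one."""
    out = []
    for z, e in zip(z_t, eps_pred):
        # Estimate the clean latent implied by the predicted noise.
        z0_pred = (z - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        # Re-noise that estimate to the previous, less noisy level.
        out.append(math.sqrt(alpha_bar_prev) * z0_pred
                   + math.sqrt(1.0 - alpha_bar_prev) * e)
    return out


def sample(predict_noise, decode, alpha_bars, latent_size, cond, rng):
    """Run the denoising loop in latent space, then decode exactly once."""
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]   # start from noise
    for t in reversed(range(len(alpha_bars))):
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps_pred = predict_noise(z, t, cond)
        z = ddim_step(z, eps_pred, alpha_bars[t], alpha_bar_prev)
    return decode(z)                                         # single decode


target = [0.5, -0.25, 1.0]      # toy clean latent the oracle steers toward
alpha_bars = [0.9, 0.5, 0.1]    # assumed toy schedule, decreasing in t
decode_calls = []


def oracle_denoiser(z_t, t, cond):
    """Hypothetical perfect denoiser: returns the noise separating z_t from target."""
    a = alpha_bars[t]
    return [(z - math.sqrt(a) * x) / math.sqrt(1.0 - a) for z, x in zip(z_t, target)]


def decode(z):
    """Identity placeholder for the decoder D; records each call."""
    decode_calls.append(1)
    return list(z)


image = sample(oracle_denoiser, decode, alpha_bars, len(target), None, random.Random(0))
print(len(decode_calls))
```

With the oracle, the loop lands on the target latent up to floating-point error, and the decoder is called once regardless of how many denoising steps run.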

LDM and Stable Diffusion

Stable Diffusion is a text-to-image LDM that pairs the latent diffusion U-Net with a text encoder and a variational autoencoder. The original v1.x release used a frozen CLIP ViT-L/14 text encoder to produce the conditioning embeddings fed through cross-attention. Generation runs the denoising loop in the 64x64 latent space for a typical 512x512 output, then decodes once.

Because only the final decode touches full resolution, an LDM produces high-resolution images with a fraction of the memory that pixel-space diffusion would require, which is what made open, consumer-GPU text-to-image generation practical.

  • Stable Diffusion is the canonical open LDM for text-to-image.
  • Classifier-free guidance is applied during the latent denoising loop.
  • Only a single decode step operates at full pixel resolution.
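
The classifier-free guidance step mentioned above is usually written as a blend of two noise predictions from the same U-Net, one with the condition and one without. The sketch below uses that common formulation; the function name and the toy vectors are illustrative, and the text does not specify the guidance scale used in practice.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction unchanged,
    larger values extrapolate further along the conditional direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0, -1.0]   # toy prediction with an empty prompt
eps_cond = [1.0, 1.0, 0.0]      # toy prediction with the text prompt
print(classifier_free_guidance(eps_uncond, eps_cond, 1.0))
print(classifier_free_guidance(eps_uncond, eps_cond, 2.0))
```

Because both predictions are made on the latent grid, guidance roughly doubles the cost of each denoising step but never touches full pixel resolution.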

Tradeoffs and limitations

The compression that makes LDMs efficient is also their main limitation: anything the autoencoder discards cannot be recovered by the diffusion model. Very fine textures, small text, and precise high-frequency structure can degrade, which is why fine detail and legible text were long-standing weak points for early latent models.

The quality ceiling is therefore set jointly by the autoencoder and the diffusion U-Net. Improvements often come from better autoencoders, larger or higher-quality latent spaces, and stronger conditioning encoders rather than only scaling the denoiser.

  • Autoencoder compression caps recoverable detail; fine text and texture can suffer.
  • Choice of downsampling factor f trades speed against reconstruction fidelity.
  • Quality depends on both the autoencoder and the U-Net, not the denoiser alone.

Key takeaways

  • LDMs run diffusion in a compressed latent space from a pretrained autoencoder, not in pixel space.
  • A spatial downsampling factor of 4 or 8 yields large compute savings while preserving quality.
  • Cross-attention layers inject text or other conditioning into the denoising U-Net.
  • Stable Diffusion is the canonical LDM, enabling text-to-image on a single consumer GPU.
  • The autoencoder caps recoverable detail, so fine textures and small text can degrade.

Frequently asked questions

How does latent diffusion differ from pixel-space diffusion?
Pixel-space diffusion denoises full-resolution images directly, which is compute-heavy. Latent diffusion first compresses images with an autoencoder, runs the diffusion process on the small latent, then decodes once, cutting memory and time substantially.

Why are latent diffusion models more efficient?
The diffusion U-Net operates on a latent grid compressed by a factor of 4 or 8 per side, so each denoising step processes far fewer elements. Only the final decode runs at full pixel resolution.

Is Stable Diffusion a latent diffusion model?
Yes. Stable Diffusion is the most widely used latent diffusion model. It pairs the latent denoising U-Net with a variational autoencoder and a text encoder, running the denoising loop in latent space.

What does the downsampling factor f control?
The factor f sets how aggressively the autoencoder compresses each spatial dimension. The original paper found that f=4 and f=8 balance compute savings against reconstruction quality; larger f loses detail, smaller f saves less compute.

What are the main limitations of latent diffusion models?
Whatever the autoencoder discards during compression cannot be regenerated, so fine textures and small legible text can degrade. Output quality is bounded jointly by the autoencoder and the denoising U-Net.