How Diffusion Models Create Images: From Noise to Art

Chapter 1

The Forward Process

To teach an AI how to build, we must first teach it how to destroy.

We take a crisp photograph and subject it to a mathematical process of slow, deliberate degradation. This is the forward diffusion process.

At each timestep t, a fixed procedure with no learned parameters adds a small amount of Gaussian noise according to a variance schedule β_t. The image becomes progressively noisier until it is statistically indistinguishable from pure noise.

Chapter 1

Injecting the Noise

The noise injection follows a precise mathematical formula:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) · x_{t-1}, β_t · I)

For a schedule with T = 1000 steps, the image at t = 1 is almost untouched. Around t = 250, fine details blur. Around t = 500, the subject is barely recognizable. The exact points depend on the variance schedule.

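The variance schedule controls how aggressively noise is added. A linear schedule increases β_t in equal increments from a small value to a larger one, which in practice wipes out most image structure well before the final timestep. A cosine schedule (used in improved models) preserves structure longer, so more timesteps are spent at useful intermediate noise levels.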
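The two schedules can be compared numerically. The following is a minimal sketch, not a reference implementation: the function names, the endpoint values 1e-4 and 0.02, and the offset s = 0.008 are assumptions chosen for illustration.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Betas spaced evenly between two endpoints (endpoint values are assumed)."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine curve for the cumulative signal level."""
    steps = np.arange(T + 1)
    f = np.cos(((steps / T) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                      # cumulative signal level, starts at 1
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)      # clip to avoid a degenerate last step

T = 1000
for name, betas in [("linear", linear_beta_schedule(T)),
                    ("cosine", cosine_beta_schedule(T))]:
    alpha_bar = np.cumprod(1.0 - betas)
    # Fraction of the original signal variance still present halfway through.
    print(name, "signal level at t=500:", round(float(alpha_bar[499]), 3))
```

Under these assumed settings, the cosine schedule retains noticeably more of the original signal at the midpoint than the linear one.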

Chapter 1

The Reparameterization Trick

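Instead of adding noise step by step, we can jump directly to any timestep t. Because a sum of independent Gaussian perturbations is itself Gaussian, the t individual steps collapse into a single closed-form expression, written using the reparameterization trick: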

x_t = √(ᾱ_t) · x₀ + √(1 - ᾱ_t) · ε

Here α_t = 1 - β_t, ᾱ_t = α_1 · α_2 · … · α_t is the cumulative product up to step t, and ε ~ N(0, I) is standard Gaussian noise.

This trick is what makes training diffusion models computationally feasible. We can sample any noise level in a single step during training.
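The closed-form jump can be sketched in a few lines of NumPy. The name q_sample, the linear schedule, and the random array standing in for a normalized image are assumptions for illustration only.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from x_0 in one step. `t` is 1-indexed, as in the text."""
    eps = rng.standard_normal(x0.shape)       # eps ~ N(0, I)
    a = alpha_bar[t - 1]                      # cumulative product up to step t
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)         # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32)) # stand-in for a normalized image
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Returning eps alongside x_t is a convenience: the training objective needs the exact noise that was mixed in.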

Chapter 1

Pure Entropy

By timestep T = 1000, the original photograph is effectively obliterated. The canvas looks like the static on an untuned analog television: isotropic Gaussian noise.

At this point, the image retains essentially no information about the original. Each pixel is, to a very good approximation, an independent draw from a standard normal distribution: x_T ~ N(0, I).
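How little of the original survives can be checked directly from the schedule. This sketch assumes a linear schedule from 1e-4 to 0.02 over 1000 steps; other schedules give different numbers.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

# In x_T = sqrt(alpha_bar_T) * x_0 + sqrt(1 - alpha_bar_T) * eps,
# these are the weights on the original image and on the noise.
signal_scale = float(np.sqrt(alpha_bar[-1]))
noise_scale = float(np.sqrt(1.0 - alpha_bar[-1]))
print("weight on x_0:", signal_scale)
print("weight on eps:", noise_scale)
```

Under this assumption the weight on x_0 falls below 0.01, which is why x_T is treated as pure noise even though the match to N(0, I) is approximate rather than exact.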

Why do this? Because the forward process is fixed and known, every intermediate image comes paired with the exact noise that produced it. Those pairs are the training data: they let a neural network learn, at every noise level, how to step back from "pure randomness" toward "structured data."

Chapter 2

The Reverse Process

Now comes the actual generation. We give the AI a brand-new canvas of pure random static. Its job is to reverse the entropy.

The model learns the reverse distribution p_θ(x_{t-1} | x_t): the probability of a slightly less noisy image given a noisy one. This is the core of diffusion.

But it does not try to guess the final image all at once. The reverse process is iterative, stochastic, and fundamentally Bayesian.

Chapter 2

The Neural Network Engine

To predict the reverse step, the model uses a massive neural network. Traditionally this is a U-Net — an encoder-decoder architecture with skip connections.

The U-Net processes the noisy image through downsampling blocks (encoder), passes through a bottleneck with self-attention, then reconstructs through upsampling blocks (decoder). Skip connections preserve spatial detail.

Modern systems increasingly use a Diffusion Transformer (DiT) instead. DiT patches the image into tokens, applies transformer blocks with adaptive layer norm, and outputs denoised patches.
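The patching step that turns an image (or latent) into tokens can be sketched with array reshapes. The function names patchify and unpatchify and the example sizes are assumptions for illustration; a real DiT also applies a learned linear projection and position embeddings, which are omitted here.

```python
import numpy as np

def patchify(x, patch):
    """Split a (C, H, W) array into (num_patches, C * patch * patch) tokens."""
    c, h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = x.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)            # (gh, gw, c, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)

def unpatchify(tokens, c, h, w, patch):
    """Inverse of patchify: reassemble tokens into a (C, H, W) array."""
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, c, patch, patch)
    x = x.transpose(2, 0, 3, 1, 4)            # (c, gh, patch, gw, patch)
    return x.reshape(c, h, w)

latent = np.arange(4 * 32 * 32, dtype=np.float64).reshape(4, 32, 32)
tokens = patchify(latent, patch=2)            # 16 x 16 grid of 2x2 patches
restored = unpatchify(tokens, 4, 32, 32, patch=2)
```

The transformer blocks operate on the token sequence, and the inverse reshape turns the output tokens back into a spatial array.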

Chapter 2

Predicting the Noise

Here is the critical insight: instead of predicting the clean image directly, the network predicts the noise that was added.

ε_θ(x_t, t) ≈ ε

The network looks at the noisy image x_t and timestep t, and outputs its best guess of the noise vector ε.
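A single training step under this objective can be sketched as follows. This is a simplified illustration: training_loss and zero_model are hypothetical names, the model is any callable taking (x_t, t), and a mean squared error between true and predicted noise is assumed as the loss.

```python
import numpy as np

def training_loss(model, x0, alpha_bar, rng):
    """One Monte Carlo estimate of the noise-prediction loss for a single image."""
    t = int(rng.integers(1, len(alpha_bar) + 1))   # random timestep in 1..T
    eps = rng.standard_normal(x0.shape)            # the noise the model must recover
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps # noised input at step t
    eps_pred = model(x_t, t)
    return float(np.mean((eps_pred - eps) ** 2))   # mean squared error

def zero_model(x_t, t):
    """Placeholder 'network' that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))
loss = training_loss(zero_model, x0, alpha_bar, rng)
```

A placeholder that always predicts zero scores a loss near 1, the variance of the noise itself; a trained network should do better than that baseline.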

This formulation is closely tied to denoising score matching. Up to a known scale factor, the predicted noise is the gradient of the log probability density (the score function) of the noised data distribution at that timestep: ε_θ(x_t, t) ≈ -√(1 - ᾱ_t) · ∇ log q(x_t).

Chapter 2

The Denoising Step

Once the noise is predicted, we compute the estimated clean image:

x̂₀ = (x_t - √(1-ᾱ_t) · ε_θ) / √(ᾱ_t)

Then we compute the mean of the reverse distribution, where α_t = 1 - β_t:

μ_θ = ( √(ᾱ_{t-1}) · β_t · x̂₀ + √(α_t) · (1 - ᾱ_{t-1}) · x_t ) / (1 - ᾱ_t)

Finally, we sample the next step with a small amount of stochasticity: x_{t-1} ~ N(μ_θ, σ_t² · I). At the last step (t = 1), no noise is added and the mean itself is the output.
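The three formulas above combine into one reverse step. In this sketch the function names are hypothetical, α_t = 1 - β_t, timesteps are 1-indexed, and σ_t² is set to the posterior variance β_t · (1 - ᾱ_{t-1}) / (1 - ᾱ_t), which is one common choice and an assumption here.

```python
import numpy as np

def reverse_mean(x_t, t, eps_pred, betas):
    """Mean of the reverse step, computed via the estimated clean image."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    a_bar_t = alpha_bar[t - 1]
    a_bar_prev = alpha_bar[t - 2] if t > 1 else 1.0
    x0_hat = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return (np.sqrt(a_bar_prev) * betas[t - 1] * x0_hat
            + np.sqrt(alphas[t - 1]) * (1.0 - a_bar_prev) * x_t) / (1.0 - a_bar_t)

def p_sample_step(x_t, t, eps_pred, betas, rng):
    """Sample x_{t-1} from N(mean, sigma_t^2 I); no noise is added at t = 1."""
    mean = reverse_mean(x_t, t, eps_pred, betas)
    if t == 1:
        return mean
    alpha_bar = np.cumprod(1.0 - betas)
    # Assumed variance choice: the posterior variance of the forward process.
    sigma2 = betas[t - 1] * (1.0 - alpha_bar[t - 2]) / (1.0 - alpha_bar[t - 1])
    return mean + np.sqrt(sigma2) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)          # assumed linear schedule
x_t = rng.standard_normal((4, 8, 8))
eps_pred = rng.standard_normal((4, 8, 8))      # stand-in for a network output
x_prev = p_sample_step(x_t, 500, eps_pred, betas, rng)
```

Substituting x̂₀ into the mean and simplifying gives the equivalent form (x_t - β_t / √(1 - ᾱ_t) · ε_θ) / √(α_t), which many implementations use directly.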

Chapter 2

Iterative Recovery

The model feeds the slightly cleaner image back into itself, guesses the next layer of noise, and subtracts it again.

This loop repeats for dozens or hundreds of steps. Out of the static, vague shapes begin to form. Edges sharpen. Colors separate. What began as pure mathematical randomness collapses into a coherent image.

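Training minimizes a sum of Kullback-Leibler divergences between each learned reverse step and the corresponding true posterior step, so a well-trained sampler moves its samples toward the data distribution as it runs. The process closely resembles annealed Langevin dynamics: noisy gradient ascent on the log-density of the data at gradually decreasing noise levels.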

Chapter 2

Accelerated Sampling

Standard DDPM sampling typically uses around 1000 steps, with one full network evaluation per step. That is far too slow for real-time use, so researchers developed faster methods.

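DDIM (Denoising Diffusion Implicit Models) reformulates the diffusion as a non-Markovian process while keeping the same training objective, so an already-trained noise predictor can be reused. Its sampler can skip timesteps and reach comparable quality in around 50 steps.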
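A deterministic DDIM-style update (the variant with no added noise) can be sketched as below. The function names, the evenly spaced timestep subset, and the placeholder model are assumptions for illustration, not a reproduction of any particular library's scheduler.

```python
import numpy as np

def ddim_timesteps(T, num_steps):
    """Evenly spaced, descending, 1-indexed subset of the T training timesteps."""
    return np.linspace(T, 1, num_steps).round().astype(int)

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """Deterministic update: estimate x_0, then re-noise it to the earlier level."""
    x0_hat = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    return np.sqrt(a_bar_prev) * x0_hat + np.sqrt(1.0 - a_bar_prev) * eps_pred

def ddim_sample(model, shape, alpha_bar, num_steps, rng):
    """Run the sampler from pure noise using only `num_steps` network calls."""
    x = rng.standard_normal(shape)
    ts = ddim_timesteps(len(alpha_bar), num_steps)
    for i, t in enumerate(ts):
        a_bar_t = alpha_bar[t - 1]
        # After the last selected timestep, the target is the clean image.
        a_bar_prev = alpha_bar[ts[i + 1] - 1] if i + 1 < len(ts) else 1.0
        x = ddim_step(x, model(x, int(t)), a_bar_t, a_bar_prev)
    return x

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
placeholder = lambda x, t: np.zeros_like(x)    # stands in for a trained network
sample = ddim_sample(placeholder, (4, 8, 8), alpha_bar, 50, np.random.default_rng(0))
```

Because each update depends only on the two cumulative noise levels involved, the sampler is free to jump between non-adjacent timesteps.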

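Consistency Models learn to map any noise level directly to the clean image in a single step. This enables near real-time generation, but it requires a dedicated training or distillation stage and usually gives up some quality compared with many-step sampling.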

Stable Diffusion typically uses 20–50 steps with a scheduler like PNDM or Euler.

Chapter 3

Guiding the Dream

If we let the model reverse noise blindly, it would hallucinate a random image. How do we make it draw exactly what we want? We use conditioning.

The most common form is text conditioning: you type "a futuristic city at sunset," and a text encoder (like CLIP) converts those words into a sequence of embedding vectors, one per token, which we denote c.

This embedding is fed into the U-Net's attention layers at several resolutions, steering the denoising toward images that match the text description.

Chapter 3

Cross-Attention Mechanism

The text vectors are injected using cross-attention layers inside the U-Net. This is the same attention mechanism from transformers — but now operating between text and image modalities.

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

The image features serve as Queries. The text embeddings serve as Keys and Values. Where text and image features align strongly, the attention weight is high — and the image is "pulled" toward that semantic concept.
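The roles of Queries, Keys, and Values can be made concrete with a single-head sketch. The projection matrices W_q, W_k, W_v, the feature sizes, and the random inputs are assumptions for illustration; real models use learned weights and multiple heads.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_feats, W_q, W_k, W_v):
    """Single-head cross-attention: image positions attend over text tokens."""
    Q = image_feats @ W_q                     # (n_img, d_k) queries from the image
    K = text_feats @ W_k                      # (n_txt, d_k) keys from the text
    V = text_feats @ W_v                      # (n_txt, d_v) values from the text
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_img, n_txt)
    return weights @ V, weights

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((16, 8))    # 16 spatial positions, 8 features
text_feats = rng.standard_normal((5, 6))      # 5 text tokens, 6 features
W_q = rng.standard_normal((8, 4))
W_k = rng.standard_normal((6, 4))
W_v = rng.standard_normal((6, 4))
out, weights = cross_attention(image_feats, text_feats, W_q, W_k, W_v)
```

Each row of weights is a probability distribution over the text tokens, so every image position receives a weighted mix of text-derived values.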

Chapter 3

Classifier-Free Guidance

Simple conditioning often produces bland, averaged results. To increase fidelity and adherence to the prompt, models use Classifier-Free Guidance (CFG).

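During training, the conditioning is randomly replaced with an empty (null) input ∅ for a fraction of the examples, so a single network learns two behaviors: with conditioning (ε_θ(x, t, c)) and without (ε_θ(x, t, ∅)).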

At inference, the predicted noise is extrapolated away from the unconditional prediction:

ε_guided = ε_unc + w · (ε_cond - ε_unc)

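The guidance scale w (typically 7–12.5) controls how strongly the prompt influences the result. Setting w = 1 recovers the plain conditional prediction. Higher w gives more prompt fidelity but less diversity, and very high values tend to produce oversaturated or distorted images.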
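The guidance formula is a one-line extrapolation. In this sketch, cfg_combine is a hypothetical name and the two arrays stand in for the network's unconditional and conditional noise predictions.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 8, 8))   # stand-in for eps_theta(x, t, null)
eps_cond = rng.standard_normal((4, 8, 8))     # stand-in for eps_theta(x, t, c)
eps_guided = cfg_combine(eps_uncond, eps_cond, w=7.5)
```

In practice this costs two network evaluations per sampling step, one with the prompt and one with the null input; they are often batched together.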

Chapter 3

The Latent Space

Doing this math on millions of high-resolution pixels is incredibly slow. A 1024×1024 RGB image has 3.1 million values.

Modern systems like Stable Diffusion solve this with a Variational Autoencoder (VAE). The VAE encoder compresses the image into a latent space: a compact, lower-dimensional representation (typically 64×64 with 4 channels for a 512×512 input).

The entire diffusion process — forward and reverse — happens in this compressed latent space. Only at the final step does the VAE decoder "unzip" the latent tensor back into visible pixels.

This cuts the number of values the diffusion network must process by roughly 48× (512×512×3 pixel values versus 64×64×4 latent values) while preserving perceptual quality.
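The 48× figure follows from simple arithmetic. This sketch assumes a VAE that downsamples each spatial side by 8 and produces 4 latent channels, as in early Stable Diffusion releases; other autoencoders give other ratios.

```python
def compression_ratio(height, width, channels=3, downsample=8, latent_channels=4):
    """Ratio of pixel-space values to latent-space values (assumed VAE shape)."""
    pixel_values = height * width * channels
    latent_values = (height // downsample) * (width // downsample) * latent_channels
    return pixel_values / latent_values

ratio_512 = compression_ratio(512, 512)       # 786432 values vs 16384 values
ratio_1024 = compression_ratio(1024, 1024)    # 3145728 values vs 65536 values
print(ratio_512, ratio_1024)                  # prints: 48.0 48.0
```

The ratio counts stored values, not floating-point operations, so the actual speedup depends on the network architecture.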

Chapter 3

Timestep Embeddings

The network must know how noisy the current image is. This is communicated through timestep embeddings.

The timestep t is converted into a sinusoidal position encoding — the same technique used in transformers. This encoding is added to the network activations at multiple layers.

Without this, the network would have no way to distinguish a slightly noisy image from a heavily corrupted one. The timestep tells it: "be aggressive with denoising" at high t, and "be gentle and refine details" at low t.
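A sinusoidal timestep embedding can be sketched as follows. The function name, the even embedding size, and the max_period value of 10000 are assumptions borrowed from the usual transformer position encoding; real networks typically pass the result through a small MLP before using it.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Map scalar or batched timesteps to `dim`-dimensional sin/cos features.

    `dim` is assumed to be even: half the features are sines, half cosines.
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[..., None] * freqs
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)

emb_low = timestep_embedding(10, dim=128)             # lightly noised input
emb_high = timestep_embedding(900, dim=128)           # heavily noised input
emb_batch = timestep_embedding([1, 250, 500, 1000], dim=128)
```

Using many frequencies lets the network tell nearby timesteps apart while still seeing a smooth trend across the whole range.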

Conclusion

Putting It All Together

Diffusion models are not simply "stitching together" parts of images they memorized. They are performing mathematical sculpting in high-dimensional space.

They combine: (1) a fixed forward process that turns data into noise and supplies the training signal, (2) a neural network that learns to reverse this destruction by predicting noise, (3) cross-attention mechanisms that steer generation with semantic text vectors, and (4) a latent space that makes the entire process computationally tractable.

This architecture powers the visual AI revolution — from generating marketing assets to designing novel proteins in structural biology.
