How Diffusion Models Create Images

From Gaussian noise to photorealistic art — the mathematics of visual AI.

By Milos · June 05, 2026 · Interactive Article

Diffusion models power the visual AI revolution: DALL-E, Stable Diffusion, Midjourney, and the image generators reshaping design, marketing, and biotechnology. But how do they actually create images from thin air?

This article explains the full mechanism through interactive visualizations. As you scroll, the graphics animate each step: from the forward process that destroys images with Gaussian noise, to the reverse process that reconstructs them through iterative denoising, to the conditioning mechanisms that steer generation with text prompts.

Chapter 1

The Forward Process

To teach an AI how to build, we must first teach it how to destroy.

We take a crisp photograph and subject it to a mathematical process of slow, deliberate degradation. This is the forward diffusion process.

At each timestep t, the forward process adds a small amount of Gaussian noise according to a variance schedule β_t. Nothing is learned at this stage: the schedule is fixed in advance. The image becomes progressively noisier until it is statistically indistinguishable from pure noise.

Injecting the Noise

The noise injection follows a precise mathematical formula:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) · x_{t-1}, β_t · I)

At t = 1, the image is slightly grainy. At t = 250, details blur. At t = 500, the subject is barely recognizable.

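The variance schedule controls how aggressively noise is added. A linear schedule increases β_t in equal increments from a small starting value to a larger final one. A cosine schedule (used in improved models) destroys information more slowly, so recognizable structure survives further into the process.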
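
The difference between the two schedules is easiest to see by comparing how much of the original signal survives at a given timestep. The sketch below is illustrative only: the endpoints 1e-4 and 0.02 and the offset s = 0.008 are commonly cited defaults assumed here, not values taken from this article, and the helper names are hypothetical.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_t for t = 1..T (returned as a 0-indexed list)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alpha_bars(betas):
    """alpha_bar_t = product of (1 - beta_i) for all i <= t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_t for t = 1..T defined directly by a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
linear_bars = cumulative_alpha_bars(linear_betas(T))
cosine_bars = cosine_alpha_bars(T)

# Fraction of the original signal variance still present halfway through.
print(f"linear alpha_bar at t=500: {linear_bars[499]:.4f}")
print(f"cosine alpha_bar at t=500: {cosine_bars[499]:.4f}")
```

Under these assumptions the cosine curve keeps noticeably more signal at the midpoint, which is the sense in which it preserves structure longer.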

The Reparameterization Trick

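Instead of adding noise step by step, we can jump directly to any timestep t. A chain of Gaussian steps is itself Gaussian, so x_t has a closed form in terms of x_0, and the reparameterization trick lets us sample it from a single noise draw:

x_t = √(ᾱ_t) · x_0 + √(1 - ᾱ_t) · ε

Where α_t = 1 - β_t, ᾱ_t = α_1 · α_2 · … · α_t is the cumulative product up to timestep t, and ε ~ N(0, I) is random Gaussian noise.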

This trick is what makes training diffusion models computationally feasible. We can sample any noise level in a single step during training.
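
As a rough illustration, the closed-form jump can be written in a few lines. This is a minimal sketch on a toy four-value "image", assuming a linear schedule from 1e-4 to 0.02; the function name `q_sample` and all values are illustrative choices rather than anything specified in the article.

```python
import math
import random

def alpha_bars_from_betas(betas):
    """alpha_bar_t = product of (1 - beta_i) for all i <= t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0 (t is 1-indexed). Returns (x_t, eps)."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars = alpha_bars_from_betas(betas)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]  # toy stand-in for image pixels
for t in (1, 250, 500, 1000):
    x_t, _ = q_sample(x0, t, alpha_bars, rng)
    print(t, [round(v, 3) for v in x_t])
```

No loop over intermediate timesteps is needed, which is why a training batch can mix arbitrary noise levels at no extra cost.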

Pure Entropy

By timestep T = 1000, the original photograph is completely obliterated. The canvas looks exactly like the static on a broken television screen — pure isotropic Gaussian noise.

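At this point, the image retains essentially no information about the original: ᾱ_T is so close to zero that every pixel is, to a very good approximation, an independent draw from a standard normal distribution, x_T ~ N(0, I).

Why do this? Because every noisy image along the way comes paired with the exact noise that produced it. The forward process itself learns nothing, but it supplies unlimited training examples of the path from "structured data" to "pure randomness." The network that learns to undo those steps is what captures the geometry of image space.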

Chapter 2

The Reverse Process

Now comes the actual generation. We give the AI a brand-new canvas of pure random static. Its job is to reverse the entropy.

The model learns the reverse distribution p_θ(x_{t-1} | x_t): the probability of a slightly less noisy image given a noisy one. This is the core of diffusion.

But it does not try to guess the final image all at once. The reverse process is iterative and stochastic, and each step is derived with Bayes' rule from the known forward process.

The Neural Network Engine

To predict the reverse step, the model uses a massive neural network. Traditionally this is a U-Net — an encoder-decoder architecture with skip connections.

The U-Net processes the noisy image through downsampling blocks (encoder), passes through a bottleneck with self-attention, then reconstructs through upsampling blocks (decoder). Skip connections preserve spatial detail.

Modern systems increasingly use a Diffusion Transformer (DiT) instead. DiT splits the (usually latent) image into patches, embeds each patch as a token, applies transformer blocks conditioned through adaptive layer norm, and maps the output tokens back to a per-patch noise prediction.
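
The patching step can be sketched as follows. This is a simplified illustration on a single-channel 4×4 grid; a real DiT works on multi-channel latents and follows the split with a learned linear embedding, both of which are omitted here. The function name `patchify` is a hypothetical helper.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch tokens, row-major."""
    height, width = len(image), len(image[0])
    if height % patch != 0 or width % patch != 0:
        raise ValueError("patch size must divide both height and width")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens

# A 4x4 grid whose entries are just their own row-major index.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
for token in tokens:
    print(token)
```

Each token is then treated like a word in a sentence, which is what lets a standard transformer operate on image data.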

Predicting the Noise

Here is the critical insight: instead of predicting the clean image directly, the network predicts the noise that was added.

ε_θ(x_t, t) ≈ ε

The network looks at the noisy image xt and timestep t, and outputs its best guess of the noise vector ε.

This formulation is equivalent, up to a scaling factor, to denoising score matching: the network learns the gradient of the log probability density (the score function) of the noised data distribution at each timestep, with ∇ log q(x_t) ≈ -ε_θ(x_t, t) / √(1 - ᾱ_t).
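
In training terms, noise prediction reduces to a mean squared error between the noise that was actually added and the network's guess. The sketch below assumes that simple loss and uses two stand-in "models" in place of a neural network: one that always predicts zero, and a hypothetical oracle that is only possible because the toy dataset contains a single known image.

```python
import math
import random

def noise_prediction_loss(model, x0, t, alpha_bar_t, rng):
    """Mean squared error between the true noise and the model's prediction."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    eps_hat = model(x_t, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)

def zero_model(x_t, t):
    """Baseline stand-in: always predicts 'no noise'."""
    return [0.0 for _ in x_t]

def make_oracle(x0_known, alpha_bar_t):
    """Hypothetical perfect predictor for a dataset with one known image."""
    def model(x_t, t):
        return [(x - math.sqrt(alpha_bar_t) * k) / math.sqrt(1.0 - alpha_bar_t)
                for x, k in zip(x_t, x0_known)]
    return model

x0 = [0.5, -0.25, 1.0, 0.0]
alpha_bar_t = 0.3  # illustrative noise level, not tied to a specific schedule
zero_loss = noise_prediction_loss(zero_model, x0, 500, alpha_bar_t, random.Random(0))
oracle_loss = noise_prediction_loss(make_oracle(x0, alpha_bar_t), x0, 500,
                                    alpha_bar_t, random.Random(0))
print("zero model loss:", zero_loss)
print("oracle loss:    ", oracle_loss)
```

A real network sits somewhere between these two extremes, and gradient descent on this loss pushes it toward the oracle's behavior across the whole dataset.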

The Denoising Step

Once the noise is predicted, we compute the estimated clean image:

x̂_0 = (x_t - √(1 - ᾱ_t) · ε_θ) / √(ᾱ_t)

Then we compute the mean of the reverse distribution:

μ_θ = ( √(ᾱ_{t-1}) · β_t · x̂_0 + √(α_t) · (1 - ᾱ_{t-1}) · x_t ) / (1 - ᾱ_t)

Finally, we sample the next step with a small amount of stochasticity: x_{t-1} ~ N(μ_θ, σ_t² · I).
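
The three computations above fit into a single function. This is a minimal sketch, assuming the common variance choice σ_t² = β_t and no added noise on the final step; both are assumptions, since the article does not fix σ_t. To make the result checkable, the demo feeds in the true noise instead of a network prediction.

```python
import math
import random

def reverse_step(x_t, eps_hat, t, betas, alpha_bars, rng):
    """One ancestral sampling step x_t -> x_{t-1} (t is 1-indexed)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0  # alpha_bar_0 = 1

    # Estimated clean image implied by the predicted noise.
    x0_hat = [(x - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
              for x, e in zip(x_t, eps_hat)]
    # Mean of the reverse distribution: a weighted blend of x0_hat and x_t.
    c0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
    ct = math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    mean = [c0 * h + ct * x for h, x in zip(x0_hat, x_t)]
    # Assumed variance: sigma_t^2 = beta_t, with no noise at t = 1.
    sigma = math.sqrt(beta_t) if t > 1 else 0.0
    x_prev = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x_prev, x0_hat

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
t = 500
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = [math.sqrt(alpha_bars[t - 1]) * x + math.sqrt(1.0 - alpha_bars[t - 1]) * e
       for x, e in zip(x0, eps)]

# With the true noise as the "prediction", x0_hat should match x0.
x_prev, x0_hat = reverse_step(x_t, eps, t, betas, alpha_bars, rng)
print("x0_hat:", [round(v, 6) for v in x0_hat])
```

In practice `eps_hat` comes from the trained network, so `x0_hat` is only an estimate, and it improves as t decreases.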

Iterative Recovery

The model feeds the slightly cleaner image back into itself, guesses the next layer of noise, and subtracts it again.

This loop repeats for dozens or hundreds of steps. Out of the static, vague shapes begin to form. Edges sharpen. Colors separate. What began as pure mathematical randomness collapses into a coherent image.

Training minimizes the Kullback-Leibler divergence between each learned reverse step and the true posterior of the forward process. Sampling then closely resembles annealed Langevin dynamics: each step moves the sample up the gradient of the log-density of the noised data and then adds a little fresh noise.
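
The full loop can be sketched end to end on a toy problem. To keep the example self-contained and verifiable, the trained network is replaced by a hypothetical oracle for a "dataset" containing one known image, so the loop should walk from pure noise back to that image. The schedule and the variance choice σ_t² = β_t are assumptions.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

TARGET = [0.5, -0.25, 1.0, 0.0]  # the single "training image"

def oracle_eps(x_t, t):
    """Stand-in for the network: exact noise when the data is only TARGET."""
    a_bar = alpha_bars[t - 1]
    return [(x - math.sqrt(a_bar) * k) / math.sqrt(1.0 - a_bar)
            for x, k in zip(x_t, TARGET)]

def sample(rng):
    """Run the reverse process from t = T down to t = 1."""
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]  # start from pure noise
    for t in range(T, 0, -1):
        beta_t, a_bar_t = betas[t - 1], alpha_bars[t - 1]
        a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
        eps_hat = oracle_eps(x, t)
        x0_hat = [(v - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
                  for v, e in zip(x, eps_hat)]
        c0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar_t)
        ct = math.sqrt(1.0 - beta_t) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
        sigma = math.sqrt(beta_t) if t > 1 else 0.0  # no noise on the last step
        x = [c0 * h + ct * v + sigma * rng.gauss(0.0, 1.0)
             for h, v in zip(x0_hat, x)]
    return x

result = sample(random.Random(0))
print([round(v, 4) for v in result])
```

With a real network and a real dataset the endpoint is not a single fixed image but a fresh sample from the learned distribution.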

Accelerated Sampling

Standard DDPM sampling uses one network evaluation per training timestep, typically 1000, which is far too slow for real-time use. Researchers developed faster methods.

DDIM (Denoising Diffusion Implicit Models) reformulates the forward process as non-Markovian, which yields a sampler that can skip timesteps and reach comparable quality in around 50 steps.

Consistency Models learn to map any noise level directly to the clean image in a single step. This enables real-time generation but requires more training.

Stable Diffusion typically uses 20–50 steps with a scheduler like PNDM or Euler.
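
To show what skipping timesteps looks like, here is a sketch of the deterministic DDIM update (the η = 0 case) run over 50 evenly spaced timesteps out of 1000. As a simplification, the network is again replaced by a hypothetical oracle for a single known image; the schedule and the even spacing are assumptions, and production schedulers differ in their details.

```python
import math
import random

T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

TARGET = [0.5, -0.25, 1.0, 0.0]  # the single "training image"

def oracle_eps(x_t, t):
    """Stand-in for the network: exact noise when the data is only TARGET."""
    a_bar = alpha_bars[t - 1]
    return [(x - math.sqrt(a_bar) * k) / math.sqrt(1.0 - a_bar)
            for x, k in zip(x_t, TARGET)]

def ddim_step(x_t, eps_hat, a_bar_t, a_bar_prev):
    """Deterministic DDIM update (eta = 0) between two noise levels."""
    x0_hat = [(x - math.sqrt(1.0 - a_bar_t) * e) / math.sqrt(a_bar_t)
              for x, e in zip(x_t, eps_hat)]
    return [math.sqrt(a_bar_prev) * h + math.sqrt(1.0 - a_bar_prev) * e
            for h, e in zip(x0_hat, eps_hat)]

def ddim_sample(num_steps, rng):
    """Visit only every (T // num_steps)-th timestep on the way down."""
    stride = T // num_steps
    timesteps = list(range(T, 0, -stride))
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for i, t in enumerate(timesteps):
        t_prev = timesteps[i + 1] if i + 1 < len(timesteps) else 0
        a_bar_prev = alpha_bars[t_prev - 1] if t_prev > 0 else 1.0
        x = ddim_step(x, oracle_eps(x, t), alpha_bars[t - 1], a_bar_prev)
    return x, len(timesteps)

result, steps_used = ddim_sample(50, random.Random(0))
print(steps_used, [round(v, 4) for v in result])
```

The update never draws fresh noise, so the same starting noise always maps to the same output, which is one reason this style of sampler tolerates large jumps.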

Chapter 3

Guiding the Dream

If we let the model reverse noise blindly, it would hallucinate a random image. How do we make it draw exactly what we want? We use conditioning.

The most common form is text conditioning: you type "a futuristic city at sunset," and a text encoder (like CLIP) converts those words into a sequence of embedding vectors c, one per token.

These embeddings are injected into the U-Net at multiple layers, steering the denoising toward images that match the text description.

Cross-Attention Mechanism

The text vectors are injected using cross-attention layers inside the U-Net. This is the same attention mechanism from transformers — but now operating between text and image modalities.

Attention(Q, K, V) = softmax(Q · K^T / √(d_k)) · V

The Queries are linear projections of the image features. The Keys and Values are linear projections of the text embeddings. Where text and image features align strongly, the attention weight is high, and the image is "pulled" toward that semantic concept.
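
A stripped-down version of this computation is sketched below. It assumes the queries, keys, and values are already given, so the learned projection matrices of a real model are left out, and it uses two image positions and three text tokens with hand-picked values purely for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: image queries attend over text keys/values."""
    d_k = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights.append(softmax(scores))
    outputs = [[sum(w * v[j] for w, v in zip(row, values))
                for j in range(len(values[0]))]
               for row in weights]
    return outputs, weights

# Two image positions (queries) and three text tokens (keys and values).
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

Each image position ends up with a weighted mix of the text token values, dominated by the token whose key it matches best.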

Classifier-Free Guidance

Simple conditioning often produces bland, averaged results. To increase fidelity and adherence to the prompt, models use Classifier-Free Guidance (CFG).

During training, the prompt is randomly replaced with an empty one for a fraction of the examples, so the same model learns two behaviors: with conditioning (ε_θ(x_t, t, c)) and without (ε_θ(x_t, t, ∅)).

At inference, the predicted noise is extrapolated away from the unconditional prediction:

ε_guided = ε_unc + w · (ε_cond - ε_unc)

The guidance scale w (typically 7–12.5) controls how strongly the prompt influences the result. Setting w = 1 recovers plain conditional sampling; higher values trade diversity for prompt fidelity.
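
The guidance formula is a one-line blend of two noise predictions. The sketch below applies it to made-up two-element vectors; in a real sampler both inputs would come from running the network twice per step, once with the prompt and once without.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Push the prediction away from the unconditional one by a factor w."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # illustrative prediction without the prompt
eps_cond = [1.0, 1.0]    # illustrative prediction with the prompt

for w in (0.0, 1.0, 7.5):
    print(w, classifier_free_guidance(eps_uncond, eps_cond, w))
```

Only the component where the two predictions disagree gets amplified, which is why guidance sharpens prompt-specific features while leaving shared structure untouched.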

The Latent Space

Doing this math on millions of high-resolution pixels is incredibly slow. A 1024×1024 RGB image has 3.1 million values.

Modern systems like Stable Diffusion solve this with a Variational Autoencoder (VAE). The VAE encoder compresses the image into a latent space: a compact, lower-dimensional representation (typically 64×64×4 for a 512×512 image).

The entire diffusion process — forward and reverse — happens in this compressed latent space. Only at the final step does the VAE decoder "unzip" the latent tensor back into visible pixels.

For a 512×512 RGB image this leaves 48× fewer values to process, which cuts computation substantially while preserving perceptual quality.
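
The 48× figure can be reproduced by counting values, assuming a 512×512 RGB input and a 64×64 latent with 4 channels (the configuration commonly associated with Stable Diffusion; treat these sizes as assumptions):

```python
pixel_values = 512 * 512 * 3   # height x width x RGB channels
latent_values = 64 * 64 * 4    # latent height x width x latent channels

ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)  # 786432 16384 48.0
```

Note that this counts values, not operations. The actual speedup depends on the network architecture, since attention cost in particular grows faster than linearly with the number of spatial positions.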

Timestep Embeddings

The network must know how noisy the current image is. This is communicated through timestep embeddings.

The timestep t is converted into a sinusoidal position encoding — the same technique used in transformers. This encoding is added to the network activations at multiple layers.

Without this, the network would have no way to distinguish a slightly noisy image from a heavily corrupted one. The timestep tells it: "be aggressive with denoising" at high t, and "be gentle and refine details" at low t.
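
A sinusoidal timestep encoding can be sketched in a few lines. The layout below (all sines followed by all cosines, with a maximum period of 10000) is one common convention and is an assumption here; implementations differ in ordering and usually pass the result through a small learned network before adding it to the activations.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Encode a scalar timestep as `dim` sinusoidal features (dim must be even)."""
    half = dim // 2
    # Frequencies fall geometrically from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return ([math.sin(t * f) for f in freqs] +
            [math.cos(t * f) for f in freqs])

for t in (0, 10, 500):
    print(t, [round(v, 3) for v in timestep_embedding(t, dim=8)])
```

The mix of fast and slow frequencies lets the network distinguish neighboring timesteps as well as timesteps that are far apart.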

Conclusion

Putting It All Together

Diffusion models are not simply "stitching together" parts of images they memorized. They are performing mathematical sculpting in high-dimensional space.

They combine: (1) a forward process that turns data into a training signal by destroying it, (2) a neural network that learns to reverse this destruction by predicting noise, (3) cross-attention mechanisms that steer generation with semantic text vectors, and (4) a latent space that makes the entire process computationally tractable.

This architecture powers the visual AI revolution — from generating marketing assets to designing novel proteins in structural biology.

Diffusion models represent one of the most elegant ideas in modern machine learning: learning by destruction. By systematically adding noise to data and learning to reverse the process, these models capture the full statistical structure of image distributions — not just surface patterns, but the deep geometry that separates a coherent image from random static.

For organizations in regulated industries, understanding these fundamentals matters. Whether you are evaluating AI-generated content for regulatory submissions, assessing the reliability of synthetic training data, or building quality frameworks for generative AI, knowing how diffusion works helps you ask better questions and manage risk.

The math is approachable. The implications are enormous. And the field is evolving fast — from faster samplers to 3D generation to video diffusion. The principles you have just explored will remain the foundation.

About the author: Milos is a consultant at Excellence Consulting by Mashup, specializing in AI governance, regulatory compliance, and generative AI strategy for regulated industries.