How Diffusion Models Create Images: From Noise to Art

Chapter 1

The Forward Process

To teach an AI how to build, we must first teach it how to destroy.

We take a crisp photograph and subject it to a mathematical process of slow, deliberate degradation. This is the forward diffusion process.

At each timestep t, a small amount of Gaussian noise is added according to a variance schedule β_t. Nothing is learned at this stage: the process is fixed in advance. The image becomes progressively noisier until almost nothing but noise remains.

Noise Injection

The noise injection follows a precise mathematical formula:

q(x_t | x_{t-1}) = N(x_t; √(1-β_t) · x_{t-1}, β_t I)

At t = 1, the image is slightly grainy. At t = 250, details blur. At t = 500, the subject is barely recognizable.
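A single noise-injection step can be sketched as follows. This is a minimal illustration, assuming the image is flattened into a list of floats and using only Python's standard library; the function name `forward_step` and the β value are illustrative choices, not part of any particular library.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), element-wise."""
    scale = math.sqrt(1.0 - beta_t)   # shrinks the signal slightly
    sigma = math.sqrt(beta_t)         # standard deviation of the added noise
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x_prev]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]           # toy "image" of four pixel values
x1 = forward_step(x0, beta_t=1e-4, rng=rng)
```

With a β this small, `x1` differs from `x0` only by a faint grain, which is why many steps are needed before the image dissolves.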

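The variance schedule controls how aggressively noise is added. A linear schedule increases β_t in equal increments from a small starting value to a larger final one. A cosine schedule (used in improved models) removes signal more gradually, so the image keeps its structure for longer before it is finally lost.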
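To make the difference between the two schedules concrete, the sketch below compares how much of the original signal survives at the halfway point. The endpoints 1e-4 and 0.02 and the offset s = 0.008 are commonly cited defaults and are assumptions here, as are the function names.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Beta grows in equal increments from beta_start to beta_end over T steps."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars_from_betas(betas):
    """Running product of (1 - beta_i): the fraction of signal variance left."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def cosine_alpha_bars(T, s=0.008):
    """Signal level for t = 1..T under a cosine schedule, defined directly."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
linear_bar = alpha_bars_from_betas(linear_betas(T))
cosine_bar = cosine_alpha_bars(T)
halfway = T // 2 - 1   # list index of timestep t = 500
print(f"signal left at t=500: linear={linear_bar[halfway]:.3f}, "
      f"cosine={cosine_bar[halfway]:.3f}")
```

Under these assumptions the cosine schedule retains noticeably more signal at t = 500 than the linear one, which matches the "preserves structure longer" description.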

The Reparameterization Trick

Instead of adding noise step-by-step, we can jump directly to any timestep t using the reparameterization trick:

x_t = √(ᾱ_t) · x₀ + √(1 - ᾱ_t) · ε

Where α_i = 1 - β_i, ᾱ_t = α_1 · α_2 · … · α_t is the cumulative product of these factors up to timestep t, and ε ~ N(0, I) is random Gaussian noise.

This trick is what makes training diffusion models computationally feasible. We can sample any noise level in a single step during training.
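The one-step jump can be sketched like this. It is a minimal illustration using the standard library only; `q_sample` is a hypothetical helper name, and the linear schedule endpoints are assumed rather than prescribed by the text.

```python
import math
import random

def q_sample(x0, alpha_bar_t, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps in one step."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * v + b * e for v, e in zip(x0, eps)]
    return x_t, eps   # eps is returned because it becomes the training target

# Precompute alpha_bar for every timestep (assumed linear schedule).
T = 1000
alpha_bars, prod = [], 1.0
for i in range(T):
    beta = 1e-4 + i * (0.02 - 1e-4) / (T - 1)
    prod *= 1.0 - beta
    alpha_bars.append(prod)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
t = 250
x_t, eps = q_sample(x0, alpha_bars[t - 1], rng)   # no loop over 250 steps needed
```

Because `alpha_bars` is computed once, drawing a training example at any noise level costs a single multiply-add per pixel.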

Pure Entropy

By timestep T = 1000, the original photograph is completely obliterated. The canvas looks exactly like the static on a broken television screen — pure isotropic Gaussian noise.

At this point, the image contains essentially no information about the original. To a very good approximation, every pixel is drawn independently from a standard normal distribution: x_T ~ N(0, I).
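How little of the photograph survives depends on the schedule. As a rough check, assuming a linear schedule from 1e-4 to 0.02 over T = 1000 steps (an assumption, not something the text fixes), the coefficients on the image and on the noise at the final step can be computed directly:

```python
import math

T = 1000
beta_start, beta_end = 1e-4, 0.02   # assumed schedule endpoints

alpha_bar = 1.0
for i in range(T):
    beta = beta_start + i * (beta_end - beta_start) / (T - 1)
    alpha_bar *= 1.0 - beta

signal_scale = math.sqrt(alpha_bar)        # coefficient on x_0 inside x_T
noise_scale = math.sqrt(1.0 - alpha_bar)   # coefficient on the Gaussian noise
print(f"signal coefficient: {signal_scale:.4f}, noise coefficient: {noise_scale:.6f}")
```

The signal coefficient comes out well below one percent, so x_T is noise for all practical purposes even though it is not exactly N(0, I).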

Why do this? Because destroying the image step by step lays out a known, mathematically tractable path from "structured data" to "pure randomness." Every point along that path becomes a training example of noise that the network must learn to undo.

Chapter 2

The Reverse Process

Now comes the actual generation. We give the AI a brand-new canvas of pure random static. Its job is to reverse the entropy.

The model learns the reverse distribution p_θ(x_{t-1} | x_t): the probability of a slightly less noisy image given a noisy one. This is the core of diffusion.

But it does not try to guess the final image all at once. The reverse process is iterative, stochastic, and fundamentally Bayesian.

The Neural Engine

To predict the reverse step, the model uses a massive neural network. Traditionally this is a U-Net — an encoder-decoder architecture with skip connections.

The U-Net processes the noisy image through downsampling blocks (encoder), passes through a bottleneck with self-attention, then reconstructs through upsampling blocks (decoder). Skip connections preserve spatial detail.

Modern systems increasingly use a Diffusion Transformer (DiT) instead. DiT patches the image into tokens, applies transformer blocks with adaptive layer norm, and outputs denoised patches.
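The "patches the image into tokens" step can be sketched in a few lines. This is a toy illustration on a single-channel grid stored as a list of rows; `patchify` is a hypothetical helper, and a real DiT would also apply a learned projection and positional information to each token.

```python
def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch tokens, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "size must divide evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens

# A 4x4 grid cut into 2x2 patches gives 4 tokens of length 4.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
```

Each token then plays the same role a word embedding plays in a language transformer.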

Noise Prediction

Here is the critical insight: instead of predicting the clean image directly, the network predicts the noise that was added.

ε_θ(x_t, t) ≈ ε

The network looks at the noisy image x_t and timestep t, and outputs its best guess of the noise vector ε.

This formulation is equivalent, up to a scaling factor, to denoising score matching: the network learns the gradient of the log probability density (the score function) of the noised data distribution at each noise level, since ∇ log q(x_t) ≈ -ε_θ(x_t, t) / √(1 - ᾱ_t).
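In training terms, the noise-prediction idea becomes a simple mean squared error. The sketch below assumes the network is any callable taking (x_t, t); `zero_predictor` is a hypothetical untrained stand-in used only so the example runs on its own.

```python
import math
import random

def noise_prediction_loss(predict_noise, x0, alpha_bar_t, t, rng):
    """Mean squared error between the true noise and the model's guess."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    x_t = [a * v + b * e for v, e in zip(x0, eps)]   # noisy input shown to the model
    eps_hat = predict_noise(x_t, t)
    return sum((e - h) ** 2 for e, h in zip(eps, eps_hat)) / len(x0)

def zero_predictor(x_t, t):
    """Hypothetical stand-in for the network: always guesses zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
loss = noise_prediction_loss(zero_predictor, x0, alpha_bar_t=0.5, t=500, rng=rng)
```

A real training loop would repeat this for random images and random timesteps, adjusting the network's weights to push the loss toward zero.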

The Denoising Step

Once the noise is predicted, we compute the estimated clean image:

x̂₀ = (x_t - √(1-ᾱ_t) · ε_θ) / √(ᾱ_t)

Then we compute the mean of the reverse distribution, where α_t = 1 - β_t:

μ_θ = ( √(ᾱ_{t-1}) · β_t · x̂₀ + √(α_t) · (1 - ᾱ_{t-1}) · x_t ) / (1 - ᾱ_t)

Finally, we sample the next step with a small amount of stochasticity: x_{t-1} ~ N(μ_θ, σ_t² I).
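The three formulas above can be sketched as one function. This is a minimal illustration with assumed names; the choice σ_t² = β_t is one common option and is an assumption here, and `eps_hat` stands in for the network's output.

```python
import math
import random

def reverse_step(x_t, eps_hat, beta_t, alpha_bar_t, alpha_bar_prev, rng, add_noise=True):
    """One sampling step x_t -> x_{t-1}, given the predicted noise eps_hat."""
    alpha_t = 1.0 - beta_t
    # Estimated clean image from the predicted noise.
    x0_hat = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
              for x, e in zip(x_t, eps_hat)]
    # Mean of the reverse distribution: a weighted blend of x0_hat and x_t.
    w0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    wt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    mean = [w0 * h + wt * x for h, x in zip(x0_hat, x_t)]
    if not add_noise:
        return mean
    sigma = math.sqrt(beta_t)   # assumed choice: sigma_t^2 = beta_t
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

rng = random.Random(0)
beta_t, alpha_bar_prev = 0.01, 0.6
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)
x_t = [0.3, -1.2]
eps_hat = [0.1, -0.4]           # stand-in for the network output
mean = reverse_step(x_t, eps_hat, beta_t, alpha_bar_t, alpha_bar_prev, rng, add_noise=False)
x_prev = reverse_step(x_t, eps_hat, beta_t, alpha_bar_t, alpha_bar_prev, rng)
```

Substituting x̂₀ into the mean and simplifying gives the equivalent shortcut (x_t - β_t / √(1 - ᾱ_t) · ε_θ) / √(α_t), which is the form often seen in sampling code.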

Iterative Recovery

The model feeds the slightly cleaner image back into itself, guesses the next layer of noise, and subtracts it again.

This loop repeats for dozens or hundreds of steps. Out of the static, vague shapes begin to form. Edges sharpen. Colors separate. What began as pure mathematical randomness collapses into a coherent image.

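Training minimizes a variational bound built from Kullback-Leibler divergences between each learned reverse step and the true denoising posterior, so a well-trained model moves the sample closer to the data distribution at every step. The process closely resembles Langevin dynamics: noisy gradient ascent on the data log-density.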
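The whole loop can be run end to end on a toy problem. The sketch below assumes a dataset containing a single "image" (`target`), for which the ideal noise prediction has a closed form; `oracle_noise` is therefore a stand-in for a trained network, and the schedule values are illustrative.

```python
import math
import random

T = 200
betas = [1e-4 + i * (0.05 - 1e-4) / (T - 1) for i in range(T)]  # assumed toy schedule
alpha_bars, prod = [], 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bars.append(prod)

target = [0.8, -0.3, 0.1]   # the only "image" in this toy dataset

def oracle_noise(x_t, t):
    """Stand-in for the network: the exact noise when all data equals `target`."""
    a, b = math.sqrt(alpha_bars[t - 1]), math.sqrt(1.0 - alpha_bars[t - 1])
    return [(x - a * y) / b for x, y in zip(x_t, target)]

def sample(rng):
    x = [rng.gauss(0.0, 1.0) for _ in target]           # start from pure noise x_T
    for t in range(T, 0, -1):
        beta, a_bar = betas[t - 1], alpha_bars[t - 1]
        a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
        eps_hat = oracle_noise(x, t)
        x0_hat = [(v - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
                  for v, e in zip(x, eps_hat)]
        w0 = math.sqrt(a_bar_prev) * beta / (1.0 - a_bar)
        wt = math.sqrt(1.0 - beta) * (1.0 - a_bar_prev) / (1.0 - a_bar)
        x = [w0 * h + wt * v for h, v in zip(x0_hat, x)]
        if t > 1:                                        # no noise on the final step
            sigma = math.sqrt(beta)
            x = [v + sigma * rng.gauss(0.0, 1.0) for v in x]
    return x

result = sample(random.Random(0))
```

With a perfect predictor the loop ends on `target` up to floating-point error. A real model has only an approximate predictor and a rich dataset, so each run lands on a different plausible image instead.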

Accelerated Sampling

Standard DDPM sampling typically uses 1000 steps, which is far too slow for real-time use. Researchers developed faster methods.

DDIM (Denoising Diffusion Implicit Models) treats the diffusion as a non-Markovian process. It can skip timesteps and reach comparable quality in 50 steps.
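A deterministic DDIM update (the η = 0 case) can be sketched as below. It assumes the same ᾱ notation as earlier, an illustrative linear schedule, and a hypothetical `ddim_step` helper; `eps_hat` is a stand-in for the network output.

```python
import math

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    # Estimate the clean image, then re-noise it to the target level
    # using the same predicted noise instead of fresh randomness.
    x0_hat = [(x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
              for x, e in zip(x_t, eps_hat)]
    return [math.sqrt(alpha_bar_prev) * h + math.sqrt(1.0 - alpha_bar_prev) * e
            for h, e in zip(x0_hat, eps_hat)]

T = 1000
alpha_bars, prod = [], 1.0
for i in range(T):
    prod *= 1.0 - (1e-4 + i * (0.02 - 1e-4) / (T - 1))   # assumed linear schedule
    alpha_bars.append(prod)

# Visit only 50 of the 1000 timesteps: 1000, 980, ..., 20.
timesteps = list(range(T, 0, -T // 50))

x_t = [0.3, -1.2]
eps_hat = [0.1, -0.4]    # stand-in for the network output at t = 1000
x_next = ddim_step(x_t, eps_hat,
                   alpha_bars[timesteps[0] - 1],
                   alpha_bars[timesteps[1] - 1])
```

Because the update depends only on the two noise levels, nothing forces them to be adjacent, which is what allows the sampler to skip timesteps.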

Consistency Models learn to map any noise level directly to the clean image in a single step. This enables real-time generation but requires more training.

Stable Diffusion typically uses 20–50 steps with a scheduler like PNDM or Euler.

Chapter 3

Guiding the Dream

If we let the model reverse noise blindly, it would hallucinate a random image. How do we make it draw exactly what we want? We use conditioning.

The most common form is text conditioning: you type "a futuristic city at sunset," and a text encoder (like CLIP) converts those words into a sequence of embedding vectors c.

This embedding is injected into the U-Net at multiple layers, steering the denoising toward images that match the text description.

The Cross-Attention Mechanism

The text vectors are injected using cross-attention layers inside the U-Net. This is the same attention mechanism from transformers — but now operating between text and image modalities.

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

The image features serve as Queries. The text embeddings serve as Keys and Values. Where text and image features align strongly, the attention weight is high — and the image is "pulled" toward that semantic concept.
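The attention formula can be traced by hand on tiny inputs. The sketch below uses plain Python lists and made-up numbers; in a real model the queries, keys, and values come from learned linear projections, which are omitted here.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all key/value rows."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights

# Two image positions (queries) attending over three text tokens (keys/values).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys    = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values  = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
outputs, weights = cross_attention(queries, keys, values)
```

Here the first image position aligns with the first text token and the second with the second, so each output row is dominated by the matching token's value vector.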

Classifier-Free Guidance

Simple conditioning often produces bland, averaged results. To increase fidelity and adherence to the prompt, models use Classifier-Free Guidance (CFG).

During training, the model learns two behaviors: with conditioning (ε_θ(x, t, c)) and without (ε_θ(x, t, ∅)).

At inference, the predicted noise is extrapolated away from the unconditional prediction:

ε_guided = ε_unc + w · (ε_cond - ε_unc)

The guidance scale w (typically 7–12.5) controls how strongly the prompt influences the result. Setting w = 1 recovers the plain conditional prediction; higher w gives more prompt fidelity but less diversity.
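The guidance formula is a one-line extrapolation. In the sketch below the two noise predictions are made-up numbers standing in for two forward passes of the network, and `guided_noise` is a hypothetical helper name.

```python
def guided_noise(eps_uncond, eps_cond, w):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.2, -0.1, 0.0]   # prediction with an empty prompt
eps_cond   = [0.3,  0.1, 0.0]   # prediction with the text prompt

no_guidance = guided_noise(eps_uncond, eps_cond, w=0.0)   # ignores the prompt entirely
plain_cond  = guided_noise(eps_uncond, eps_cond, w=1.0)   # the conditional prediction
amplified   = guided_noise(eps_uncond, eps_cond, w=7.5)   # pushed well past it
```

Note that each sampling step now needs two network evaluations, one with the prompt and one without.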

Latent Space

Doing this math on millions of high-resolution pixels is incredibly slow. A 1024×1024 RGB image has 3.1 million values.

Modern systems like Stable Diffusion solve this with a Variational Autoencoder (VAE). The VAE encoder compresses the image into a latent space: a compact, lower-dimensional representation (typically 64×64 with 4 channels for a 512×512 input).

The entire diffusion process — forward and reverse — happens in this compressed latent space. Only at the final step does the VAE decoder "unzip" the latent tensor back into visible pixels.

For a 512×512 RGB image, this shrinks the data the denoiser must process by roughly 48× (786,432 values down to 16,384) while preserving perceptual quality.
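The compression factor is simple arithmetic. The check below assumes a 512×512 RGB input and a 64×64 latent with 4 channels, the configuration commonly associated with Stable Diffusion.

```python
pixel_values = 512 * 512 * 3     # RGB image, assumed 512x512 input
latent_values = 64 * 64 * 4      # 8x smaller per side, 4 latent channels
ratio = pixel_values / latent_values
print(pixel_values, latent_values, ratio)   # 786432 16384 48.0

large_image_values = 1024 * 1024 * 3        # the 1024x1024 case: about 3.1 million
```

The factor describes the size of the tensor being denoised; the actual speedup depends on the network architecture as well.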

Timestep Embeddings

The network must know how noisy the current image is. This is communicated through timestep embeddings.

The timestep t is converted into a sinusoidal position encoding — the same technique used in transformers. This encoding is added to the network activations at multiple layers.

Without this, the network would have no way to distinguish a slightly noisy image from a heavily corrupted one. The timestep tells it: "be aggressive with denoising" at high t, and "be gentle and refine details" at low t.
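A sinusoidal timestep embedding can be sketched in a few lines. The layout (all sines followed by all cosines) and the `max_period` of 10000 are assumptions borrowed from common transformer practice; implementations differ in these details.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep into `dim` values (dim must be even)."""
    half = dim // 2
    # Frequencies fall geometrically from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

low_noise = timestep_embedding(10, dim=8)     # near the end of sampling
high_noise = timestep_embedding(900, dim=8)   # near the start of sampling
```

The two vectors differ across many dimensions at once, which gives the network an easy signal to condition on. In practice the embedding is usually passed through a small learned network before being added to the activations.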

Conclusion

Putting It All Together

Diffusion models are not simply "stitching together" parts of images they memorized. They are performing mathematical sculpting in high-dimensional space.

They combine: (1) a forward process that destroys data along a known, tractable path, (2) a neural network that learns to reverse this destruction by predicting noise, (3) cross-attention mechanisms that steer generation with semantic text vectors, and (4) a latent space that makes the entire process computationally tractable.

This architecture powers the visual AI revolution — from generating marketing assets to designing novel proteins in structural biology.
