How Diffusion Models Create Images

From Gaussian noise to photorealistic art: the mathematics of visual AI.

Milos · June 5, 2026 · Interactive article

Diffusion models power the visual AI revolution: DALL-E, Stable Diffusion, Midjourney, and other image generators that are reshaping design, marketing, and biotechnology. But how do they actually create images from nothing?

This article explains the whole mechanism with interactive visualizations. As you scroll, the graphics animate each step: from the forward process, which destroys images with Gaussian noise, to the reverse process, which reconstructs them through iterative denoising, to the conditioning mechanisms that steer generation with text prompts.

Chapter 1

The Forward Process

To teach an AI how to build, we must first teach it how to destroy.

The model takes a crisp photograph and subjects it to a mathematical process of slow, deliberate degradation. This is the forward diffusion process.

At each timestep $t$, the forward process injects Gaussian noise according to a variance schedule $\beta_t$. The image becomes progressively noisier until it reaches pure entropy.

Noise Injection

The noise injection follows a precise mathematical formula:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

At t = 1, the image is slightly grainy. At t = 250, details blur. At t = 500, the subject is barely recognizable.

The variance schedule controls how aggressively noise is added. A linear schedule adds noise evenly. A cosine schedule (used in improved models) preserves structure longer, then accelerates destruction.
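
The difference between the two schedules is easiest to see in numbers. The sketch below is a minimal illustration in plain Python, not any particular library's implementation. The linear endpoints (1e-4 to 0.02) and the cosine offset s = 0.008 are assumed values commonly quoted in the literature; the article does not specify them, and the function names are hypothetical.

```python
import math

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (endpoints are assumed values)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bars(betas):
    """alpha_bar_t = product over i <= t of (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bars(T, s=0.008):
    """alpha_bar_1..alpha_bar_T defined directly by a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

T = 1000
linear_bars = cumulative_alpha_bars(linear_betas(T))
cosine_bars = cosine_alpha_bars(T)

# Columns: timestep, signal fraction left (linear), signal fraction left (cosine).
for t in (250, 500, 750):
    print(t, round(linear_bars[t - 1], 4), round(cosine_bars[t - 1], 4))
```

Each row shows how much of the original signal variance survives at that timestep. Under these assumed settings the cosine column stays higher through the middle of the process, which is what "preserves structure longer" means in practice.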

The Reparameterization Trick

Instead of adding noise step-by-step, we can jump directly to any timestep t using the reparameterization trick:

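$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$$

Here $\bar\alpha_t = \prod_{i=1}^{t} (1-\beta_i)$ is the cumulative product of the per-step factors $(1-\beta_i)$, and $\epsilon \sim \mathcal{N}(0, I)$ is random Gaussian noise.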

This trick is what makes training diffusion models computationally feasible. We can sample any noise level in a single step during training.
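
To see why the shortcut is legitimate, the sketch below compares it with the slow route. It assumes a linear schedule (the article gives no concrete values), and `q_sample` and `stepwise_coefficients` are hypothetical helper names. The second function applies the one-step rule repeatedly and tracks only the two numbers that matter: how much the original signal has been scaled and how much noise variance has accumulated.

```python
import math
import random

T = 1000
# Assumed linear schedule; the article does not give concrete beta values.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def q_sample(x0, t, rng):
    """Draw x_t from q(x_t | x_0) in one shot (t is 1-indexed)."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def stepwise_coefficients(t):
    """Apply q(x_i | x_{i-1}) t times, tracking signal scale and noise variance."""
    scale, noise_var = 1.0, 0.0
    for beta in betas[:t]:
        scale *= math.sqrt(1.0 - beta)
        noise_var = (1.0 - beta) * noise_var + beta
    return scale, noise_var

rng = random.Random(0)
x0 = [0.8, -0.3, 0.1]                      # a made-up three-"pixel" image
x_500, eps = q_sample(x0, 500, rng)

scale, noise_var = stepwise_coefficients(500)
print(scale, math.sqrt(alpha_bars[499]))   # signal scale: step-by-step vs closed form
print(noise_var, 1.0 - alpha_bars[499])    # noise variance: step-by-step vs closed form
```

Both printed pairs agree up to floating-point rounding, so 500 small noising steps and one large jump describe the same distribution.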

Pure Entropy

By timestep T = 1000, the original photograph is completely obliterated. The canvas looks exactly like the static on a broken television screen — pure isotropic Gaussian noise.

At this point, the image contains essentially no information about the original. Every pixel is, to a very good approximation, an independent draw from a standard normal distribution: $x_T \sim \mathcal{N}(0, I)$.

Why do this? Because destroying the image step by step defines a known path from "structured data" to "pure randomness," with a training example available at every noise level along the way. Learning to retrace that path is how the model learns the geometry of image space.

Chapter 2

The Reverse Process

Now comes the actual generation. We give the AI a brand-new canvas of pure random static. Its job is to reverse the entropy.

The model learns the reverse distribution $p_\theta(x_{t-1} \mid x_t)$: the probability of a slightly less noisy image given a noisy one. This is the core of diffusion.

But it does not try to guess the final image all at once. The reverse process is iterative, stochastic, and fundamentally Bayesian.

The Neural Engine

To predict the reverse step, the model uses a massive neural network. Traditionally this is a U-Net — an encoder-decoder architecture with skip connections.

The U-Net processes the noisy image through downsampling blocks (encoder), passes through a bottleneck with self-attention, then reconstructs through upsampling blocks (decoder). Skip connections preserve spatial detail.

Modern systems increasingly use a Diffusion Transformer (DiT) instead. DiT patches the image into tokens, applies transformer blocks with adaptive layer norm, and outputs denoised patches.
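
The "patching" step is simple enough to show directly. This is a hedged, framework-free sketch on a single-channel image stored as nested lists; `patchify` is a hypothetical helper name. Real DiT implementations additionally pass each patch through a learned linear projection and add positional embeddings, which are omitted here.

```python
def patchify(image, patch):
    """Split an H x W image into flattened patch x patch tokens, row by row."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return [
        [image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
        for r in range(0, height, patch)
        for c in range(0, width, patch)
    ]

image = [[4 * r + c for c in range(4)] for r in range(4)]  # pixel values 0..15
tokens = patchify(image, 2)
print(len(tokens), tokens[0])  # 4 tokens; the first is [0, 1, 4, 5]
```

A 4×4 image with 2×2 patches becomes a sequence of four tokens, which is the form a transformer expects.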

Noise Prediction

Here is the critical insight: instead of predicting the clean image directly, the network predicts the noise that was added.

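$$\epsilon_\theta(x_t, t) \approx \epsilon$$

The network looks at the noisy image $x_t$ and timestep $t$, and outputs its best guess of the noise vector $\epsilon$.

This formulation is equivalent, up to a scale factor, to denoising score matching: the network learns the gradient of the log probability density (the score function) of the noised data distribution at each noise level, $\nabla_{x_t} \log q(x_t) \approx -\epsilon_\theta(x_t, t) / \sqrt{1-\bar\alpha_t}$.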
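
In training terms, "predict the noise" usually becomes a mean-squared-error loss between the true and predicted noise. The sketch below assumes that simplified objective and a linear schedule, neither of which the article spells out. The two "models" are hypothetical stand-ins for a neural network: one always predicts zero, the other cheats by knowing the clean image.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def noise_prediction_loss(model, x0, rng):
    """Mean squared error between the true noise and the model's guess."""
    t = rng.randint(1, T)                      # pick a random noise level
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]    # the noise hidden inside x_t
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    eps_pred = model(x_t, t)
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

x0 = [0.8, -0.3, 0.1]                          # a made-up three-"pixel" image

def zero_model(x_t, t):
    """Placeholder "network" that always predicts no noise."""
    return [0.0 for _ in x_t]

def oracle_model(x_t, t):
    """Cheats by knowing x0, so it can solve for the noise exactly."""
    a_bar = alpha_bars[t - 1]
    return [(x - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar)
            for x, c in zip(x_t, x0)]

zero_loss = noise_prediction_loss(zero_model, x0, random.Random(0))
oracle_loss = noise_prediction_loss(oracle_model, x0, random.Random(0))
print(zero_loss, oracle_loss)
```

A real network sits between these extremes: it never sees the clean image, so it has to infer the noise from the statistics of images it was trained on.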

The Denoising Step

Once the noise is predicted, we compute the estimated clean image:

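$$\hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$$

Then we compute the mean of the reverse distribution, where $\alpha_t = 1 - \beta_t$:

$$\mu_\theta = \frac{\sqrt{\bar\alpha_{t-1}}\, \beta_t\, \hat{x}_0 + \sqrt{\alpha_t}\, (1-\bar\alpha_{t-1})\, x_t}{1-\bar\alpha_t}$$

Finally, we sample the next step with a small amount of stochasticity: $x_{t-1} \sim \mathcal{N}(\mu_\theta, \sigma_t^2 I)$.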
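
The three formulas above translate almost line for line into code. This is a minimal sketch rather than a reference implementation: it assumes a linear schedule, takes the noise prediction as an input instead of calling a network, and uses $\sigma_t^2 = \beta_t$, which is one common choice among several. `reverse_step` is a hypothetical name.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def reverse_step(x_t, eps_pred, t, rng):
    """One reverse step x_t -> x_{t-1} (t is 1-indexed). Returns (x_prev, x0_hat)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar = alpha_bars[t - 1]
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0

    # 1. Estimated clean image from the predicted noise.
    x0_hat = [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
              for x, e in zip(x_t, eps_pred)]

    # 2. Mean of the reverse distribution: a blend of x0_hat and x_t.
    c0 = math.sqrt(a_bar_prev) * beta_t / (1.0 - a_bar)
    ct = math.sqrt(alpha_t) * (1.0 - a_bar_prev) / (1.0 - a_bar)
    mean = [c0 * h + ct * x for h, x in zip(x0_hat, x_t)]

    # 3. Add fresh noise, except on the very last step.
    if t == 1:
        return mean, x0_hat
    sigma = math.sqrt(beta_t)   # assumption: sigma_t^2 = beta_t
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean], x0_hat

# Demo: noise a made-up image to t = 500, then hand the step the *true* noise.
rng = random.Random(0)
x0 = [0.8, -0.3, 0.1]
t = 500
eps = [rng.gauss(0.0, 1.0) for _ in x0]
a_bar = alpha_bars[t - 1]
x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]

x_prev, x0_hat = reverse_step(x_t, eps, t, rng)
print(x0_hat)   # matches x0 up to rounding, because the noise guess was perfect
```

With a perfect noise prediction the estimated clean image is exact. A trained network is only approximately right, which is why the sampler takes many small steps instead of trusting a single estimate.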

Iterative Recovery

The model feeds the slightly cleaner image back into itself, guesses the next layer of noise, and subtracts it again.

This loop repeats for dozens or hundreds of steps. Out of the static, vague shapes begin to form. Edges sharpen. Colors separate. What began as pure mathematical randomness collapses into a coherent image.

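Training minimizes a variational bound built from Kullback-Leibler divergences between each learned reverse step and the true denoising posterior, so every step moves the sample toward regions the data distribution considers likely. The sampling loop resembles annealed Langevin dynamics: noisy gradient ascent on the log-density of the data, carried out at progressively lower noise levels.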
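
The whole loop fits in a few lines if we shrink the problem. The sketch below is a toy under stated assumptions: the "dataset" is a one-dimensional Gaussian with mean 2.0 and standard deviation 0.5, for which the ideal noise prediction has a closed form, so `ideal_eps` (a hypothetical name) stands in for a trained network. The linear schedule and the choice $\sigma_t^2 = \beta_t$ are also assumptions.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5   # toy 1-D "dataset": x0 ~ N(2.0, 0.5^2)

def ideal_eps(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a_bar = alpha_bars[t - 1]
    var_t = a_bar * DATA_STD ** 2 + (1.0 - a_bar)   # variance of x_t
    return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * DATA_MEAN) / var_t

def sample(rng):
    """Run all T reverse steps, starting from pure noise."""
    x = rng.gauss(0.0, 1.0)
    for t in range(T, 0, -1):
        a_bar, beta_t = alpha_bars[t - 1], betas[t - 1]
        a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
        x0_hat = (x - math.sqrt(1.0 - a_bar) * ideal_eps(x, t)) / math.sqrt(a_bar)
        mean = (math.sqrt(a_bar_prev) * beta_t * x0_hat
                + math.sqrt(1.0 - beta_t) * (1.0 - a_bar_prev) * x) / (1.0 - a_bar)
        noise = math.sqrt(beta_t) * rng.gauss(0.0, 1.0) if t > 1 else 0.0
        x = mean + noise
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(400)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(round(mean, 2), round(std, 2))   # compare with DATA_MEAN and DATA_STD
```

Every sample starts as standard normal noise, yet the printed statistics land close to those of the toy dataset. Image models do the same thing in millions of dimensions, with a network supplying the noise estimate.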

Accelerated Sampling

Standard DDPM sampling runs one network evaluation per training timestep, typically 1000, which is far too slow for real-time use. Researchers developed faster methods.

DDIM (Denoising Diffusion Implicit Models) treats the diffusion as a non-Markovian process. It can skip timesteps and reach comparable quality in 50 steps.
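
Here is a hedged sketch of the deterministic DDIM update (the variant with no added noise), again on a toy problem rather than real images. It assumes a linear schedule and one-dimensional Gaussian data with mean 2.0 and standard deviation 0.5, so that the hypothetical `ideal_eps` can play the role of a trained network. The key difference from DDPM is the stride: only every 20th timestep is visited.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed schedule
alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5   # toy 1-D "dataset"

def ideal_eps(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a_bar = alpha_bars[t - 1]
    var_t = a_bar * DATA_STD ** 2 + (1.0 - a_bar)
    return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * DATA_MEAN) / var_t

def ddim_sample(x_T, num_steps=50):
    """Deterministic DDIM sampling that visits only num_steps timesteps."""
    steps = list(range(T, 0, -(T // num_steps)))     # 1000, 980, ..., 20
    x = x_T
    for i, t in enumerate(steps):
        a_bar = alpha_bars[t - 1]
        # Signal fraction at the next visited timestep; 1.0 means fully clean.
        a_next = alpha_bars[steps[i + 1] - 1] if i + 1 < len(steps) else 1.0
        eps = ideal_eps(x, t)
        x0_hat = (x - math.sqrt(1.0 - a_bar) * eps) / math.sqrt(a_bar)
        # Re-noise the clean estimate to the next level using the same eps.
        x = math.sqrt(a_next) * x0_hat + math.sqrt(1.0 - a_next) * eps
    return x

rng = random.Random(0)
samples = [ddim_sample(rng.gauss(0.0, 1.0)) for _ in range(2000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(round(mean, 2), round(std, 2))   # compare with DATA_MEAN and DATA_STD
```

Because no fresh noise is injected, the same starting noise always produces the same output. That determinism is what makes it safe to skip timesteps.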

Consistency Models learn to map any noise level directly to the clean image in a single step. This enables real-time generation but requires more training.

Stable Diffusion typically uses 20–50 steps with a scheduler like PNDM or Euler.

Chapter 3

Guiding the Dream

If we let the model reverse noise blindly, it would hallucinate a random image. How do we make it draw exactly what we want? We use conditioning.

The most common form is text conditioning: you type "a futuristic city at sunset," and a text encoder (like CLIP) converts those words into a sequence of embedding vectors c, one per token.

These embeddings are injected into the U-Net at multiple layers, steering the denoising toward images that match the text description.

The Cross-Attention Mechanism

The text vectors are injected using cross-attention layers inside the U-Net. This is the same attention mechanism from transformers — but now operating between text and image modalities.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

The image features serve as Queries. The text embeddings serve as Keys and Values. Where text and image features align strongly, the attention weight is high — and the image is "pulled" toward that semantic concept.
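
A tiny worked example makes the "pull" concrete. This sketch implements the attention formula in plain Python for two image patches and three text tokens. All numbers are made up, and the learned projection matrices that real models apply to produce Q, K, and V are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: one row per image patch; keys/values: one row per text token."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)                      # how much each token matters
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

queries = [[4.0, 0.0], [0.0, 4.0]]               # two image patches (made up)
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # three text tokens (made up)
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

The first patch puts most of its weight on the first token and the second patch on the second token, because those are the keys their queries align with.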

Classifier-Free Guidance

Simple conditioning often produces bland, averaged results. To increase fidelity and adherence to the prompt, models use Classifier-Free Guidance (CFG).

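During training, the model learns two behaviors: prediction with conditioning, $\epsilon_\theta(x, t, c)$, and without it, $\epsilon_\theta(x, t, \varnothing)$. This is usually done by replacing the prompt with an empty one for a random fraction of training examples.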

At inference, the predicted noise is extrapolated away from the unconditional prediction:

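$$\epsilon_{\text{guided}} = \epsilon_{\text{unc}} + w \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{unc}})$$

The guidance scale $w$ (typically 7–12.5) controls how strongly the prompt influences the result: $w = 1$ recovers the plain conditional prediction, and higher $w$ gives more prompt fidelity but less diversity.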
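
The guidance formula is a one-liner in code. The sketch below uses made-up two-element noise predictions, and `guided_noise` is a hypothetical name.

```python
def guided_noise(eps_uncond, eps_cond, w):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.2, -0.1]   # made-up prediction without the prompt
eps_cond = [0.5, 0.1]      # made-up prediction with the prompt

for w in (0.0, 1.0, 7.5):
    print(w, guided_noise(eps_uncond, eps_cond, w))
```

At w = 0 the prompt is ignored and at w = 1 the output is the ordinary conditional prediction. At w = 7.5 the result overshoots the conditional prediction in the direction the prompt suggests, which is the extrapolation the formula describes.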

Latent Space

Doing this math on millions of high-resolution pixels is incredibly slow. A 1024×1024 RGB image has 3.1 million values.

Modern systems like Stable Diffusion solve this with a Variational Autoencoder (VAE). The VAE encoder compresses the image into a latent space: a compact, lower-dimensional representation (typically 64×64 with 4 channels for a 512×512 image).

The entire diffusion process — forward and reverse — happens in this compressed latent space. Only at the final step does the VAE decoder "unzip" the latent tensor back into visible pixels.

This reduces the number of values processed at each denoising step by roughly 48× while preserving perceptual quality.
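
The 48× figure is plain arithmetic. The sketch below assumes a VAE that shrinks each spatial side by a factor of 8 and keeps 4 latent channels, as in Stable Diffusion-style models; both numbers are assumptions, and other autoencoders use different ones.

```python
def count_values(height, width, channels):
    """Total number of scalar values in a tensor of this shape."""
    return height * width * channels

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent grid under an assumed 8x spatial downsampling and 4 channels."""
    return height // downsample, width // downsample, latent_channels

for size in (512, 1024):
    pixels = count_values(size, size, 3)            # RGB image
    latent = latent_shape(size, size)
    ratio = pixels / count_values(*latent)
    print(size, pixels, latent, ratio)
```

For both sizes the ratio works out to 48: a 512×512 image drops from 786,432 values to 16,384, and a 1024×1024 image from 3,145,728 to 65,536.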

Timestep Embeddings

The network must know how noisy the current image is. This is communicated through timestep embeddings.

The timestep t is converted into a sinusoidal position encoding — the same technique used in transformers. This encoding is added to the network activations at multiple layers.

Without this, the network would have no way to distinguish a slightly noisy image from a heavily corrupted one. The timestep tells it: "be aggressive with denoising" at high t, and "be gentle and refine details" at low t.
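
A sinusoidal encoding turns the single number t into a vector of sines and cosines at different frequencies. The sketch below is one common layout, stated as an assumption: sines first, then cosines, with frequencies spaced geometrically down from 1 toward 1/10000. Implementations differ in ordering and scaling, and `timestep_embedding` is a hypothetical name.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Encode timestep t as dim numbers: dim/2 sines followed by dim/2 cosines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

for t in (0, 10, 500):
    print(t, [round(v, 3) for v in timestep_embedding(t, 8)])
```

Nearby timesteps get similar vectors and distant ones get clearly different vectors, which gives the network a smooth, unambiguous signal for how much noise to expect.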

Conclusion

Putting It All Together

Diffusion models are not simply "stitching together" parts of images they memorized. They are performing mathematical sculpting in high-dimensional space.

They combine: (1) a forward process that learns the geometry of data by destroying it, (2) a neural network that learns to reverse this destruction by predicting noise, (3) cross-attention mechanisms that steer generation with semantic text vectors, and (4) a latent space that makes the entire process computationally tractable.

This architecture powers the visual AI revolution — from generating marketing assets to designing novel proteins in structural biology.

Diffusion models embody one of the most elegant ideas in modern machine learning: learning by destruction. By systematically adding noise to data and learning the reverse process, these models capture the full statistical structure of image distributions, not only surface patterns but also the deep geometry that separates a coherent image from random noise.

For organizations in regulated industries, understanding these fundamentals matters. Whether you are evaluating AI-generated content for regulatory submissions, assessing the reliability of synthetic training data, or building quality frameworks for generative AI, knowing how diffusion works helps you ask better questions and manage risk.

The math is accessible. The implications are enormous. And the field is moving fast, from faster samplers to 3D generation to video diffusion. The principles you have just explored will remain the foundation.

About the author: Milos is a consultant at Excellence Consulting by Mashup, specializing in AI governance, regulatory compliance, and generative AI strategy for regulated industries.