Fun With Diffusion Models!

Yueheng Zeng @ Project 5

Overview

This project focuses on implementing a diffusion model to generate images. A diffusion model is a generative model that produces an image by iteratively denoising a pure-noise input, and it is trained so that its samples resemble the training data. The project is divided into two parts. In Part 5A, we implement the forward process of the diffusion model and explore different denoising techniques using a pre-trained diffusion model. In Part 5B, we train a single-step denoising UNet and extend it to include time-conditioning and class-conditioning.

Course Logo Generated and Upsampled by Diffusion Model

Octo... Dog? 🤣

Table of Contents

  1. 5A Part 0: Setup
  2. 5A Part 1.1: Implementing the Forward Process
  3. 5A Part 1.2: Classical Denoising
  4. 5A Part 1.3: One-Step Denoising
  5. 5A Part 1.4: Iterative Denoising
  6. 5A Part 1.5: Diffusion Model Sampling
  7. 5A Part 1.6: Classifier-Free Guidance (CFG)
  8. 5A Part 1.7: Image-to-image Translation
  9. 5A Part 1.7.1: Editing Hand-Drawn and Web Images
  10. 5A Part 1.7.2: Inpainting
  11. 5A Part 1.7.3: Text-Conditional Image-to-image Translation
  12. 5A Part 1.8: Visual Anagrams
  13. 5A Part 1.9: Hybrid Images
  14. 5A Bells & Whistles: A Course Logo
  15. 5B Part 1: Training a Single-Step Denoising UNet
  16. 5B Part 2.1: Adding Time-Conditioning to UNet
  17. 5B Part 2.4: Adding Class-Conditioning to UNet
  18. 5B Bells & Whistles: Sampling Gifs

5A Part 0: Setup

In this part, we set up the environment, load the pre-trained diffusion model, and generate images using random seed 180.
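For reference, here is a minimal sketch of the sampling setup, assuming the Hugging Face diffusers API; the checkpoint name, dtype, and device are illustrative assumptions, not the exact starter-code values:

```python
import torch
from diffusers import DiffusionPipeline

# Load a pre-trained text-to-image pipeline (checkpoint name is an assumption).
pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", torch_dtype=torch.float16
).to("cuda")

# Fix the seed so results are reproducible across runs.
generator = torch.Generator(device="cuda").manual_seed(180)

image = pipe(
    "an oil painting of a snowy mountain village",
    num_inference_steps=20,  # we also compare against 40 steps below
    generator=generator,
).images[0]
```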

Generated Images with num_inference_steps=20

Generated Images with num_inference_steps=40

We can see that the quality of the generated images is quite good: they match the text prompts closely, though some artifacts are present. After increasing the number of inference steps, there is actually not much difference in quality; the defects in the images are still present.

5A Part 1.1: Implementing the Forward Process

In this part, we implement the forward process of the diffusion model, which takes a clean image \(x_0\) and produces a noisy image \(x_t\): \[ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \quad \text{where } \epsilon \sim N(0, I) \]
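A minimal sketch of this forward step, assuming `alphas_cumprod` is the model's precomputed \(\bar\alpha\) schedule (the variable name is an assumption):

```python
import torch

def forward(im: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Add noise to a clean image im (x_0) to produce x_t.

    Implements x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I).
    """
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)  # fresh Gaussian noise, same shape as the image
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```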

Generated Images at t = 250, 500, and 750

5A Part 1.2: Classical Denoising

In this part, we use the classical Gaussian blur filter to denoise the images.
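As a sketch, the classical baseline is just a Gaussian blur from torchvision; the kernel size and sigma below are illustrative choices, not tuned values:

```python
import torchvision.transforms.functional as TF

# Blur the noisy image from Part 1.1: averaging neighborhoods removes some
# noise, but it blurs away real detail at the same time.
denoised = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```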

Noisy Image at t = 250 with Gaussian Blur Denoising

Noisy Image at t = 500 with Gaussian Blur Denoising

Noisy Image at t = 750 with Gaussian Blur Denoising

5A Part 1.3: One-Step Denoising

In this part, we use the pre-trained diffusion model to denoise the images with one step.
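A sketch of the one-step estimate, assuming a diffusers-style UNet noise predictor and precomputed prompt embeddings (all names here are assumptions):

```python
import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, prompt_embeds, alphas_cumprod):
    """Estimate the clean image x_0 from x_t in a single step."""
    abar_t = alphas_cumprod[t]
    # Predict the noise that was added to the clean image.
    eps = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    # Invert the forward process: x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    return (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()
```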

Noisy Image at t = 250 with One-Step Denoising

Noisy Image at t = 500 with One-Step Denoising

Noisy Image at t = 750 with One-Step Denoising

5A Part 1.4: Iterative Denoising

In this part, we iteratively denoise the images with the pre-trained diffusion model.
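One update of the loop over strided timesteps might look like the sketch below, which follows the standard DDPM posterior mean; the variable names are assumptions:

```python
def iterative_step(x_t, x0_est, t, t_prev, alphas_cumprod):
    """One DDPM update from timestep t down to the smaller timestep t_prev.

    The full implementation also adds a predicted-variance term to this mean.
    """
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev  # effective alpha for this stride
    beta_t = 1 - alpha_t
    # Posterior mean: a blend of the clean-image estimate (x0_est, from
    # one-step denoising) and the current noisy image x_t.
    return (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_est \
         + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t
```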

Predicted Noisy Images at t = 690, 540, and 390 with Iterative Denoising

Predicted Noisy Images at t = 240 and 90 with Iterative Denoising

Test Image, Iterative Denoised Image, One-Step Denoised Image, and Gaussian Blur Denoised Image

5A Part 1.5: Diffusion Model Sampling

In this part, we sample images from the diffusion model.

Generated Images

5A Part 1.6: Classifier-Free Guidance (CFG)

In this part, we use classifier-free guidance to guide the diffusion model while generating images with the prompt "a high quality photo".
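CFG computes both a conditional and an unconditional noise estimate and extrapolates past the unconditional one: \( \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) \). A sketch, assuming a diffusers-style UNet; the guidance scale value is illustrative:

```python
def cfg_noise(unet, x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    """Classifier-free guidance: push the conditional noise estimate away
    from the unconditional one by a factor gamma (> 1 strengthens the prompt)."""
    eps_cond = unet(x_t, t, encoder_hidden_states=cond_embeds).sample
    eps_uncond = unet(x_t, t, encoder_hidden_states=uncond_embeds).sample
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```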

Generated Images with CFG

We can see that the generated images have higher quality than the images generated without CFG.

5A Part 1.7: Image-to-image Translation

In this part, we take the original test image, noise it a little, and force it back onto the natural image manifold without any conditioning; this is the SDEdit procedure.
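A sketch of the procedure, reusing the forward function from Part 1.1 and the iterative denoiser from Part 1.4 (helper names are assumptions):

```python
def sdedit(im, i_start, strided_timesteps, alphas_cumprod, iterative_denoise):
    """Noise the input to the i_start-th strided timestep, then denoise.

    Larger i_start means less added noise, so the result stays closer to the
    input image; smaller i_start gives the model more freedom.
    """
    t = strided_timesteps[i_start]
    x_t = forward(im, t, alphas_cumprod)  # forward process from Part 1.1
    return iterative_denoise(x_t, i_start)
```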

Campanile

SDEdit with i_start = 1, 3, 5, 7

SDEdit with i_start = 10, 20, 30, and Original Image

Mong Kok

SDEdit with Mong Kok and i_start = 1, 3, 5, 7

SDEdit with Mong Kok and i_start = 10, 20, 30, and Original Mong Kok Image

Victoria Harbour

SDEdit with Victoria Harbour and i_start = 1, 3, 5, 7

SDEdit with Victoria Harbour and i_start = 10, 20, 30, and Original Victoria Harbour Image

5A Part 1.7.1: Editing Hand-Drawn and Web Images

In this part, we apply SDEdit to nonrealistic images (e.g., a painting, a sketch, some scribbles) to make them look more realistic.

Hand-Drawn Image: Merry Cat-mas!

SDEdit with Merry Cat-mas! and i_start = 1, 3, 5, 7

SDEdit with Merry Cat-mas! and i_start = 10, 20, 30, and Original Merry Cat-mas! Image

Hand-Drawn Image: Monster

SDEdit with Monster and i_start = 1, 3, 5, 7

SDEdit with Monster and i_start = 10, 20, 30, and Original Monster Image

Web Image: Minecraft

SDEdit with Minecraft and i_start = 1, 3, 5, 7

SDEdit with Minecraft and i_start = 10, 20, 30, and Original Minecraft Image

5A Part 1.7.2: Inpainting

In this part, we implement inpainting. To do this, we run the diffusion denoising loop as before, but at every step, after obtaining the denoised image, we force the pixels that we do not want to change back to the original image (noised to the matching timestep).
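Inside the loop, the masking step might look like this, where the mask `m` is 1 on the region to regenerate and 0 on the pixels to keep (variable names are assumptions):

```python
# After computing the denoised image x_prev for timestep t_prev:
# keep the newly generated content inside the mask, and force everything
# outside it back to the original image, noised to the matching timestep.
x_prev = m * x_prev + (1 - m) * forward(x_orig, t_prev, alphas_cumprod)
```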

Campanile

Original Image, Mask, and Inpainted Image

Street

Original Street Image, Street Mask, and Inpainted Street Image

Victoria Harbour

Original Victoria Harbour Image, Victoria Harbour Mask, and Inpainted Victoria Harbour Image

5A Part 1.7.3: Text-Conditional Image-to-image Translation

In this part, we do the same thing as SDEdit, but guide the projection with a text prompt. This is no longer a pure "projection to the natural image manifold"; it also adds control using language.

Campanile -> "a rocket ship"

SDEdit with "a rocket ship" and i_start = 1, 3, 5, 7

SDEdit with "a rocket ship" and i_start = 10, 20, 30, and Original Image

Octocat -> "a photo of a dog"

SDEdit with "a photo of a dog" and i_start = 1, 3, 5, 7

SDEdit with "a photo of a dog" and i_start = 10, 20, 30, and Original Octocat Image

Hoover Tower -> "a rocket ship"

SDEdit with "a rocket ship" and i_start = 1, 3, 5, 7

SDEdit with "a rocket ship" and i_start = 10, 20, 30, and Original Hoover Tower Image

5A Part 1.8: Visual Anagrams

In this part, we are going to implement Visual Anagrams: an image that shows one thing right-side up and another when flipped upside down. The only modification to the original iterative denoising is that we calculate \(\epsilon\) as follows: \[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)) \] \[ \epsilon = (\epsilon_1 + \epsilon_2) / 2 \]
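A sketch of this noise estimate, assuming a diffusers-style UNet and precomputed embeddings for the two prompts:

```python
import torch

def anagram_noise(unet, x_t, t, p1_embeds, p2_embeds):
    """Average the estimate for prompt 1 with the flipped estimate for
    prompt 2, so the image denoises toward both readings at once."""
    eps1 = unet(x_t, t, encoder_hidden_states=p1_embeds).sample
    flipped = torch.flip(x_t, dims=[-2])  # flip upside down (height axis)
    eps2 = torch.flip(
        unet(flipped, t, encoder_hidden_states=p2_embeds).sample, dims=[-2]
    )
    return (eps1 + eps2) / 2
```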

An Oil Painting of People Around a Campfire and An Oil Painting of an Old Man

An Oil Painting of a Snowy Mountain Village and An Oil Painting of People Around a Campfire

A Dreamy Oil Painting of a Crescent Moon Cradling a Sleeping Figure and An Ocean Wave Crashing Against a Lighthouse

5A Part 1.9: Hybrid Images

In this part, we are going to implement Factorized Diffusion and create hybrid images that look like one thing up close and another thing from far away. The only modification to the original iterative denoising is that we will calculate \(\epsilon\) as follows: \[ \epsilon_1 = \text{UNet}(x_t, t, p_1) \] \[ \epsilon_2 = \text{UNet}(x_t, t, p_2) \] \[ \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2) \]
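A sketch, using a Gaussian blur as the low-pass filter and its residual as the high-pass; the kernel size and sigma are illustrative assumptions:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(unet, x_t, t, p1_embeds, p2_embeds, ksize=33, sigma=2.0):
    """Low frequencies follow prompt 1, high frequencies follow prompt 2."""
    eps1 = unet(x_t, t, encoder_hidden_states=p1_embeds).sample
    eps2 = unet(x_t, t, encoder_hidden_states=p2_embeds).sample
    low = TF.gaussian_blur(eps1, kernel_size=ksize, sigma=sigma)
    high = eps2 - TF.gaussian_blur(eps2, kernel_size=ksize, sigma=sigma)
    return low + high
```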

Hybrid Lithograph of a Skull (Low) and Waterfalls (High)

Hybrid Image of an Ancient Clock Face (Low) and Historical Moments (High)

Hybrid Oil Painting of an Old Man (Low) and a Snowy Mountain Village (High)

5A Bells & Whistles: A Course Logo

In this part, we are going to design a course logo using the diffusion model with the prompt "A man whose head is a camera of brand CS180".

Course Logo (Upsampled)

The man in the logo looks cool! However, the CS180 brand does not appear on the camera (likely because the string "CS180" does not occur in the model's training data).

5B Part 1: Training a Single-Step Denoising UNet

In this part, we are going to train a single-step denoising UNet to denoise the digits in the MNIST dataset. Firstly, we will need to implement the noising process defined as follows: \[ z = x + \sigma \epsilon,\quad \text{where }\epsilon \sim N(0, I) \]
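The noising process itself is a one-liner; a sketch:

```python
import torch

def add_noise(x: torch.Tensor, sigma: float) -> torch.Tensor:
    """Implements z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```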

Varying levels of noise on MNIST digits

Then we will train a single-step denoising UNet to denoise the noisy digits at \(\sigma = 0.5\).
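One training step might look like the sketch below, regressing the clean image with an L2 loss; `denoiser`, `opt`, and the batch `x` are assumed to be set up elsewhere:

```python
import torch.nn.functional as F

z = add_noise(x, sigma=0.5)        # noising process from above
loss = F.mse_loss(denoiser(z), x)  # L2 loss against the clean digits
opt.zero_grad()
loss.backward()
opt.step()
```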

Training Loss per Batch

Results on digits from the test set after 1 epoch of training

Results on digits from the test set after 5 epochs of training

We can see that the denoising UNet denoises the noisy digits effectively after 5 epochs of training. However, what if we ask it to denoise digits at noise levels it was not trained on?

Results on digits from the test set with varying noise levels

We can see that the denoising UNet cannot denoise the digits effectively at noise levels it was not trained on, especially when the noise level is high.

5B Part 2.1: Adding Time-Conditioning to UNet

In this part, we are going to add time-conditioning to the UNet, making it a diffusion model. Firstly, we add noise with the following equation: \[ x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1 - \bar\alpha_t} \epsilon \quad \text{where}~ \epsilon \sim N(0, I) \] And our objective is to minimize the following loss function: \[ L = \mathbb{E}_{\epsilon,x_0,t} \|\epsilon_{\theta}(x_t, t) - \epsilon\|^2 \]
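One training step under this objective might look like the following sketch; `T`, `alphas_cumprod`, and the normalization of `t` before conditioning are assumptions based on the standard DDPM recipe:

```python
import torch
import torch.nn.functional as F

# Sample a random timestep per image, noise the batch with the forward
# process, and train the UNet to predict the injected noise.
t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
eps = torch.randn_like(x0)
abar = alphas_cumprod[t].view(-1, 1, 1, 1)
x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
loss = F.mse_loss(unet(x_t, t.float() / T), eps)  # condition on normalized t
opt.zero_grad()
loss.backward()
opt.step()
```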

Training Loss per Batch

After training the diffusion model, we can sample high-quality digits from the model iteratively.
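A sketch of the sampling loop, following the standard DDPM reverse step; the schedule construction and time normalization here are assumptions:

```python
import torch

@torch.no_grad()
def sample(unet, betas, T=300, shape=(16, 1, 28, 28)):
    """Start from pure noise and apply the learned reverse step T times."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in range(T - 1, -1, -1):
        t_norm = torch.full((shape[0],), t / T)  # normalized time input
        eps = unet(x, t_norm)                    # predicted noise
        a, abar = alphas[t], alphas_cumprod[t]
        mean = (x - (1 - a) / (1 - abar).sqrt() * eps) / a.sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * z           # no noise at the final step
    return x
```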

After 5 epochs
After 20 epochs

Digits sampled iteratively from the time-conditioned UNet after 5 and 20 epochs of training

5B Part 2.4: Adding Class-Conditioning to UNet

In this part, we are going to add class-conditioning to the UNet, enabling us to specify which digit we want to generate.
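A sketch of the conditioning setup: the class label becomes a one-hot vector that is randomly zeroed out during training so the model also learns the unconditional distribution. The drop probability of 0.1 is a typical choice and an assumption here:

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()  # one-hot class vectors
keep = (torch.rand(c.shape[0], device=c.device) >= 0.1).float()
c = c * keep.unsqueeze(1)                      # zero out ~10% of the vectors
# The UNet is then called as unet(x_t, t_norm, c); a zero vector means
# "unconditional".
```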

Training Loss per Batch

After training the class-conditioned diffusion model, we choose which digit to generate and sample high-quality digits from the model iteratively.
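At sampling time we apply classifier-free guidance between the chosen class and the zero vector, inside the same loop as in Part 2.1; the guidance scale is illustrative:

```python
# Inside the sampling loop for the class-conditioned model:
eps_cond = unet(x, t_norm, c)                       # conditioned on the digit
eps_uncond = unet(x, t_norm, torch.zeros_like(c))   # unconditional estimate
eps = eps_uncond + gamma * (eps_cond - eps_uncond)  # e.g., gamma = 5.0
```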

After 5 epochs
After 20 epochs

Digits sampled iteratively from the class-conditioned UNet after 5 and 20 epochs of training

5B Bells & Whistles: Sampling Gifs

The sampling GIFs for this part were already generated as part of the previous parts.