Technical Report
Abstract
This report investigates the method presented in the paper <Boosting 3D Object Generation through PBR Materials>. The Stable Diffusion model and its components are discussed first; the report then walks through the implementation details of fine-tuning the Stable Diffusion model, the UNet in particular, to predict the Albedo layer from the Beauty render image.
Literature / Resources
- Understand convolution layers, neural networks, positional encoding, etc.: Deep Learning Specialization
- Brief on Diffusion models with code: How Diffusion Models Work
- Understand Autoencoder and VAE: From Autoencoder to Beta-VAE
- Attention mechanism: Attention in Transformers: Concepts and Code in PyTorch
- Diffusion models with annotations and code: Diffusion Models
- DDPM equation: DDPM YouTube
Extra: - Understand GANs and many other loss functions, latent space, data augmentation, etc., to broaden knowledge and compare GANs with SD for better understanding. GAN networks are easier to code, and I am familiar with them from the PyTorch GAN Specialization.
Introduction
To implement the idea from the paper, understanding the Stable Diffusion model is a must.
Text Encoder
(see the Deep Learning Specialization coursework)
VAE (Variational Autoencoder)
Autoencoder
An autoencoder is a neural network designed to learn an identity function in an unsupervised way: it reconstructs the original input while compressing the data in the process, so as to discover a more efficient, compressed representation.
It consists of two networks:
- Encoder network: It translates the original high-dimensional input into a low-dimensional latent code. The input size is larger than the output size.
- Decoder network: The decoder network recovers the data from the code, likely with larger and larger output layers.
(from Lil'Log)
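As a minimal sketch of this encoder/decoder structure (layer sizes are illustrative, not from the paper), an MLP autoencoder in PyTorch:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: high-dimensional input -> low-dimensional latent code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: latent code -> reconstruction, layers grow back to input size
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # compressed representation
        return self.decoder(z)    # reconstruction of the input

# Reconstruction loss against the input itself (unsupervised)
model = Autoencoder()
x = torch.randn(16, 784)
loss = nn.functional.mse_loss(model(x), x)
```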
VAE
A VAE maps the input to a distribution rather than to a fixed vector.
Compared to an autoencoder, two more features have been added to the VAE:
- Probabilistic latent space
- KL divergence loss (The KL divergence loss encourages the learned distribution N(μ, σ²) to be close to a standard normal distribution )
Instead of outputting a single $z$ vector from the encoder, it outputs a distribution consisting of a mean vector $\mu$ and a standard deviation vector $\sigma$.
The latent vector $z$ is then sampled from this distribution; in practice this uses the reparameterization trick so that gradients can flow through the sampling step (see the sketch below).
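A minimal sketch of the two VAE-specific pieces described above: reparameterized sampling and the closed-form KL term against a standard normal. Shapes and names are illustrative:

```python
import torch

def reparameterize(mu, log_var):
    # Sample z ~ N(mu, sigma^2) in a differentiable way:
    # z = mu + sigma * eps, with eps ~ N(0, I)
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps

def kl_divergence(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)

mu, log_var = torch.zeros(8, 32), torch.zeros(8, 32)  # encoder outputs (illustrative)
z = reparameterize(mu, log_var)
kl = kl_divergence(mu, log_var).mean()
```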
UNet
Encoder - bottleneck - decoder, with skip connections passing encoder features to the matching decoder stages (see the sketch below).
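A minimal sketch of that encoder-bottleneck-decoder shape with one skip connection; channel counts are illustrative, and the actual Stable Diffusion UNet additionally has attention blocks and timestep conditioning:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=4, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        # Decoder sees upsampled features concatenated with the skip connection
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base, in_ch, 3, padding=1)

    def forward(self, x):
        s1 = self.enc1(x)              # encoder, full resolution
        h = self.enc2(s1)              # downsample
        h = self.bottleneck(h)
        h = self.up(h)                 # upsample back
        h = torch.cat([h, s1], dim=1)  # skip connection
        return self.out(self.dec1(h))

x = torch.randn(1, 4, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 4, 64, 64])
```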
Scheduler
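The scheduler defines how noise is added to the latents at each timestep during training and removed again at inference. A minimal sketch of the training-side forward process using diffusers' DDPMScheduler (the scheduler choice and shapes here are assumptions, not taken from the paper):

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

latents = torch.randn(4, 4, 64, 64)  # clean latents (illustrative shape)
noise = torch.randn_like(latents)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (4,))

# Forward process: produce x_t from x_0 in one step via the closed-form q(x_t | x_0)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)
```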
Data Preparation
Megascan Synthetic data
Blender script
One script automatically reads in Megascan data, loads it into Blender, and automatically links the Albedo, Roughness, and Normal maps. Environment lighting is implemented, and the camera and HDRI are randomly rotated spherically.
The outputs are one final render image, called Beauty; one Albedo image, which contains only the color information of the asset; and one Normal image, which represents the object normals of the asset (note: the paper seems to use world normals; TBC to distinguish the difference). The key bpy pieces are sketched below.
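A hedged sketch of those pieces: a world HDRI with a random spherical rotation, a camera placed at a random point on a sphere, and the render passes backing the Albedo and Normal outputs. Node and property names follow Blender 2.8+; paths are placeholders, and the asset-import and file-output wiring is omitted:

```python
import math
import random
import bpy

# --- HDRI environment with a random rotation ---
world = bpy.context.scene.world
world.use_nodes = True
nodes, links = world.node_tree.nodes, world.node_tree.links
env = nodes.new("ShaderNodeTexEnvironment")
env.image = bpy.data.images.load("/path/to/hdri.exr")  # placeholder path
mapping = nodes.new("ShaderNodeMapping")
tex_coord = nodes.new("ShaderNodeTexCoord")
links.new(tex_coord.outputs["Generated"], mapping.inputs["Vector"])
links.new(mapping.outputs["Vector"], env.inputs["Vector"])
links.new(env.outputs["Color"], nodes["Background"].inputs["Color"])
mapping.inputs["Rotation"].default_value[2] = random.uniform(0, 2 * math.pi)

# --- Camera at a random point on a sphere around the origin ---
cam = bpy.context.scene.camera  # assumes a camera already exists in the scene
theta = random.uniform(0, 2 * math.pi)   # azimuth
phi = math.acos(random.uniform(-1, 1))   # inclination, uniform over the sphere
r = 3.0
cam.location = (r * math.sin(phi) * math.cos(theta),
                r * math.sin(phi) * math.sin(theta),
                r * math.cos(phi))
# Aim the camera at the asset, e.g. via a Track To constraint (not shown)

# --- Passes backing the Albedo and Normal outputs ---
view_layer = bpy.context.view_layer
view_layer.use_pass_diffuse_color = True  # Albedo-like diffuse color pass
view_layer.use_pass_normal = True         # normal pass (space depends on engine; see note above)

bpy.ops.render.render(write_still=True)
```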
Batch process
Set up the folder that contains all raw data, the HDRI path, and the output directory, then launch Blender headless once per asset (sketch below).
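A hedged sketch of the batch driver; the script name, paths, and CLI arguments are hypothetical, and everything after `--` is passed through to the Blender script via `sys.argv`:

```python
import subprocess
from pathlib import Path

RAW_DATA_DIR = Path("/data/megascans")     # folder containing all raw data (placeholder)
HDRI_PATH = Path("/data/hdri/studio.exr")  # HDRI path (placeholder)
OUTPUT_DIR = Path("/data/renders")         # output directory (placeholder)

for asset_dir in sorted(RAW_DATA_DIR.iterdir()):
    if not asset_dir.is_dir():
        continue
    # "render_asset.py" is the per-asset Blender script sketched above (hypothetical name)
    subprocess.run([
        "blender", "--background", "--python", "render_asset.py", "--",
        "--asset", str(asset_dir),
        "--hdri", str(HDRI_PATH),
        "--out", str(OUTPUT_DIR / asset_dir.name),
    ], check=True)
```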
HyperSim dataset
Methodologies
For this study, I based the training code on Marigold and the pipeline code on the paper's released code.
cat_latents = torch.cat([noisy_latents, beauty_latent], dim=1)  # concatenate along the channel dimension
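Concatenating the beauty latent with the noisy latents along the channel dimension doubles the UNet's input channels (4 to 8), so the pretrained conv_in layer has to be widened. A sketch following Marigold's initialization scheme, which duplicates the pretrained input weights and halves them so the initial activation magnitudes match the original model; the checkpoint name here is illustrative:

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="unet"  # illustrative checkpoint
)

old = unet.conv_in  # Conv2d(4, 320, ...)
new = torch.nn.Conv2d(
    old.in_channels * 2, old.out_channels,
    kernel_size=old.kernel_size, stride=old.stride, padding=old.padding,
)
with torch.no_grad():
    # Duplicate pretrained weights along the input-channel axis and halve them,
    # so the widened layer initially behaves like the pretrained 4-channel one
    new.weight.copy_(old.weight.repeat(1, 2, 1, 1) * 0.5)
    new.bias.copy_(old.bias)
unet.conv_in = new
unet.register_to_config(in_channels=new.in_channels)
```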