Technical Report
Abstract
This report investigates the method presented in the paper <Boosting 3D Object Generation through PBR Materials>. The Stable Diffusion model and its components are discussed first; the report then walks through the implementation details of fine-tuning the Stable Diffusion model, the UNet in particular, to predict the Albedo layer from the Beauty render image.
Literature / Resources
- Understand convolution layers, neural networks, positional encoding, etc.: Deep Learning Specialization
- Brief on Diffusion models with code: How Diffusion Models Work
- Understand Autoencoder and VAE: From Autoencoder to Beta-VAE
- Attention mechanism: Attention in Transformers: Concepts and Code in PyTorch
- Diffusion models with annotations and code: Diffusion Models
- DDPM equation: DDPM YouTube
Extra: - Understand GANs and many other loss functions, latent space, data augmentation, etc., to broaden knowledge and compare GANs with SD for better understanding. GAN networks are easier to code, and I am familiar with them from the PyTorch GAN Specialization.
Introduction
To implement the idea from the paper, understanding the Stable Diffusion model is a must.
Text Encoder
(see the Deep Learning Specialization coursework)
VAE (Variational Autoencoder)
Autoencoder
An autoencoder is a neural network designed to learn an identity function in an unsupervised way: it reconstructs the original input while compressing the data in the process, so as to discover a more efficient, compressed representation.
It consists of two networks:
- Encoder network: It translates the original high-dimensional input into a low-dimensional latent code. The input size is larger than the output size.
- Decoder network: The decoder network recovers the data from the code, likely with larger and larger output layers.
(from Lil'Log)
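As a minimal sketch of this encoder/decoder structure (layer sizes are illustrative, not from the paper), an MLP autoencoder in PyTorch:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: high-dimensional input -> low-dimensional latent code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: latent code -> reconstruction, layers grow back to input size
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # compressed representation
        return self.decoder(z)    # reconstruction of the input

# Reconstruction loss against the input itself (unsupervised)
model = Autoencoder()
x = torch.randn(16, 784)
loss = nn.functional.mse_loss(model(x), x)
```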
VAE
A VAE maps the input to a distribution rather than to a fixed vector.
Compared to an autoencoder, two more features have been added to the VAE:
- Probabilistic latent space
- KL divergence loss (The KL divergence loss encourages the learned distribution N(μ, σ²) to be close to a standard normal distribution )
Instead of outputting a single $z$ vector from the encoder, it outputs a distribution consisting of a mean vector $\mu$ and a standard deviation vector $\sigma$.
The latent vector $z$ is then sampled from this distribution; in practice this uses the reparameterization trick so that gradients can flow through the sampling step (see the sketch below).
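A minimal sketch of the two VAE-specific pieces described above: reparameterized sampling and the closed-form KL term against a standard normal. Shapes and names are illustrative:

```python
import torch

def reparameterize(mu, log_var):
    # Sample z ~ N(mu, sigma^2) in a differentiable way:
    # z = mu + sigma * eps, with eps ~ N(0, I)
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps

def kl_divergence(mu, log_var):
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)

mu, log_var = torch.zeros(8, 32), torch.zeros(8, 32)  # encoder outputs (illustrative)
z = reparameterize(mu, log_var)
kl = kl_divergence(mu, log_var).mean()
```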
UNet
Encoder - bottleneck - decoder, with skip connections passing encoder features to the matching decoder stages (see the sketch below).
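A minimal sketch of that encoder-bottleneck-decoder shape with one skip connection; channel counts are illustrative, and the actual Stable Diffusion UNet additionally has attention blocks and timestep conditioning:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=4, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        # Decoder sees upsampled features concatenated with the skip connection
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base, in_ch, 3, padding=1)

    def forward(self, x):
        s1 = self.enc1(x)              # encoder, full resolution
        h = self.enc2(s1)              # downsample
        h = self.bottleneck(h)
        h = self.up(h)                 # upsample back
        h = torch.cat([h, s1], dim=1)  # skip connection
        return self.out(self.dec1(h))

x = torch.randn(1, 4, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 4, 64, 64])
```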
Scheduler
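The scheduler defines how noise is added to the latents at each timestep during training and removed again at inference. A minimal sketch of the training-side forward process using diffusers' DDPMScheduler (the scheduler choice and shapes here are assumptions, not taken from the paper):

```python
import torch
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)

latents = torch.randn(4, 4, 64, 64)  # clean latents (illustrative shape)
noise = torch.randn_like(latents)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (4,))

# Forward process: produce x_t from x_0 in one step via the closed-form q(x_t | x_0)
noisy_latents = scheduler.add_noise(latents, noise, timesteps)
```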
Data Preparation
Megascan Synthetic data
Blender script
One script automatically reads in Megascan data, loads it into Blender, and automatically links the Albedo, Roughness, and Normal maps. Environment lighting is implemented, and the camera and HDRI are randomly rotated spherically.
The outputs are one final render image, called Beauty; one Albedo image, which contains only the color information of the asset; and one Normal image, which represents the object normals of the asset (note: the paper seems to use world normals; TBC to distinguish the difference). The key bpy pieces are sketched below.
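A hedged sketch of those pieces: a world HDRI with a random spherical rotation, a camera placed at a random point on a sphere, and the render passes backing the Albedo and Normal outputs. Node and property names follow Blender 2.8+; paths are placeholders, and the asset-import and file-output wiring is omitted:

```python
import math
import random
import bpy

# --- HDRI environment with a random rotation ---
world = bpy.context.scene.world
world.use_nodes = True
nodes, links = world.node_tree.nodes, world.node_tree.links
env = nodes.new("ShaderNodeTexEnvironment")
env.image = bpy.data.images.load("/path/to/hdri.exr")  # placeholder path
mapping = nodes.new("ShaderNodeMapping")
tex_coord = nodes.new("ShaderNodeTexCoord")
links.new(tex_coord.outputs["Generated"], mapping.inputs["Vector"])
links.new(mapping.outputs["Vector"], env.inputs["Vector"])
links.new(env.outputs["Color"], nodes["Background"].inputs["Color"])
mapping.inputs["Rotation"].default_value[2] = random.uniform(0, 2 * math.pi)

# --- Camera at a random point on a sphere around the origin ---
cam = bpy.context.scene.camera  # assumes a camera already exists in the scene
theta = random.uniform(0, 2 * math.pi)   # azimuth
phi = math.acos(random.uniform(-1, 1))   # inclination, uniform over the sphere
r = 3.0
cam.location = (r * math.sin(phi) * math.cos(theta),
                r * math.sin(phi) * math.sin(theta),
                r * math.cos(phi))
# Aim the camera at the asset, e.g. via a Track To constraint (not shown)

# --- Passes backing the Albedo and Normal outputs ---
view_layer = bpy.context.view_layer
view_layer.use_pass_diffuse_color = True  # Albedo-like diffuse color pass
view_layer.use_pass_normal = True         # normal pass (space depends on engine; see note above)

bpy.ops.render.render(write_still=True)
```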
Batch process
Set up the folder that contains all raw data, the HDRI path, and the output directory, then launch Blender headless once per asset (sketch below).
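A hedged sketch of the batch driver; the script name, paths, and CLI arguments are hypothetical, and everything after `--` is passed through to the Blender script via `sys.argv`:

```python
import subprocess
from pathlib import Path

RAW_DATA_DIR = Path("/data/megascans")     # folder containing all raw data (placeholder)
HDRI_PATH = Path("/data/hdri/studio.exr")  # HDRI path (placeholder)
OUTPUT_DIR = Path("/data/renders")         # output directory (placeholder)

for asset_dir in sorted(RAW_DATA_DIR.iterdir()):
    if not asset_dir.is_dir():
        continue
    # "render_asset.py" is the per-asset Blender script sketched above (hypothetical name)
    subprocess.run([
        "blender", "--background", "--python", "render_asset.py", "--",
        "--asset", str(asset_dir),
        "--hdri", str(HDRI_PATH),
        "--out", str(OUTPUT_DIR / asset_dir.name),
    ], check=True)
```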
HyperSim dataset
Methodologies
For this study, I based the training code on Marigold and the pipeline code on the paper's released code.
cat_latents = torch.cat([noisy_latents, beauty_latent], dim=1)  # concatenate along the channel dimension
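Concatenating the beauty latent with the noisy latents along the channel dimension doubles the UNet's input channels (4 to 8), so the pretrained conv_in layer has to be widened. A sketch following Marigold's initialization scheme, which duplicates the pretrained input weights and halves them so the initial activation magnitudes match the original model; the checkpoint name here is illustrative:

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="unet"  # illustrative checkpoint
)

old = unet.conv_in  # Conv2d(4, 320, ...)
new = torch.nn.Conv2d(
    old.in_channels * 2, old.out_channels,
    kernel_size=old.kernel_size, stride=old.stride, padding=old.padding,
)
with torch.no_grad():
    # Duplicate pretrained weights along the input-channel axis and halve them,
    # so the widened layer initially behaves like the pretrained 4-channel one
    new.weight.copy_(old.weight.repeat(1, 2, 1, 1) * 0.5)
    new.bias.copy_(old.bias)
unet.conv_in = new
unet.register_to_config(in_channels=new.in_channels)
```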