Technical Report

Abstract

This report investigates the method presented in the paper <Boosting 3D Object Generation through PBR Materials>. The Stable Diffusion model and its components are discussed first; the report then walks through the implementation details of fine-tuning the Stable Diffusion model, the UNet in particular, to predict the Albedo layer from a Beauty render image.


Literature / Resources


Introduction

If we want to implement the idea from the paper, understanding the Stable Diffusion model is a must.

Text Encoder

(deep learning specialization coursework)

VAE (Variational Autoencoder)

Autoencoder

An autoencoder is a neural network that learns an identity function in an unsupervised way: it reconstructs the original input while compressing the data in the process, so as to discover a more efficient, compressed representation.

It consists of two networks:

  • Encoder network: translates the original high-dimensional input into a low-dimensional latent code. The input size is larger than the output size.
  • Decoder network: recovers the data from the code, typically with progressively larger output layers.

(from Lil’ Blog)
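The shape asymmetry described above can be sketched with plain linear maps. This is a minimal, untrained NumPy sketch; the layer sizes and weight initialization are illustrative, not from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 784-dim input compressed to a 32-dim code.
d_in, d_hidden, d_code = 784, 128, 32

# Encoder weights: each layer shrinks the representation.
W_enc1 = rng.normal(size=(d_in, d_hidden)) * 0.01
W_enc2 = rng.normal(size=(d_hidden, d_code)) * 0.01
# Decoder weights: the mirror image, each layer grows it back.
W_dec1 = rng.normal(size=(d_code, d_hidden)) * 0.01
W_dec2 = rng.normal(size=(d_hidden, d_in)) * 0.01

def encode(x):
    # High-dimensional input -> low-dimensional latent code.
    return np.tanh(np.tanh(x @ W_enc1) @ W_enc2)

def decode(z):
    # Latent code -> reconstruction at the original dimensionality.
    return np.tanh(z @ W_dec1) @ W_dec2

x = rng.normal(size=(1, d_in))
z = encode(x)          # latent code, shape (1, 32)
x_hat = decode(z)      # reconstruction, shape (1, 784)
```

Training would minimize a reconstruction loss such as `np.mean((x - x_hat) ** 2)`; only the shape flow is shown here.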

VAE

Maps the input to a distribution rather than a single point.

Compared to the autoencoder, two more features have been added in the VAE:

  1. Probabilistic latent space
  2. KL divergence loss (the KL divergence loss encourages the learned distribution $N(\mu, \sigma^2)$ to be close to a standard normal distribution)

Instead of outputting a single $z$ vector, the encoder outputs a distribution described by a mean vector $\mu$ and a standard deviation vector $\sigma$.
The latent vector $z$ is then sampled from this distribution.
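The sampling step is usually implemented with the reparameterization trick, $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim N(0, I)$, so gradients can flow through $\mu$ and $\sigma$; the KL term above has a closed form against the standard normal. A minimal NumPy sketch (batch and latent sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the encoder produced these for a batch of 2, latent dim 4.
mu = rng.normal(size=(2, 4))
log_var = rng.normal(size=(2, 4))  # predicting log(sigma^2) keeps sigma positive
sigma = np.exp(0.5 * log_var)

# Reparameterization trick: sample noise, then shift and scale it.
eps = rng.normal(size=mu.shape)
z = mu + sigma * eps               # differentiable w.r.t. mu and sigma

# Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
# summed over latent dims, averaged over the batch. Always >= 0.
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1).mean()
```

In training, `kl` is added to the reconstruction loss, pulling the latent distribution toward the standard normal.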

UNet

encoder - bottleneck - decoder
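The encoder – bottleneck – decoder shape flow, with skip connections from each encoder stage to the matching decoder stage, can be traced with plain arrays. This is a shape-only sketch; the channel counts are illustrative, not Stable Diffusion's actual ones:

```python
import numpy as np

def down(x):
    """Stand-in for a conv + downsample stage: halve spatial size, double channels."""
    n, c, h, w = x.shape
    return np.zeros((n, c * 2, h // 2, w // 2))

def up(x, skip):
    """Stand-in for an upsample stage: double spatial size, halve channels,
    then concatenate the skip connection along the channel axis."""
    n, c, h, w = x.shape
    upsampled = np.zeros((n, c // 2, h * 2, w * 2))
    return np.concatenate([upsampled, skip], axis=1)

x = np.zeros((1, 4, 64, 64))      # e.g. a 4-channel latent
e1 = down(x)                      # encoder stage:   (1, 8, 32, 32)
bottleneck = down(e1)             # bottleneck:      (1, 16, 16, 16)
d1 = up(bottleneck, e1)           # decoder stage 1: (1, 16, 32, 32)
d2 = up(d1, x)                    # decoder stage 2: (1, 12, 64, 64)
```

The point of the sketch: the decoder restores the original spatial resolution, and the skip connections reinject high-resolution encoder features at each stage.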

Scheduler


Data Preparation

Megascan Synthetic data

blender script
One script automatically reads in Megascan data, loads it into Blender, and links the Albedo, Roughness and Normal maps. Environment lighting is set up, and the camera and HDRI are randomly rotated on a sphere.
The outputs are: one final render image, called Beauty; one Albedo image, which contains only the color information of the asset; and one Normal image, which represents the object-space normals of the asset (note: the paper seems to use world-space normals; TBC, distinguish the difference).
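The random spherical rotation mentioned above amounts to sampling a uniform direction on the unit sphere. This is not the actual Blender script; it is a sketch of just the sampling math, with the function name chosen for illustration (in Blender these angles would drive the camera orbit and the world/HDRI rotation):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_spherical_angles(rng):
    """Uniformly sample a direction on the unit sphere, returned as
    (azimuth, polar) angles in radians.

    Sampling the polar angle via arccos of a uniform variable avoids the
    pole-clustering you get from sampling that angle uniformly."""
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    polar = np.arccos(rng.uniform(-1.0, 1.0))  # in [0, pi]
    return azimuth, polar

az, pol = random_spherical_angles(rng)
# Convert to a Cartesian direction; it should be unit length.
direction = np.array([
    np.sin(pol) * np.cos(az),
    np.sin(pol) * np.sin(az),
    np.cos(pol),
])
norm = np.linalg.norm(direction)
```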

batch process
Set up the folder that contains all the raw data, the HDRI path, and the output directory.

HyperSim dataset


Methodologies

For this study, I based the implementation on the training code from Marigold and the pipeline code from the paper's code release.

cat_latents = torch.cat([noisy_latents, beauty_latent], dim=1)  # concatenate along the channel dimension
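This line stacks the noisy target latent and the Beauty-image latent along the channel dimension, so the UNet's first convolution must accept twice the usual number of input channels (the Marigold-style conditioning scheme). A NumPy shape sketch, since `torch.cat(..., dim=1)` behaves like `np.concatenate(..., axis=1)`; the latent shapes are illustrative:

```python
import numpy as np

# Stand-ins for two 4-channel latents of the same spatial size:
noisy_latents = np.zeros((2, 4, 64, 64))   # noisy Albedo latent being denoised
beauty_latent = np.zeros((2, 4, 64, 64))   # conditioning Beauty-render latent

# Channel-wise concatenation, the NumPy analogue of torch.cat(..., dim=1).
cat_latents = np.concatenate([noisy_latents, beauty_latent], axis=1)
# Result is (2, 8, 64, 64): the UNet input conv must take 8 channels, not 4.
```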

Analysis, Testing and Results


Conclusion