Technical Report
Introduction
- Overview: fine-tune the UNet part of a Stable Diffusion model
- Goal: use a Beauty image to infer the corresponding Albedo image
- Beauty: usually refers to the final rendered image from 3D software, which includes lighting conditions, textures, and PBR shaders
- Albedo: the texture that contains only color information
- Other info:
- Based on the Marigold project, at this commit
- Pipeline code adapted from the paper's code
Core Idea
VAE encode -> latent -> UNet (with transformer/attention blocks between its conv blocks) -> VAE decode
What is a VAE, and how does it differ from a plain autoencoder
What is a UNet
What are the transformer and attention blocks inside it, what is the attention equation, and where do the text embeddings enter (see below)
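For reference, the attention equation used by these blocks is the standard scaled dot-product attention; in Stable Diffusion's cross-attention layers, $Q$ comes from the image latent features while $K$ and $V$ come from the CLIP text embeddings:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the keys.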
Process
This section describes what each step does; the code is kept in a separate document so that the text and the code line up. The main point is that the reasoning must flow smoothly.
start the training
↓
loop each epoch
↓
loop each batch
↓
1. Forward
2. Loss
3. Backward propagation
4. Update parameters (optimizer.step())
↓
every certain number of steps → call _train_step_callback()
↓
check whether to:
- save a checkpoint (save_period)
- do validation (val_period)
- do visualization (vis_period)
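A minimal sketch of this loop in Python; forward_and_loss, save_checkpoint, validate, and visualize are hypothetical helpers standing in for the real code:

```python
# A minimal sketch of the training loop above; the helpers are hypothetical.
def train(model, dataloader, optimizer, cfg, n_epochs):
    step = 0
    for epoch in range(n_epochs):                  # loop each epoch
        for batch in dataloader:                   # loop each batch
            loss = forward_and_loss(model, batch)  # 1. forward  2. loss
            loss.backward()                        # 3. backward propagation
            optimizer.step()                       # 4. update parameters
            optimizer.zero_grad()
            step += 1
            _train_step_callback(step, cfg)        # checks the periods below

def _train_step_callback(step, cfg):
    if step % cfg.save_period == 0:
        save_checkpoint(step)  # save checkpoint (save_period)
    if step % cfg.val_period == 0:
        validate()             # do validation (val_period)
    if step % cfg.vis_period == 0:
        visualize()            # do visualization (vis_period)
```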
Seed:
Three different seeds are applied:
- training seed
- validation seed
- dataloader seed
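A small sketch of how the three seeds might be applied; the seed values and generator usage are illustrative:

```python
# A small sketch of applying the three seeds; the values are placeholders.
import random
import numpy as np
import torch

TRAIN_SEED, VAL_SEED, LOADER_SEED = 0, 1, 2

# training seed: fix the global python / numpy / torch RNGs
random.seed(TRAIN_SEED)
np.random.seed(TRAIN_SEED)
torch.manual_seed(TRAIN_SEED)

# validation seed: a dedicated generator so validation noise is reproducible
val_rng = torch.Generator().manual_seed(VAL_SEED)

# dataloader seed: a dedicated generator controlling shuffling, e.g.
# DataLoader(dataset, shuffle=True, generator=loader_rng)
loader_rng = torch.Generator().manual_seed(LOADER_SEED)
```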
Dataset
Overview: the Dataset is where the training data is loaded. I apply normalization here (while others prefer to keep it separate)
Constructor:
When preparing the training dataset, a “filename_list” file is created, containing the paired dataset information (which beauty file -> which albedo file)
In the constructor, training data/images are read according to the “filename_list”.
In the __getitem__ method:
Here we retrieve the corresponding beauty and albedo image paths, load the images using PIL, and then convert them to tensors. All tensors are normalized to [-1, 1] (see the sketch below).
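A minimal sketch of the dataset, assuming "filename_list" stores one "beauty_path albedo_path" pair per line; the class and attribute names are illustrative:

```python
# A minimal sketch; assumes filename_list has one "beauty albedo" pair per line.
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class BeautyAlbedoDataset(Dataset):
    def __init__(self, filename_list_path, root_dir):
        self.root = root_dir
        with open(filename_list_path) as f:
            self.pairs = [line.split() for line in f if line.strip()]

    def __len__(self):
        return len(self.pairs)

    def _load(self, rel_path):
        img = Image.open(f"{self.root}/{rel_path}").convert("RGB")
        t = torch.from_numpy(np.array(img)).permute(2, 0, 1).float()
        return t / 127.5 - 1.0  # normalize [0, 255] -> [-1, 1]

    def __getitem__(self, idx):
        beauty_path, albedo_path = self.pairs[idx]
        return {
            "beauty": self._load(beauty_path),
            "albedo": self._load(albedo_path),
            "valid_mask": torch.ones(1),  # all ones, inherited from Marigold
        }
```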
Other info:
- valid mask: just set to all ones, inherited from Marigold
Training starter
Overview: training.py is the entry point for training. All configuration for the training process is set up here (read from the config file)
- parser
- initialization (TBC)
- logging settings(TBC)
- device [[code#^da8c7d]]
- check if CUDA is available
- log device information
- snapshot(TBC)
- set “effective batch size”, “accumulation steps” [[code#^b70a62]]
- seed (TBC)
- load dataset [[code#^d2fd95]]
- here the validation dataset is the same as the training dataset
- Models [[code#^438ba0]]
- load models (unet, vae, scheduler, text encoder, tokenizer); see the loading sketch after this list
- put all models together
- resume training if it stopped in the middle (not checked yet, TBC)
- load the checkpoint
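A sketch of how the model components listed above might be loaded with diffusers; the base checkpoint name is an assumption, not necessarily the one used here:

```python
# A sketch of loading the Stable Diffusion components with diffusers;
# the base checkpoint is an assumption.
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

base = "stabilityai/stable-diffusion-2"  # assumed base checkpoint
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
vae = AutoencoderKL.from_pretrained(base, subfolder="vae")
scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
```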
Training
Major goal: concatenate the beauty latent and the noise latent (representing the albedo), and modify the UNet input accordingly
The UNet input must have 8 channels, because we have 4 latent channels for beauty and 4 channels for noise/albedo
_replace_unet_conv_in
[[code#_replace_unet_conv_in]]
==Modify the UNet's input convolution layer so it accepts 8 channels==
- duplicate the conv_in weights along the input-channel dimension and halve them (to preserve the activation magnitude); the bias is reused as-is (see the sketch below)
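A sketch following Marigold's approach (which this project is based on); variable names are illustrative:

```python
# Sketch of the conv_in replacement in the Marigold style: duplicate the
# 4-channel input weights to 8 channels and halve them so the initial
# activation magnitude is preserved; the bias is reused unchanged.
from torch.nn import Conv2d
from torch.nn.parameter import Parameter

def _replace_unet_conv_in(unet):
    _weight = unet.conv_in.weight.clone()   # [out_ch, 4, 3, 3]
    _bias = unet.conv_in.bias.clone()       # [out_ch]
    _weight = _weight.repeat((1, 2, 1, 1))  # duplicate along input channels -> 8
    _weight *= 0.5                          # halve to keep activation magnitude
    _new_conv_in = Conv2d(
        8, unet.conv_in.out_channels, kernel_size=3, stride=1, padding=1
    )
    _new_conv_in.weight = Parameter(_weight)
    _new_conv_in.bias = Parameter(_bias)
    unet.conv_in = _new_conv_in
    unet.config["in_channels"] = 8          # keep the model config consistent
    return unet
```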
_train_step_callback
[[code#^6c41d6]]
- save a backup checkpoint (save_period)
- validation [[Technical Report#validate]]
- save the latest (resumable) checkpoint
- visualization [[Technical Report#visualize]]
validate / visualize
validate
loop over the validation dataloader(s); for each:
- set the dataset name
- set the metrics dictionary
visualize
save the validation images, which double as the visualization images. This calls "validate_single_dataset"
validate_single_dataset
[[code#^03be56]]
overview: load the validation dataloader -> use the pipeline to predict -> calculate metrics -> save the result as an image
- run under torch.no_grad() -> no parameter updates are needed
- move the model to the device, reset the metrics
- generate the seed
- loop over the validation dataloader
- load beauty, albedo, and the valid mask
- create the random generator
- get the predicted albedo (numpy and PIL) from the MaterialOutput (see the sketch below)
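A condensed sketch of this routine; the MaterialOutput fields (albedo_np, albedo_pil) are assumptions based on the description, and MSE stands in for the real metrics:

```python
# A condensed sketch of validate_single_dataset; field names are assumptions.
import torch

@torch.no_grad()  # no parameter updates are needed during validation
def validate_single_dataset(pipe, loader, device, seed, out_dir="val_vis"):
    generator = torch.Generator(device=device).manual_seed(seed)
    scores = []
    for i, batch in enumerate(loader):
        beauty = batch["beauty"].to(device)
        albedo_gt = batch["albedo"].to(device)
        out = pipe(beauty, generator=generator)             # MaterialOutput
        pred = torch.from_numpy(out.albedo_np).to(device)
        scores.append(torch.mean((pred - albedo_gt) ** 2).item())  # stand-in metric
        out.albedo_pil.save(f"{out_dir}/val_{i:04d}.png")   # doubles as visualization
    return sum(scores) / len(scores)
```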
batch size: how many training samples are loaded in one iteration (one forward/backward pass)
effective batch size: how many samples contribute to one parameter update. When the GPU cannot fit that many in one go, the update is split over several smaller batches whose gradients are accumulated: for example, an effective batch size of 64 can be realized as 4 batches of 16.
- do forward -> loss -> backward -> accumulate the gradient on every batch (not every effective batch)
- after 4 such rounds -> update the parameters with optimizer.step() -> zero the gradients
The whole process is called gradient accumulation (see the sketch below).
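A minimal sketch of gradient accumulation, assuming model, optimizer, and dataloader already exist and compute_loss is a hypothetical helper; 64/16 mirror the example above:

```python
# A minimal sketch of gradient accumulation; compute_loss is hypothetical.
effective_batch_size = 64
loader_batch_size = 16
accumulation_steps = effective_batch_size // loader_batch_size  # 4

optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = compute_loss(model, batch)        # forward + loss
    (loss / accumulation_steps).backward()   # backward: gradients accumulate
    if (i + 1) % accumulation_steps == 0:    # once per effective batch
        optimizer.step()                     # update parameters
        optimizer.zero_grad()                # zero the gradients
```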
train loop
- move all to “device”
- if the previous run did not finish, continue the evaluation
- reset train metrics
- for epoch loop
- for dataloader/batch loop
- set the model to training mode (training mode enables Dropout and BatchNorm training behavior)
- generate seeds
- first generate a local seed
- second create a torch generator from it using torch.Generator
- get data [[code#^getdata]]
- beauty from dataloader
- albedo from dataloader (as target)
- mask [[code#^validmask]]
- set batch_size = the loader's batch size (a sketch of the full forward pass follows this list)
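A sketch of one training forward pass in the Marigold style, as described above; 0.18215 is Stable Diffusion's standard latent scale factor, and the argument names are illustrative:

```python
# A sketch of one training forward pass: encode both images to latents, noise
# the albedo latent, concatenate with the beauty latent (8 channels), and
# train the UNet to predict the added noise.
import torch
import torch.nn.functional as F

def training_step(unet, vae, scheduler, text_embed, beauty, albedo, generator):
    with torch.no_grad():  # the VAE stays frozen
        beauty_latent = vae.encode(beauty).latent_dist.sample() * 0.18215
        albedo_latent = vae.encode(albedo).latent_dist.sample() * 0.18215
    noise = torch.randn(albedo_latent.shape, generator=generator,
                        device=albedo_latent.device)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (beauty.shape[0],), device=beauty.device)
    noisy_albedo = scheduler.add_noise(albedo_latent, noise, timesteps)
    unet_input = torch.cat([beauty_latent, noisy_albedo], dim=1)  # 8 channels
    noise_pred = unet(unet_input, timesteps,
                      encoder_hidden_states=text_embed).sample
    return F.mse_loss(noise_pred, noise)  # loss on the predicted noise
```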
Config
Pipeline
- __call__
- _encode_empty_text
- _single_infer
- _encode_rgb
- _decode_albedo
- _ensemble_albedo
- _resize_max_res
- _find_batch_size
__call__
return a MaterialOutput
- check whether the input is a PIL Image or a tensor
- infer albedo
- duplicate the beauty input "ensemble_size" times
- put the copies into a dataloader
- loop over the dataloader, running one inference per batch
- collect all inferred albedos
- ensemble = average of all inferred results
- resize back to the target size
- convert from tensor to numpy (a condensed sketch follows)
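A condensed sketch of the __call__ flow; _single_infer mirrors the method list above, and the ensemble is a simple average, as described:

```python
# A condensed sketch of the pipeline __call__ flow; _single_infer is assumed
# to run one full diffusion inference pass.
import torch
from torch.utils.data import DataLoader, TensorDataset

def call(pipe, beauty: torch.Tensor, ensemble_size: int = 10, batch_size: int = 1):
    duplicated = beauty.unsqueeze(0).repeat(ensemble_size, 1, 1, 1)
    loader = DataLoader(TensorDataset(duplicated), batch_size=batch_size)
    preds = []
    for (batch,) in loader:
        preds.append(pipe._single_infer(batch))    # one inference per batch
    albedo = torch.cat(preds, dim=0).mean(dim=0)   # ensemble = average
    # resizing back to the target size would happen here, then:
    return albedo.cpu().numpy()                    # tensor -> numpy
```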