Nano versions of generative models, for fun. No SOTA here, nano first.

rmgogogo/nano-aigc

AIGC Generative Models

For NLP generative models like GPT, please check https://github.com/rmgogogo/nano-transformers

This repo focuses more on generative models in general. GPT may still be tried here.

This repo uses PyTorch.

VAE

python vae.py --train --epochs 10 --predict

Conditional VAE

python cvae.py --train --epochs 10 --predict

Diffusion

python diffusion.py --train --epochs 100 --predict

On a Mac Mini M1, training takes a little over an hour (1:17:16).

Conditional Diffusion

python conditional_diffusion.py --train --epochs 100 --predict

CLIP

python clip.py --train --epochs 10 --predict

CLIP Pro

A pro version of CLIP. It uses a BERT text encoder with real text. The image side is a nano image VAE while the BERT encoder generates a 768-d vector, and there are only 10 digits, so a batch is very likely to contain the same digit more than once, and then CLIP's loss can't work well. A smaller batch would help, but small batches have their own problems. So the performance is not good. However, it's good enough as a demo to show the essence.
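
To put a number on the duplicate-digit problem, here is a small sketch that computes the chance that a batch contains the same digit at least twice. It assumes labels are drawn uniformly and independently from the 10 digits, which is an assumption for illustration and not something measured in this repo; `prob_duplicate_label` is a hypothetical helper name.

```python
from math import perm


def prob_duplicate_label(batch_size, num_classes=10):
    """Probability that at least two samples in a batch share a label.

    Assumes labels are uniform and independent (an illustrative assumption).
    """
    if batch_size > num_classes:
        return 1.0  # pigeonhole: a repeated label is guaranteed
    # Probability that all labels in the batch are distinct.
    all_distinct = perm(num_classes, batch_size) / num_classes ** batch_size
    return 1.0 - all_distinct


for batch_size in (2, 4, 8, 16):
    print(batch_size, round(prob_duplicate_label(batch_size), 3))
```

Under this assumption, even a batch of 4 has roughly a 50% chance of a repeated digit, and any batch larger than 10 always has one, which is why the contrastive loss sees many false negatives here.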

python clip_pro.py --train --epochs 10 --predict

VQ VAE

python vqvae.py --train --epochs 100 --predict

The codebook size is 32; the whole set of possibilities is displayed here. This sample vector-quantizes the whole z; in a real case, parts of z are quantized.
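
As a rough illustration of quantizing the whole z, the sketch below picks the nearest of 32 codebook vectors by squared L2 distance. It is plain Python so it runs without dependencies, and it is not the code in vqvae.py; the latent size of 8, the random codebook, and the `quantize` helper are assumptions made for the example.

```python
import random


def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return index, codebook[index]


rng = random.Random(0)
latent_dim = 8       # hypothetical latent size
codebook_size = 32   # codebook size stated in this section
codebook = [[rng.uniform(-1, 1) for _ in range(latent_dim)]
            for _ in range(codebook_size)]

z = [rng.uniform(-1, 1) for _ in range(latent_dim)]
index, z_q = quantize(z, codebook)
print(index, z_q)  # the decoder would receive z_q instead of z
```

Because the whole z maps to one of only 32 codes, the decoder can produce at most 32 distinct images, which is why the full set of possibilities can be displayed.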

The initial codebook:

The learned codebook:

DDIM (Faster Diffusion Generation)

Generation is 50 times faster.

python diffusion.py --predict --ddim
python conditional_diffusion.py --predict --ddim
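
The speedup comes from DDIM sampling on a short subsequence of the training timesteps instead of every one of them. The sketch below shows one way such a strided schedule could be built. The step counts (1000 training steps, 20 sampling steps) are assumptions chosen to give a 50x ratio, not values read from this repo, and `ddim_timesteps` is a hypothetical helper.

```python
def ddim_timesteps(train_steps, sample_steps):
    """Evenly strided subsequence of the training timesteps, highest first."""
    stride = train_steps // sample_steps
    return list(range(train_steps - 1, -1, -stride))[:sample_steps]


train_steps = 1000   # assumed number of diffusion steps used in training
sample_steps = 20    # assumed number of DDIM sampling steps
steps = ddim_timesteps(train_steps, sample_steps)

print(steps[:3], "...", steps[-1])
print("model calls saved by a factor of", train_steps // len(steps))
```

Each sampling step still costs one forward pass of the denoising network, so the number of visited timesteps determines generation time.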

Latent Diffusion

Based on the VAE with latent size 8, it does diffusion in latent space. However, since the VAE latent space already maps noise to sensible images and is highly compressed (8 numbers), diffusion in the latent space didn't work as well as expected. It's mainly for demo purposes.
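
For reference, the forward (noising) step of diffusion applied to an 8-number latent instead of pixels might look like the sketch below. It uses the standard closed form z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps. The linear beta schedule, the 1000 steps, and the helper names are assumptions for illustration and may differ from what the repo does.

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(z0, t, abar, rng):
    """Sample z_t from q(z_t | z_0) = N(sqrt(abar_t) * z_0, (1 - abar_t) * I)."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * z + sigma * rng.gauss(0.0, 1.0) for z in z0]


rng = random.Random(0)
abar = alpha_bars()
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for a VAE latent of size 8
z_mid = q_sample(z0, 500, abar, rng)           # partially noised latent
z_end = q_sample(z0, 999, abar, rng)           # almost pure Gaussian noise
print(z_mid)
print(z_end)
```

Since a VAE latent is already trained to look close to a standard Gaussian, the noised and clean latents are hard to tell apart, which fits the observation above that diffusion adds little here.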

GAN

A GAN with a simple conv net, so it's a DCGAN.

Patches VQ VAE

Split each image into smaller 4x4 images (patches), giving a 7x7 grid of patches.

Train a VQ VAE on the patches.

It works like a tokenizer that gives each patch an identifier, so an image can be represented as a 7x7 sequence of tokens. Later we can implement a ViT based on it.

Comparing the Patches VQ VAE with the VQ-VAE or VAE, we find that the image is sharper. However, at the boundaries between patches, some additional low-pass filtering may be needed to make the result smoother.
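
The patch split can be sketched in plain Python as follows. It assumes a 28x28 input (MNIST-sized, which is what 4x4 patches in a 7x7 grid imply) and uses a hypothetical helper `split_into_patches`; the repo itself uses PyTorch tensors.

```python
def split_into_patches(image, patch=4):
    """Split a 2D list (H x W) into a grid of patch x patch sub-images."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [
        [[row[c:c + patch] for row in image[r:r + patch]]
         for c in range(0, width, patch)]
        for r in range(0, height, patch)
    ]


# Stand-in 28x28 image whose pixel value encodes its position.
image = [[r * 28 + c for c in range(28)] for r in range(28)]

grid = split_into_patches(image)                 # grid[i][j] is one 4x4 patch
sequence = [p for row in grid for p in row]      # row-major patch sequence

print(len(grid), len(grid[0]), len(sequence))
```

Each entry of `sequence` would then be quantized to a codebook index, turning the image into 49 token ids.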

The codebook is trained and looks good.

GPT2

GPT2 based on a toy dataset (simple math).

python gpt2.py --train --epochs 400 --predict --input "1 + 1 ="
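
To give an idea of what a simple-math toy dataset could look like, the sketch below generates addition lines in the same shape as the `--input` prompt above. The exact format, operand range, and the helper name `make_addition_examples` are assumptions for illustration; the dataset in gpt2.py may be built differently.

```python
import random


def make_addition_examples(count, max_operand=9, seed=0):
    """Generate lines such as '1 + 1 = 2' (format assumed from the prompt)."""
    rng = random.Random(seed)
    examples = []
    for _ in range(count):
        a = rng.randint(0, max_operand)
        b = rng.randint(0, max_operand)
        examples.append(f"{a} + {b} = {a + b}")
    return examples


examples = make_addition_examples(5)
for line in examples:
    print(line)
```

A model trained on such lines is prompted with everything up to and including `=` and is expected to generate the result.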

LLaMA

python llama.py --train --epochs 400 --predict --input "1 + 1 ="

Gemma

python gemma.py --train --epochs 400 --predict --input "1 + 1 ="

DiT

(1)

Split the image into patches, VQ each patch to tokenize the image into discrete tokens, and then get each token's vector via an embedding. Train a GPT to predict the tokens, which finally generates the image. Diffusion Transformer.

https://arxiv.org/pdf/2212.09748.pdf

(2)

Split the image into patches using a Conv layer to get the token vectors directly.