Making images with in-scene text more realistic.
Original Stable Diffusion repository: https://github.com/CompVis/stable-diffusion
- bf16 – many minor modifications and fp32 conversions for operations that don't support bf16 (`interpolate`); see the sketch after this list
- DeepSpeed support
- Hand-written training loop (`main_nolightning`) that is easier to read and modify than Lightning's
- Code restructuring to improve readability
- LoRA adapter support for the UNet
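A minimal sketch of the fp32 fallback mentioned above, assuming a PyTorch `F.interpolate` call inside a bf16 model (the helper name `interpolate_fp32` is illustrative, not the repo's actual function):

```python
import torch
import torch.nn.functional as F

def interpolate_fp32(x: torch.Tensor, **kwargs) -> torch.Tensor:
    # Some F.interpolate modes lack bf16 kernels, so upcast to fp32
    # for the operation and cast the result back afterwards.
    if x.dtype == torch.bfloat16:
        return F.interpolate(x.float(), **kwargs).to(torch.bfloat16)
    return F.interpolate(x, **kwargs)
```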
A suitable conda environment named `ldm` can be created and activated with:

```
conda env create -f environment.yaml
conda activate ldm
```
Stable Diffusion currently provides the following checkpoints:

- `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on laion2B-en. 194k steps at resolution `512x512` on laion-high-resolution (170M examples from LAION-5B with resolution `>= 1024x1024`).
- `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`. 515k steps at resolution `512x512` on laion-aesthetics v2 5+ (a subset of laion2B-en with an estimated aesthetics score `> 5.0`, additionally filtered to images with an original size `>= 512x512` and an estimated watermark probability `< 0.5`; the watermark estimate comes from the LAION-5B metadata, and the aesthetics score is estimated with the LAION-Aesthetics Predictor V2).
- `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-aesthetics v2 5+", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling (sketched after this list).
- `sd-v1-4.ckpt`: Resumed from `sd-v1-2.ckpt`. 225k steps at resolution `512x512` on "laion-aesthetics v2 5+", with 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of the checkpoints.
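Classifier-free guidance combines the conditional and unconditional noise predictions at each sampling step; a minimal sketch of that combination (tensor names are illustrative):

```python
import torch

def guided_eps(eps_uncond: torch.Tensor, eps_cond: torch.Tensor,
               guidance_scale: float) -> torch.Tensor:
    # guidance_scale = 1.0 recovers the purely conditional prediction;
    # the scales above (1.5 ... 8.0) extrapolate away from the
    # unconditional prediction to strengthen prompt adherence.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```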
Stable Diffusion is a latent diffusion model conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder. We provide a reference script for sampling, but there also exists a diffusers integration, where we expect to see more active community development.
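For illustration, a sketch of obtaining the non-pooled CLIP ViT-L/14 text embeddings with Hugging Face `transformers` (an alternative to the repo's bundled encoder, not the code path used here):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["a street sign that says 'OPEN'"],
                   padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
# last_hidden_state holds the per-token (non-pooled) embeddings the
# UNet cross-attends to: shape (batch, 77, 768) for ViT-L/14.
embeddings = text_encoder(**tokens).last_hidden_state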
Mining from LAION-5B:

- Run the download script like this (the sharding logic is sketched after this list):

```
python scripts/download_laion_sharded.py \
    --input_dir "data/text-laion-20M" \
    --output_dir "data/text-laion-20M-images" \
    --shard_size 100000 \
    --num_shards 100
```

- Run the following script to apply an OCR system to the images and filter out the ones that don't contain text (an illustrative filter also follows the list):

```
python scripts/ocr.py \
    --input_dir "data/text-laion-20M-images" \
    --output_dir "data/text-laion-20M-images-with-text" \
    --num_shards 100
```
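The `--shard_size`/`--num_shards` flags above imply partitioning the 20M-example metadata into fixed-size shards; a hypothetical sketch of that logic (the repo's `scripts/download_laion_sharded.py` is the authoritative implementation):

```python
import itertools
from typing import Iterable, Iterator, List, Tuple

def iter_shards(urls: Iterable[str], shard_size: int = 100_000,
                num_shards: int = 100) -> Iterator[Tuple[int, List[str]]]:
    # Cut the URL stream into num_shards chunks of shard_size entries
    # each, so shards can be downloaded independently and in parallel.
    it = iter(urls)
    for shard_idx in range(num_shards):
        shard = list(itertools.islice(it, shard_size))
        if not shard:
            break
        yield shard_idx, shard
```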
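As an illustration of the OCR filtering step, a minimal sketch using `pytesseract` (an assumption; the OCR system actually used by `scripts/ocr.py` may differ):

```python
from PIL import Image
import pytesseract

def has_text(path: str, min_chars: int = 3) -> bool:
    # Keep an image only if OCR recovers at least min_chars
    # non-whitespace characters; min_chars is an arbitrary threshold.
    text = pytesseract.image_to_string(Image.open(path))
    return len(text.strip()) >= min_chars
```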
- The Stable Diffusion codebase for the diffusion models builds heavily on OpenAI's ADM codebase and https://github.com/lucidrains/denoising-diffusion-pytorch. Thanks for open-sourcing!
- The implementation of the transformer encoder is from x-transformers by lucidrains.
```
@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models},
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```