Texty Diffusion

Making images with in-scene text more realistic by replacing Stable Diffusion's text backbone with T5-XL.

Changes relative to the original Stable Diffusion repository (https://github.com/CompVis/stable-diffusion):

  • bf16 training – many minor modifications, plus fp32 conversions for operations that don't support bf16 (e.g. interpolate); see the sketch after this list
  • DeepSpeed support
  • Hand-written training loop (main_nolightning) that is easier to read and modify than the Lightning one
  • Code restructuring to improve readability
  • LoRA adapter support for the UNet
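
The fp32 conversions mentioned above typically look like the following. This is a minimal sketch, not the repository's actual code (the helper name interpolate_fp32 is illustrative); it assumes PyTorch's F.interpolate, which does not accept bf16 inputs on all backends.

import torch
import torch.nn.functional as F

def interpolate_fp32(x: torch.Tensor, **kwargs) -> torch.Tensor:
    # Illustrative helper: run F.interpolate in fp32 and cast back.
    # The activation is temporarily upcast to float32 for the unsupported
    # op, then restored to its original dtype (e.g. torch.bfloat16).
    orig_dtype = x.dtype
    if orig_dtype == torch.bfloat16:
        x = x.float()
    out = F.interpolate(x, **kwargs)  # e.g. scale_factor=2, mode="nearest"
    return out.to(orig_dtype)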

Requirements

A suitable conda environment named ldm can be created and activated with:

conda env create -f environment.yaml
conda activate ldm

Weights

Stable Diffusion currently provides the following checkpoints:

  • sd-v1-1.ckpt: 237k steps at resolution 256x256 on laion2B-en. 194k steps at resolution 512x512 on laion-high-resolution (170M examples from LAION-5B with resolution >= 1024x1024).
  • sd-v1-2.ckpt: Resumed from sd-v1-1.ckpt. 515k steps at resolution 512x512 on laion-aesthetics v2 5+ (a subset of laion2B-en with estimated aesthetics score > 5.0, and additionally filtered to images with an original size >= 512x512, and an estimated watermark probability < 0.5. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using the LAION-Aesthetics Predictor V2).
  • sd-v1-3.ckpt: Resumed from sd-v1-2.ckpt. 195k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.
  • sd-v1-4.ckpt: Resumed from sd-v1-2.ckpt. 225k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve classifier-free guidance sampling.

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of the checkpoints (see the "sd evaluation results" figure).

Text-to-Image with Stable Diffusion


Stable Diffusion is a latent diffusion model conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder. We provide a reference script for sampling, but there is also a diffusers integration, for which we expect more active community development.
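
For example, sampling through the diffusers integration can look roughly like this (a minimal sketch using the public CompVis/stable-diffusion-v1-4 weights; adjust the model id, dtype, and device to your setup):

import torch
from diffusers import StableDiffusionPipeline

# Load the v1-4 weights from the Hugging Face Hub (fp16 to save GPU memory).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]  # one 512x512 PIL image
image.save("astronaut_rides_horse.png")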

Texty Caps Dataset

Mining from LAION-5B:

  1. Run the download script:
python scripts/download_laion_sharded.py \
    --input_dir "data/text-laion-20M" \
    --output_dir "data/text-laion-20M-images" \
    --shard_size 100000 \
    --num_shards 100
  2. Run the following script to apply an OCR system to the images and filter out the ones that don't contain text:
python scripts/ocr.py \
    --input_dir "data/text-laion-20M-images" \
    --output_dir "data/text-laion-20M-images-with-text" \
    --num_shards 100
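
The repository's scripts/ocr.py handles the OCR filtering step; a minimal standalone sketch of the same idea, assuming the easyocr package (the actual script may use a different OCR engine, thresholds, and sharding layout), could look like:

import os
import shutil
import easyocr

# Hypothetical, simplified version of the OCR filtering step.
reader = easyocr.Reader(["en"], gpu=True)

input_dir = "data/text-laion-20M-images"
output_dir = "data/text-laion-20M-images-with-text"
os.makedirs(output_dir, exist_ok=True)

for fname in os.listdir(input_dir):
    if not fname.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    path = os.path.join(input_dir, fname)
    # readtext returns (bounding box, text, confidence) triples
    detections = reader.readtext(path)
    has_text = any(conf > 0.5 and text.strip() for _, text, conf in detections)
    if has_text:
        shutil.copy(path, os.path.join(output_dir, fname))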

Comments

BibTeX

@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models}, 
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
