|
| 1 | +<!-- Copyright 2025 The HuggingFace Team. All rights reserved. |
| 2 | +# |
| 3 | +# Licensed under the Apache License, Version 2.0 (the "License"); |
| 4 | +# you may not use this file except in compliance with the License. |
| 5 | +# You may obtain a copy of the License at |
| 6 | +# |
| 7 | +# http://www.apache.org/licenses/LICENSE-2.0 |
| 8 | +# |
| 9 | +# Unless required by applicable law or agreed to in writing, software |
| 10 | +# distributed under the License is distributed on an "AS IS" BASIS, |
| 11 | +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 12 | +# See the License for the specific language governing permissions and |
| 13 | +# limitations under the License. --> |
| 14 | + |
| 15 | +# Photon |
| 16 | + |
| 17 | + |
| 18 | +Photon generates high-quality images from text using a simplified MMDiT architecture in which text tokens are not updated by the transformer blocks. It uses flow matching with discrete scheduling for efficient sampling and Google's T5Gemma-2B-2B-UL2 model for multilingual text encoding. The ~1.3B-parameter transformer delivers fast inference without sacrificing quality. You can choose between the Flux VAE (8x compression, 16 latent channels) for balanced quality and speed, or DC-AE (32x compression, 32 latent channels) for higher latent compression and faster processing. |
| 19 | + |
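To make the VAE trade-off concrete, the sketch below computes the latent tensor shape each option would produce for a given image size, using only the compression factors and channel counts quoted above. The names `VAE_SPECS` and `latent_shape` are illustrative and not part of the Photon or diffusers API, and the calculation assumes the compression factor applies equally to height and width.

```py
# Illustrative only: VAE_SPECS and latent_shape are hypothetical names,
# not part of diffusers. Values come from the description above.
VAE_SPECS = {
    "flux": {"compression": 8, "latent_channels": 16},
    "dc-ae": {"compression": 32, "latent_channels": 32},
}


def latent_shape(vae: str, height: int, width: int) -> tuple:
    """Return (channels, latent_height, latent_width) for an image size."""
    spec = VAE_SPECS[vae]
    factor = spec["compression"]
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return (spec["latent_channels"], height // factor, width // factor)


for name in VAE_SPECS:
    channels, h, w = latent_shape(name, 512, 512)
    print(f"{name}: latent shape {(channels, h, w)}, {channels * h * w} values")
```

At 512x512 this gives a (16, 64, 64) latent for the Flux VAE and a (32, 16, 16) latent for DC-AE, so DC-AE hands the transformer 8x fewer latent values.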
| 20 | +## Available models |
| 21 | + |
| 22 | +Photon offers multiple variants with different VAE configurations, each optimized for specific resolutions. Base models excel with detailed prompts, capturing complex compositions and subtle details. Fine-tuned models trained on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) improve aesthetic quality, especially with simpler prompts. |
| 23 | + |
| 24 | + |
| 25 | +| Model | Resolution | Fine-tuned | Distilled | Description | Suggested prompts | Suggested parameters | Recommended dtype | |
| 26 | +|:-----:|:-----------------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:| |
| 27 | +| [`Photoroom/photon-256-t2i`](https://huggingface.co/Photoroom/photon-256-t2i)| 256 | No | No | Base model pre-trained at 256 with Flux VAE|Works best with detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | |
| 28 | +| [`Photoroom/photon-256-t2i-sft`](https://huggingface.co/Photoroom/photon-256-t2i-sft)| 256 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) with Flux VAE | Can handle less detailed prompts|28 steps, cfg=5.0| `torch.bfloat16` | |
| 29 | +| [`Photoroom/photon-512-t2i`](https://huggingface.co/Photoroom/photon-512-t2i)| 512 | No | No | Base model pre-trained at 512 with Flux VAE |Works best with detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | |
| 30 | +| [`Photoroom/photon-512-t2i-sft`](https://huggingface.co/Photoroom/photon-512-t2i-sft)| 512 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) with Flux VAE | Can handle less detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | |
| 31 | +| [`Photoroom/photon-512-t2i-sft-distilled`](https://huggingface.co/Photoroom/photon-512-t2i-sft-distilled)| 512 | Yes | Yes | 8-step distilled model from [`Photoroom/photon-512-t2i-sft`](https://huggingface.co/Photoroom/photon-512-t2i-sft) | Can handle less detailed prompts in natural language|8 steps, cfg=1.0| `torch.bfloat16` | |
| 32 | +| [`Photoroom/photon-512-t2i-dc-ae`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae)| 512 | No | No | Base model pre-trained at 512 with [Deep Compression Autoencoder (DC-AE)](https://hanlab.mit.edu/projects/dc-ae)|Works best with detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | |
| 33 | +| [`Photoroom/photon-512-t2i-dc-ae-sft`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae-sft)| 512 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) with [Deep Compression Autoencoder (DC-AE)](https://hanlab.mit.edu/projects/dc-ae) | Can handle less detailed prompts in natural language|28 steps, cfg=5.0| `torch.bfloat16` | |
| 34 | +| [`Photoroom/photon-512-t2i-dc-ae-sft-distilled`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae-sft-distilled)| 512 | Yes | Yes | 8-step distilled model from [`Photoroom/photon-512-t2i-dc-ae-sft`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae-sft) | Can handle less detailed prompts in natural language|8 steps, cfg=1.0| `torch.bfloat16` | |
| 35 | + |
| 36 | +Refer to the [Photon models collection](https://huggingface.co/collections/Photoroom/photon-models-68e66254c202ebfab99ad38e) for more information. |
| 37 | + |
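The suggested parameters in the table follow one rule: distilled checkpoints use 8 steps with cfg=1.0, and all other checkpoints use 28 steps with cfg=5.0. Below is a minimal sketch of that rule as a helper. `suggested_call_kwargs` is a hypothetical name, and detecting distilled checkpoints by the `-distilled` suffix is an assumption based on the naming in the table, not something the library does.

```py
def suggested_call_kwargs(model_id: str) -> dict:
    """Return the table's suggested sampling settings for a Photon checkpoint.

    Hypothetical helper: it relies on the "-distilled" suffix used by the
    checkpoint names listed in the table above.
    """
    if model_id.endswith("-distilled"):
        return {"num_inference_steps": 8, "guidance_scale": 1.0}
    return {"num_inference_steps": 28, "guidance_scale": 5.0}


print(suggested_call_kwargs("Photoroom/photon-512-t2i-sft-distilled"))
print(suggested_call_kwargs("Photoroom/photon-512-t2i-sft"))
```

The returned dictionary can be unpacked into the pipeline call, for example `pipe(prompt, **suggested_call_kwargs(model_id))`.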
| 38 | +## Loading the pipeline |
| 39 | + |
| 40 | +Load the pipeline with [`~DiffusionPipeline.from_pretrained`]. |
| 41 | + |
| 42 | +```py |
| 43 | +import torch |
| 44 | +from diffusers.pipelines.photon import PhotonPipeline |
| 45 | +# Load pipeline - VAE and text encoder will be loaded from the Hugging Face Hub |
| 46 | +pipe = PhotonPipeline.from_pretrained("Photoroom/photon-512-t2i-sft", torch_dtype=torch.bfloat16) |
| 47 | +pipe.to("cuda") |
| 48 | + |
| 49 | +prompt = "A front-facing portrait of a lion in the golden savanna at sunset." |
| 50 | +image = pipe(prompt, num_inference_steps=28, guidance_scale=5.0).images[0] |
| 51 | +image.save("photon_output.png") |
| 52 | +``` |
| 53 | + |
| 54 | +### Manual Component Loading |
| 55 | + |
| 56 | +Load components individually to customize the pipeline, for instance to use quantized models. |
| 57 | + |
| 58 | +```py |
| 59 | +import torch |
| 60 | +from diffusers.pipelines.photon import PhotonPipeline |
| 61 | +from diffusers.models import AutoencoderKL, AutoencoderDC |
| 62 | +from diffusers.models.transformers.transformer_photon import PhotonTransformer2DModel |
| 63 | +from diffusers.schedulers import FlowMatchEulerDiscreteScheduler |
| 64 | +from transformers import T5GemmaModel, GemmaTokenizerFast |
| 65 | +from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig |
| 66 | +from transformers import BitsAndBytesConfig |
| 67 | + |
| 68 | +quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True) |
| 69 | +# Load transformer |
| 70 | +transformer = PhotonTransformer2DModel.from_pretrained( |
| 71 | + "checkpoints/photon-512-t2i-sft", |
| 72 | + subfolder="transformer", |
| 73 | + quantization_config=quant_config, |
| 74 | + torch_dtype=torch.bfloat16, |
| 75 | +) |
| 76 | + |
| 77 | +# Load scheduler |
| 78 | +scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained( |
| 79 | + "checkpoints/photon-512-t2i-sft", subfolder="scheduler" |
| 80 | +) |
| 81 | + |
| 82 | +# Load T5Gemma text encoder |
| 83 | +t5gemma_model = T5GemmaModel.from_pretrained("google/t5gemma-2b-2b-ul2", |
| 84 | + quantization_config=BitsAndBytesConfig(load_in_8bit=True), # transformers config, not the diffusers one |
| 85 | + torch_dtype=torch.bfloat16) |
| 86 | +text_encoder = t5gemma_model.encoder.to(dtype=torch.bfloat16) |
| 87 | +tokenizer = GemmaTokenizerFast.from_pretrained("google/t5gemma-2b-2b-ul2") |
| 88 | +tokenizer.model_max_length = 256 |
| 89 | + |
| 90 | +# Load VAE - choose either Flux VAE or DC-AE |
| 91 | +# Flux VAE |
| 92 | +vae = AutoencoderKL.from_pretrained("black-forest-labs/FLUX.1-dev", |
| 93 | + subfolder="vae", |
| 94 | + quantization_config=quant_config, |
| 95 | + torch_dtype=torch.bfloat16) |
| 96 | + |
| 97 | +pipe = PhotonPipeline( |
| 98 | + transformer=transformer, |
| 99 | + scheduler=scheduler, |
| 100 | + text_encoder=text_encoder, |
| 101 | + tokenizer=tokenizer, |
| 102 | + vae=vae |
| 103 | +) |
| 104 | +pipe.to("cuda") |
| 105 | +``` |
| 106 | + |
| 107 | + |
| 108 | +## Memory Optimization |
| 109 | + |
| 110 | +For memory-constrained environments: |
| 111 | + |
| 112 | +```py |
| 113 | +import torch |
| 114 | +from diffusers.pipelines.photon import PhotonPipeline |
| 115 | + |
| 116 | +pipe = PhotonPipeline.from_pretrained("Photoroom/photon-512-t2i-sft", torch_dtype=torch.bfloat16) |
| 117 | +pipe.enable_model_cpu_offload() # Offload components to CPU when not in use |
| 118 | + |
| 119 | +# Alternatively, replace the call above with sequential CPU offload for even lower memory use (slower inference) |
| 120 | +# pipe.enable_sequential_cpu_offload() |
| 121 | +``` |
| 122 | + |
| 123 | +## PhotonPipeline |
| 124 | + |
| 125 | +[[autodoc]] PhotonPipeline |
| 126 | + - all |
| 127 | + - __call__ |
| 128 | + |
| 129 | +## PhotonPipelineOutput |
| 130 | + |
| 131 | +[[autodoc]] pipelines.photon.pipeline_output.PhotonPipelineOutput |