
Commit cefc2cf

DavidBertdavidbdavid-PHRdavidbstevhliu
authored
Add Photon model and pipeline support (#12456)
* Add Photon model and pipeline support

  This commit adds support for the Photon image generation model:
  - PhotonTransformer2DModel: Core transformer architecture
  - PhotonPipeline: Text-to-image generation pipeline
  - Attention processor updates for Photon-specific attention mechanism
  - Conversion script for loading Photon checkpoints
  - Documentation and tests
* just store the T5Gemma encoder
* enhance_vae_properties if vae is provided only
* remove autocast for text encoder forwad
* BF16 example
* conditioned CFG
* remove enhance vae and use vae.config directly when possible
* move PhotonAttnProcessor2_0 in transformer_photon
* remove einops dependency and now inherits from AttentionMixin
* unify the structure of the forward block
* update doc
* update doc
* fix T5Gemma loading from hub
* fix timestep shift
* remove lora support from doc
* Rename EmbedND for PhotoEmbedND
* remove modulation dataclass
* put _attn_forward and _ffn_forward logic in PhotonBlock's forward
* renam LastLayer for FinalLayer
* remove lora related code
* rename vae_spatial_compression_ratio for vae_scale_factor
* support prompt_embeds in call
* move xattention conditionning out computation out of the denoising loop
* add negative prompts
* Use _import_structure for lazy loading
* make quality + style
* add pipeline test + corresponding fixes
* utility function that determines the default resolution given the VAE
* Refactor PhotonAttention to match Flux pattern
* built-in RMSNorm
* Revert accidental .gitignore change
* parameter names match the standard diffusers conventions
* renaming and remove unecessary attributes setting
* Update docs/source/en/api/pipelines/photon.md
  Co-authored-by: Steven Liu <[email protected]>
* quantization example
* added doc to toctree
* Update docs/source/en/api/pipelines/photon.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/api/pipelines/photon.md
  Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/api/pipelines/photon.md
  Co-authored-by: Steven Liu <[email protected]>
* use dispatch_attention_fn for multiple attention backend support
* naming changes
* make fix copy
* Update docs/source/en/api/pipelines/photon.md
  Co-authored-by: dg845 <[email protected]>
* Add PhotonTransformer2DModel to TYPE_CHECKING imports
* make fix-copies
* Use Tuple instead of tuple
  Co-authored-by: dg845 <[email protected]>
* restrict the version of transformers
  Co-authored-by: dg845 <[email protected]>
* Update tests/pipelines/photon/test_pipeline_photon.py
  Co-authored-by: dg845 <[email protected]>
* Update tests/pipelines/photon/test_pipeline_photon.py
  Co-authored-by: dg845 <[email protected]>
* change | for Optional
* fix nits.
* use typing Dict

---------

Co-authored-by: davidb <davidb@worker-10.soperator-worker-svc.soperator.svc.cluster.local>
Co-authored-by: David Briand <[email protected]>
Co-authored-by: davidb <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Co-authored-by: dg845 <[email protected]>
Co-authored-by: sayakpaul <[email protected]>
1 parent b3e56e7 commit cefc2cf

16 files changed

+2501
-0
lines changed


docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -541,6 +541,8 @@
   title: PAG
 - local: api/pipelines/paint_by_example
   title: Paint by Example
+- local: api/pipelines/photon
+  title: Photon
 - local: api/pipelines/pixart
   title: PixArt-α
 - local: api/pipelines/pixart_sigma
docs/source/en/api/pipelines/photon.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# Photon


Photon generates high-quality images from text using a simplified MMDiT architecture in which the text tokens are not updated by the transformer blocks. It uses flow matching with discrete scheduling for efficient sampling, and Google's T5Gemma-2B-2B-UL2 model for multilingual text encoding. The ~1.3B parameter transformer delivers fast inference without sacrificing quality. You can choose between the Flux VAE (8x compression, 16 latent channels) for a balance of quality and speed, or DC-AE (32x compression, 32 latent channels) for higher latent compression and faster processing.
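
To make the two compression settings concrete, the sketch below works out the latent tensor shape each VAE would produce for a 512x512 image. It only uses the compression factors and channel counts quoted above; `latent_shape` is a hypothetical helper written for this illustration and is not part of diffusers.

```python
def latent_shape(height, width, compression, latent_channels):
    """Return (channels, latent_height, latent_width) for an image of the given size.

    Hypothetical helper for illustration only; it is not a diffusers API.
    """
    if height % compression or width % compression:
        raise ValueError("image size must be divisible by the VAE compression factor")
    return (latent_channels, height // compression, width // compression)


# Figures quoted in the overview: Flux VAE is 8x / 16 channels, DC-AE is 32x / 32 channels.
flux_latents = latent_shape(512, 512, compression=8, latent_channels=16)
dc_ae_latents = latent_shape(512, 512, compression=32, latent_channels=32)

print(flux_latents)   # (16, 64, 64)
print(dc_ae_latents)  # (32, 16, 16)
```

Under these assumptions DC-AE leaves 16x16 spatial positions instead of 64x64, which is consistent with the faster processing described above.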

## Available models

Photon offers multiple variants with different VAE configurations, each optimized for a specific resolution. Base models excel with detailed prompts, capturing complex compositions and subtle details. Fine-tuned models, trained on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist), improve aesthetic quality, especially with simpler prompts.


| Model | Resolution | Fine-tuned | Distilled | Description | Suggested prompts | Suggested parameters | Recommended dtype |
|:-----:|:----------:|:----------:|:---------:|:-----------:|:-----------------:|:--------------------:|:-----------------:|
| [`Photoroom/photon-256-t2i`](https://huggingface.co/Photoroom/photon-256-t2i) | 256 | No | No | Base model pre-trained at 256 with Flux VAE | Works best with detailed prompts in natural language | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/photon-256-t2i-sft`](https://huggingface.co/Photoroom/photon-256-t2i-sft) | 256 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) with Flux VAE | Can handle less detailed prompts | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/photon-512-t2i`](https://huggingface.co/Photoroom/photon-512-t2i) | 512 | No | No | Base model pre-trained at 512 with Flux VAE | Works best with detailed prompts in natural language | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/photon-512-t2i-sft`](https://huggingface.co/Photoroom/photon-512-t2i-sft) | 512 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) with Flux VAE | Can handle less detailed prompts in natural language | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/photon-512-t2i-sft-distilled`](https://huggingface.co/Photoroom/photon-512-t2i-sft-distilled) | 512 | Yes | Yes | 8-step distilled model from [`Photoroom/photon-512-t2i-sft`](https://huggingface.co/Photoroom/photon-512-t2i-sft) | Can handle less detailed prompts in natural language | 8 steps, cfg=1.0 | `torch.bfloat16` |
| [`Photoroom/photon-512-t2i-dc-ae`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae) | 512 | No | No | Base model pre-trained at 512 with [Deep Compression Autoencoder (DC-AE)](https://hanlab.mit.edu/projects/dc-ae) | Works best with detailed prompts in natural language | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/photon-512-t2i-dc-ae-sft`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae-sft) | 512 | Yes | No | Fine-tuned on the [Alchemist dataset](https://huggingface.co/datasets/yandex/alchemist) with [Deep Compression Autoencoder (DC-AE)](https://hanlab.mit.edu/projects/dc-ae) | Can handle less detailed prompts in natural language | 28 steps, cfg=5.0 | `torch.bfloat16` |
| [`Photoroom/photon-512-t2i-dc-ae-sft-distilled`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae-sft-distilled) | 512 | Yes | Yes | 8-step distilled model from [`Photoroom/photon-512-t2i-dc-ae-sft`](https://huggingface.co/Photoroom/photon-512-t2i-dc-ae-sft) | Can handle less detailed prompts in natural language | 8 steps, cfg=1.0 | `torch.bfloat16` |

Refer to the [Photon collection](https://huggingface.co/collections/Photoroom/photon-models-68e66254c202ebfab99ad38e) for more information.
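
The suggested parameters in the table follow a simple pattern: distilled checkpoints use 8 steps with cfg=1.0, and all others use 28 steps with cfg=5.0. The sketch below encodes that pattern. `suggested_params` is a hypothetical helper, not a diffusers function, and it assumes that distilled checkpoints can be recognized by the `-distilled` suffix seen in the model names above.

```python
def suggested_params(model_id):
    """Return the sampling settings suggested in the table for a Photon checkpoint.

    Hypothetical helper for illustration; assumes distilled checkpoints
    end with "-distilled", as the model names in the table do.
    """
    if model_id.endswith("-distilled"):
        # Distilled models: 8 steps, cfg=1.0
        return {"num_inference_steps": 8, "guidance_scale": 1.0}
    # Base and fine-tuned models: 28 steps, cfg=5.0
    return {"num_inference_steps": 28, "guidance_scale": 5.0}


print(suggested_params("Photoroom/photon-512-t2i-sft-distilled"))
print(suggested_params("Photoroom/photon-512-t2i-sft"))
```

The returned dictionary uses the keyword names `num_inference_steps` and `guidance_scale`, so it could be unpacked into a pipeline call with `**suggested_params(model_id)`.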

## Loading the pipeline

Load the pipeline with [`~DiffusionPipeline.from_pretrained`].

```py
import torch
from diffusers.pipelines.photon import PhotonPipeline

# Load the pipeline; the VAE and text encoder are downloaded from the Hugging Face Hub
pipe = PhotonPipeline.from_pretrained("Photoroom/photon-512-t2i-sft", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A front-facing portrait of a lion in the golden savanna at sunset."
image = pipe(prompt, num_inference_steps=28, guidance_scale=5.0).images[0]
image.save("photon_output.png")
```

### Manual component loading

Load components individually to customize the pipeline, for instance to use quantized models.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.models import AutoencoderKL, AutoencoderDC
from diffusers.models.transformers.transformer_photon import PhotonTransformer2DModel
from diffusers.pipelines.photon import PhotonPipeline
from diffusers.schedulers import FlowMatchEulerDiscreteScheduler
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from transformers import GemmaTokenizerFast, T5GemmaModel

# Each library has its own config class: use the diffusers one for diffusers
# models and the transformers one for the text encoder
diffusers_quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformers_quant_config = TransformersBitsAndBytesConfig(load_in_8bit=True)

# Load transformer (here from a local checkpoint directory)
transformer = PhotonTransformer2DModel.from_pretrained(
    "checkpoints/photon-512-t2i-sft",
    subfolder="transformer",
    quantization_config=diffusers_quant_config,
    torch_dtype=torch.bfloat16,
)

# Load scheduler
scheduler = FlowMatchEulerDiscreteScheduler.from_pretrained(
    "checkpoints/photon-512-t2i-sft", subfolder="scheduler"
)

# Load T5Gemma and keep only its encoder as the text encoder
t5gemma_model = T5GemmaModel.from_pretrained(
    "google/t5gemma-2b-2b-ul2",
    quantization_config=transformers_quant_config,
    torch_dtype=torch.bfloat16,
)
text_encoder = t5gemma_model.encoder.to(dtype=torch.bfloat16)
tokenizer = GemmaTokenizerFast.from_pretrained("google/t5gemma-2b-2b-ul2")
tokenizer.model_max_length = 256

# Load VAE - choose either Flux VAE (AutoencoderKL) or DC-AE (AutoencoderDC)
# Flux VAE
vae = AutoencoderKL.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="vae",
    quantization_config=diffusers_quant_config,
    torch_dtype=torch.bfloat16,
)

pipe = PhotonPipeline(
    transformer=transformer,
    scheduler=scheduler,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    vae=vae,
)
pipe.to("cuda")
```


## Memory optimization

For memory-constrained environments, enable one of the CPU offloading modes:

```py
import torch
from diffusers.pipelines.photon import PhotonPipeline

pipe = PhotonPipeline.from_pretrained("Photoroom/photon-512-t2i-sft", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # Offload whole components to the CPU when not in use

# Or, instead of the call above, use sequential CPU offload for even lower
# memory use (at the cost of slower inference):
# pipe.enable_sequential_cpu_offload()
```

## PhotonPipeline

[[autodoc]] PhotonPipeline
  - all
  - __call__

## PhotonPipelineOutput

[[autodoc]] pipelines.photon.pipeline_output.PhotonPipelineOutput
