[Model]: add FLUX.2-dev model#1629
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a53145a246
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Why OOM here, when diffusers works fine?
hsliuustc0106
left a comment
Architectural Code Review
📋 Summary
| Item | Details |
|---|---|
| PR | [Model]: add FLUX.2-dev model |
| Author | @nuclearwu |
| Scale | +1871 lines (2 new files, 6 modified) |
| Status | Needs Changes |
✅ Strengths
1. Complete Model Implementation
- Full transformer architecture (764 lines)
- Complete pipeline (1081 lines)
- Proper registry integration
2. Performance Benchmarks
| Config | Time | vs diffusers |
|---|---|---|
| TP=2 | 39.1s/img | 2.7x faster |
| TP=4 | 29.1s/img | 3.6x faster |
3. Architecture Patterns
- Proper Mixin composition: `CFGParallelMixin`, `SupportImageInput`
- Fused QKV+MLP projection: `Flux2ParallelSelfAttention`
- RoPE integration with `RotaryEmbedding(is_neox_style=False)`
🔴 Critical Issues
1. Zero Test Coverage
+1871 lines of new code
+0 test files
Risk: No regression protection for:
- Weight loading (`load_weights` with `stacked_params_mapping`)
- TP sharding logic
- Image preprocessing pipeline
- Text encoder integration
Required:
```python
# tests/diffusion/models/flux2/test_flux2_transformer.py
def test_weight_loading_tp2():
    """Verify weights load correctly with TP=2"""

def test_rope_position_embedding():
    """Verify RoPE produces correct embeddings for 4D coords"""

def test_packed_module_mapping():
    """Verify to_qkv packing matches HF checkpoint"""
```
2. Weight Loading Typo
```python
# flux2_transformer.py:716
if "to_qkvkv_mlp_proj" in name:  # ❌ Typo: qkvkv
    name = name.replace("to_qkvkv_mlp_proj", "to_qkv_mlp_proj")
```
Questions:
- What HF checkpoint has this typo?
- Is this a diffusers bug or model-specific?
Fix: Add comment explaining the source, or fix upstream if possible.
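One hedged way to satisfy that fix: centralize the substitution in a small remap table whose comment records where the odd key comes from. This is a sketch only; `remap_weight_name`, the table name, and the claim about where `to_qkvkv_mlp_proj` originates are assumptions the PR author would need to confirm.

```python
# Sketch of the name-remap step discussed above. The mapping is
# hypothetical: it assumes the "to_qkvkv_mlp_proj" key originates in the
# published HF checkpoint, which is exactly what this review asks the
# author to confirm and document.
CHECKPOINT_NAME_REMAPS = {
    # Source of this key unconfirmed (HF checkpoint vs. diffusers export);
    # normalize it to the fused projection name used by the model.
    "to_qkvkv_mlp_proj": "to_qkv_mlp_proj",
}

def remap_weight_name(name: str) -> str:
    """Apply known checkpoint-to-model parameter-name substitutions."""
    for old, new in CHECKPOINT_NAME_REMAPS.items():
        if old in name:
            name = name.replace(old, new)
    return name
```

Keeping all substitutions in one table means each one carries its provenance comment, instead of a bare `replace` buried in `load_weights`.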
3. TP=1 OOM Without Explanation
| Model/TP | TP=1 | TP=2 | TP=4 |
|---|---|---|---|
| Flux.2-dev | OOM | ✅ | ✅ |
Missing:
- Memory requirement estimate
- Minimum GPU memory for each TP config
- `gpu_memory_utilization` tuning guidance
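To make the missing estimate concrete, here is a back-of-envelope sketch using the component sizes quoted later in this thread (~64.3 GB transformer, ~48.0 GB T5-XXL text encoder). Whether the text encoder is actually sharded across TP ranks in vllm-omni is an assumption of this sketch, and activations, VAE, and workspace memory are ignored.

```python
# Back-of-envelope per-GPU weight footprint under tensor parallelism.
# Component sizes are the figures quoted in this PR thread; whether the
# text encoder is sharded across TP ranks is an assumption, not a fact
# verified against the implementation.
TRANSFORMER_GB = 64.3
TEXT_ENCODER_GB = 48.0

def weights_per_gpu_gb(tp: int, shard_text_encoder: bool = True) -> float:
    """Ideal weight memory per GPU, ignoring activations and buffers."""
    encoder = TEXT_ENCODER_GB / tp if shard_text_encoder else TEXT_ENCODER_GB
    return TRANSFORMER_GB / tp + encoder

for tp in (1, 2, 4):
    print(f"TP={tp}: ~{weights_per_gpu_gb(tp):.1f} GB of weights per GPU")
```

This is consistent with the reported behavior: at TP=1 the weights alone (~112 GB) exceed an 80 GB card, while TP=2 and TP=4 leave headroom for activations.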
🟡 Significant Concerns
4. Code Attribution from diffusers
```python
# pipeline_flux2.py:27-33
from diffusers import AutoencoderKLFlux2, FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.flux2.pipeline_flux2 import UPSAMPLING_MAX_IMAGE_SIZE
from diffusers.pipelines.flux2.system_messages import SYSTEM_MESSAGE, ...
```
Large sections appear copied from diffusers:
- `retrieve_timesteps` (67 lines) - copied with "Copied from diffusers"
- `retrieve_latents` (12 lines) - copied with "Copied from diffusers"
- `_validate_and_process_images` (33 lines) - "Adapted from diffusers"
Concerns:
- Are all copied sections properly attributed?
- License compatibility (Apache 2.0 vs diffusers license)?
Recommendation: Audit all copied code for proper attribution headers.
5. Hardcoded Magic Values
```python
# pipeline_flux2.py:52-55
max_aspect_ratio: int = 8,
min_side_length: int = 64,
max_area: int = 1024 * 1024,

# pipeline_flux2.py:304
max_length=2048,  # ❌ Why 2048?

# pipeline_flux2.py:454
scale: int = 10,  # ❌ Why 10?
```
Fix: Document each constant or make configurable.
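One way to act on that fix is to gather the constants into a single documented config object. This is a sketch: the class name and field names are placeholders, and every "rationale?" comment marks an explanation the PR author still needs to supply.

```python
from dataclasses import dataclass

# Hypothetical home for the magic values flagged above; defaults mirror
# the review snippets, and the comments mark missing rationale rather
# than stating verified facts about the model.
@dataclass(frozen=True)
class Flux2ImageLimits:
    max_aspect_ratio: int = 8        # from pipeline_flux2.py:52 (rationale?)
    min_side_length: int = 64        # from pipeline_flux2.py:53 (rationale?)
    max_area: int = 1024 * 1024      # from pipeline_flux2.py:54 (rationale?)
    max_prompt_length: int = 2048    # the max_length at :304 (why 2048?)
    scale: int = 10                  # the scale at :454 (why 10?)
```

Callers can then override a limit explicitly, e.g. `Flux2ImageLimits(max_area=2048 * 2048)`, instead of editing literals scattered across the pipeline.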
6. Inconsistent Error Handling
```python
# pipeline_flux2.py:560
if latents.dtype != self.vae.dtype:
    latents = latents.to(self.vae.dtype)
```
vs.
```python
# pipeline_flux2.py:563
image = self.vae.decode(latents, return_dict=False)[0]  # No dtype check
```
🟢 Minor Suggestions
7. Missing Type Hints
```python
# flux2_transformer.py:714
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
```
vs.
```python
# pipeline_flux2.py:733
def load_weights(self, weights):  # ❌ No type hints
```
8. Docstring Gaps
```python
def _prepare_latent_ids(self, latents):
    """Missing docstring for complex coordinate generation"""
```
🏗️ Architecture Impact Analysis
Registry Integration:
```python
# registry.py - Correct pattern
_DIFFUSION_PIPELINES["Flux2Pipeline"] = ("flux2", "pipeline_flux2", "Flux2Pipeline")
_POST_PROCESS_FUNCS["Flux2Pipeline"] = "get_flux2_post_process_func"
```
TP Sharding:
```python
# Correct use of vLLM parallel layers
QKVParallelLinear, MergedColumnParallelLinear, RowParallelLinear
```
Attention Backend:
- Uses `vllm_omni.diffusion.attention.layer.Attention`
- Properly integrates RoPE
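For readers unfamiliar with the `is_neox_style=False` choice, here is a minimal pure-Python illustration of the interleaved rotary layout it selects: adjacent feature pairs `(x0, x1), (x2, x3), ...` are rotated, rather than the split-half pairing used by the NeoX style. This is a didactic sketch, not the vLLM kernel.

```python
import math

# Interleaved (non-NeoX) rotary embedding on a single head vector:
# each adjacent pair of features is rotated by a position-dependent angle.
def rotate_interleaved(x: list[float], pos: int, theta: float = 10000.0) -> list[float]:
    dim = len(x)
    out = []
    for i in range(dim // 2):
        angle = pos / (theta ** (2 * i / dim))  # standard RoPE frequency schedule
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])
    return out
```

Because each step is a pure rotation, vector norms are preserved, which is what makes RoPE compatible with attention dot products.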
📝 Required Changes
| Priority | Item |
|---|---|
| BLOCKER | Add unit tests (weight loading, TP sharding, preprocessing) |
| BLOCKER | Document/fix to_qkvkv_mlp_proj typo source |
| IMPORTANT | Add memory requirements documentation |
| IMPORTANT | Audit diffusers code attribution |
| SUGGESTED | Document magic constants |
Verdict
| Rating | Notes |
|---|---|
| CHANGES_REQUESTED | Solid implementation, but zero tests is unacceptable |
Rationale:
- Good architecture patterns and performance
- Complete model integration
- But 1871 lines with 0 tests is a maintenance liability
Post-fix: Once tests are added, this is an APPROVE.
Reviewed by: vllm-omni-reviewer MCP tool 🦐
hsliuustc0106
left a comment
Additional Feedback: Memory Profiling Required
Good point from review — this PR should include memory profiling for different TP configurations.
Why This Matters
- TP=1 OOM is unexplained — Users need to know minimum GPU memory requirements
- Capacity planning — Users need to choose the right GPU/TP config
- `gpu_memory_utilization` tuning — Users need guidance on memory fraction settings
Suggested Memory Report Format
## Memory Profiling (FLUX.2-dev, 1024x1024, 50 steps)
| Config | GPU Memory | Peak Memory | Status |
|--------|------------|-------------|--------|
| TP=1, 1x A100 80GB | OOM | - | ❌ Insufficient |
| TP=2, 2x A100 80GB | ~45 GB | ~52 GB | ✅ Works |
| TP=4, 4x A100 80GB | ~25 GB | ~30 GB | ✅ Works |
**Minimum Requirements:**
- TP=2: 2× A100 80GB or equivalent
- TP=4: 4× A100 40GB or equivalent
How to Profile
```shell
# Enable memory tracking
export VLLM_ATTENTION_BACKEND=FLASHINFER
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1 > memory.log &

# Run inference
python examples/offline_inference/text_to_image/text_to_image.py \
    --model black-forest-labs/FLUX.2-dev \
    --tensor-parallel-size 2 \
    --height 1024 --width 1024 \
    --num-inference-steps 50

# Analyze peak (nvidia-smi's csv values are strings like "45000 MiB",
# so extract the number before taking the max)
python -c "
import pandas as pd
df = pd.read_csv('memory.log')
used = df.iloc[:, 0].str.extract(r'(\d+)')[0].astype(int)
print(f'Peak memory: {used.max()} MB')
"
```
Additional Metrics to Include
- Model weights memory (fixed overhead)
- Activation memory (depends on batch size, resolution)
- KV cache memory (if applicable)
- VAE encoder/decoder memory
This information is essential for users to decide if their hardware can run the model.
🦐 vllm-omni-reviewer
@hsliuustc0106 The OOM on a single A800 (80GB) at TP=1 is inevitable because the total size of the FLUX.2-dev weights is approximately 112.6 GB (including ~64.3 GB for the Transformer and ~48.0 GB for the T5-XXL text encoder), which significantly exceeds the 80 GB VRAM capacity. diffusers works fine because it saves VRAM by offloading parts of the model to the CPU.
lishunyang12
left a comment
left a couple of comments
The PR is in good shape overall. Fix those, and I will leave the remaining items to the maintainers.
Overall, it's good:
Fix the DCO check and address @wtomin's comments.
@wtomin Thank you for your review; I will consider supporting quantization in the future. The rest have all been addressed.
@hsliuustc0106 done
Solve the conflicts, please. @nuclearwu
@wtomin done
Does vLLM-Omni support FLUX.2-dev image-to-image in the API server?
@jannikstdl Please take this image-to-image tutorial as a reference, changing the model name to FLUX.2-dev. If you encounter any errors, feel free to raise an issue.
Update: it does already support image editing with FLUX.2-dev in the API server, running vLLM-Omni 0.16.0.
Purpose
Support https://huggingface.co/black-forest-labs/FLUX.2-dev
Test Plan
vLLM-Omni:
Text-to-Image:
Online Serving:
Memory Profile:
Image-to-Image:
Test Result
vLLM-Omni:
Reproduced with 4xA800.
Online Serving:
Memory Profiling (FLUX.2-dev, 1024x1024, 50 steps):
TP=1 OOM Explanation:
The OOM on a single A800 (80GB) at TP=1 is inevitable because the total size of the FLUX.2-dev weights is approximately 112.6 GB (including ~64.3 GB for the Transformer and ~48.0 GB for the T5-XXL text encoder), which significantly exceeds the 80 GB VRAM capacity. However, we can enable CPU offload with `--enable-cpu-offload` at TP=1 to run FLUX.2-dev, and it works well.
Minimum Requirements:
Image-to-Image:
Essential Elements of an Effective PR Description Checklist
- `supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.