
[Model]: add FLUX.2-dev model #1629

Merged
hsliuustc0106 merged 21 commits into vllm-project:main from nuclearwu:flux2
Mar 11, 2026

Conversation

@nuclearwu
Contributor

@nuclearwu nuclearwu commented Mar 3, 2026


Purpose

support https://huggingface.co/black-forest-labs/FLUX.2-dev

Test Plan

vLLM-Omni:
Text-to-Image:

python examples/offline_inference/text_to_image/text_to_image.py \
  --model /workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/ \
  --prompt "a lovely bunny holding a sign that says 'vllm-omni'" \
  --seed 42 \
  --tensor-parallel-size 2 \
  --num-images-per-prompt 1 \
  --num-inference-steps 50 \
  --guidance-scale 4.0 \
  --height 1024 \
  --width 1024 \
  --output outputs/flux2-dev.png

Online Serving:

MODEL_NAME_OR_PATH=/workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/

vllm serve ${MODEL_NAME_OR_PATH} \
   --omni \
   --port 8092 \
   --tensor-parallel-size 1 \
   --vae_use_slicing \
   --vae_use_tiling \
   --enable-cpu-offload

Memory Profile:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1 > memory.log &
NVIDIA_SMI_PID=$!

echo "Memory monitoring started with PID: $NVIDIA_SMI_PID"

# Run inference
python examples/offline_inference/text_to_image/text_to_image.py \
  --model /workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/ \
  --prompt "a lovely bunny holding a sign that says 'vllm-omni'" \
  --seed 42 \
  --tensor-parallel-size 1 \
  --num-images-per-prompt 1 \
  --num-inference-steps 50 \
  --guidance-scale 4.0 \
  --height 1024 \
  --width 1024 \
  --output outputs/flux2-dev.png

kill -9 $NVIDIA_SMI_PID
echo "Memory monitoring stopped"

# Analyze peak
python -c "
import pandas as pd
df = pd.read_csv('memory.log')
df.iloc[:,0] = df.iloc[:,0].str.replace(' MiB', '').astype(float)
print(f'Peak memory: {df.iloc[:,0].max()} MiB')
print(f'Total samples: {len(df)}')
"

Image-to-Image:

python examples/offline_inference/image_to_image/image_edit.py \
    --model /workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/ \
    --image outputs/flux2-dev.png \
    --prompt "replace the bunny in the image with dog." \
    --output outputs/flux2-dev-edit.png \
    --seed 42 \
    --tensor-parallel-size 2 \
    --num-inference-steps 50 \
    --guidance-scale 4.0

Test Result

vLLM-Omni:
Reproduced with 4xA800.

| Model/TP | diffusers | TP=1 | TP=1 & --enable-cpu-offload | TP=2 | TP=4 |
|---|---|---|---|---|---|
| Flux.2-dev | flux2-dev | OOM | flux2-dev | flux2-dev | flux2-dev |
| Time | 104.9411 s/img | OOM | 89.8067 s/img | 39.1087 s/img | 29.0770 s/img |

Online Serving:

curl -s http://localhost:8092/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "a lovely bunny holding a sign that says '\''vllm-omni'\''"}
    ],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 50,
      "guidance_scale": 4.0,
      "seed": 42
    }
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > flux.2-dev.png
flux2-dev
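For clients not written in shell, the base64 data-URL decoding at the end of the pipeline can be done in a few lines of Python. A minimal sketch of just that parsing step (the endpoint and response layout are as in the curl example above; `decode_image_data_url` is an illustrative helper, not part of vLLM-Omni):

```python
import base64

def decode_image_data_url(data_url: str) -> bytes:
    """Decode the base64 payload of a data URL like 'data:image/png;base64,<payload>'.

    Mirrors the `cut -d',' -f2- | base64 -d` step of the shell pipeline above.
    """
    header, _, payload = data_url.partition(",")
    if "base64" not in header:
        raise ValueError(f"not a base64 data URL: {header!r}")
    return base64.b64decode(payload)

# Round-trip a tiny payload the way the shell pipeline does with the real image;
# in practice the data URL comes from .choices[0].message.content[0].image_url.url.
png_bytes = decode_image_data_url(
    "data:image/png;base64," + base64.b64encode(b"\x89PNG\r\n").decode()
)
```

The decoded bytes can then be written straight to `flux.2-dev.png` with `open(..., "wb")`.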

Memory Profiling (FLUX.2-dev, 1024x1024, 50 steps):

| Config | GPU Memory | Peak Memory | Status |
|--------|------------|-------------|--------|
| TP=1, 1× A800 80GB | OOM | - | ❌ Insufficient |
| TP=1, 1× A800 80GB & --enable-cpu-offload | 66696 MiB | 67352 MiB | ✅ Works |
| TP=2, 2× A800 80GB | 81112 MiB | 81182 MiB | ✅ Works |
| TP=4, 4× A800 80GB | 68160 MiB | 81116 MiB | ✅ Works |

TP=1 OOM Explanation:
The OOM on a single A800 (80GB) at TP=1 is inevitable because the total size of the FLUX.2-dev weights is approximately 112.6 GB (including ~64.3 GB for the Transformer and ~48.0 GB for the T5-XXL text encoder), which significantly exceeds the 80 GB VRAM capacity. However, enabling CPU offload with --enable-cpu-offload at TP=1 allows FLUX.2-dev to run, and it works well.

Minimum Requirements:

  • TP=1: 1x A800 80GB & --enable-cpu-offload or equivalent
  • TP=2: 2× A800 80GB or equivalent
  • TP=4: 4× A800 80GB or equivalent
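These requirements follow from simple arithmetic, assuming the ~64.3 GB transformer and ~48.0 GB text encoder shard evenly across TP ranks. This is an approximation only: activations, the VAE, and CUDA overhead add to it (which is why the measured TP=2 peak above is higher than the weights alone):

```python
def per_rank_weight_gb(tp: int, transformer_gb: float = 64.3, encoder_gb: float = 48.0) -> float:
    """Approximate per-GPU weight footprint if both components shard evenly over TP ranks."""
    return (transformer_gb + encoder_gb) / tp

for tp in (1, 2, 4):
    need = per_rank_weight_gb(tp)
    # Weights-only estimate vs. an 80 GB card; real peaks include activations and VAE.
    print(f"TP={tp}: ~{need:.1f} GB of weights per rank -> {'fits' if need < 80 else 'exceeds'} 80 GB")
```

This reproduces the qualitative result in the table: TP=1 cannot hold the weights at all, while TP=2 and TP=4 leave headroom for activations.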

Image-to-Image:

python examples/offline_inference/image_to_image/image_edit.py \
    --model /workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/ \
    --image outputs/flux2-dev.png \
    --prompt "replace the bunny in the image with dog." \
    --output outputs/flux2-dev-edit.png \
    --seed 42 \
    --tensor-parallel-size 2 \
    --num-inference-steps 50 \
    --guidance-scale 4.0
flux2-dev-edit
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@nuclearwu nuclearwu requested a review from hsliuustc0106 as a code owner March 3, 2026 07:03
@nuclearwu nuclearwu closed this Mar 3, 2026
@nuclearwu nuclearwu reopened this Mar 3, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a53145a246


Comment thread vllm_omni/diffusion/models/flux2/pipeline_flux2.py Outdated
Comment thread vllm_omni/diffusion/models/flux2/pipeline_flux2.py
@mergify mergify Bot mentioned this pull request Mar 3, 2026
5 tasks
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
# Conflicts:
#	docs/user_guide/diffusion_acceleration.md
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@nuclearwu
Contributor Author

cc @hsliuustc0106 @ZJY0516 @wtomin

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@hsliuustc0106
Collaborator

why oom but diffusers works fine?

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Architectural Code Review

📋 Summary

| Item | Details |
|---|---|
| PR | [Model]: add FLUX.2-dev model |
| Author | @nuclearwu |
| Scale | +1871 lines (2 new files, 6 modified) |
| Status | Needs Changes |

✅ Strengths

1. Complete Model Implementation

  • Full transformer architecture (764 lines)
  • Complete pipeline (1081 lines)
  • Proper registry integration

2. Performance Benchmarks

| Config | Time | vs diffusers |
|---|---|---|
| TP=2 | 39.1 s/img | 2.7× faster |
| TP=4 | 29.1 s/img | 3.6× faster |

3. Architecture Patterns

  • Proper Mixin composition: CFGParallelMixin, SupportImageInput
  • Fused QKV+MLP projection: Flux2ParallelSelfAttention
  • RoPE integration with RotaryEmbedding(is_neox_style=False)

🔴 Critical Issues

1. Zero Test Coverage

+1871 lines of new code
+0 test files

Risk: No regression protection for:

  • Weight loading (load_weights with stacked_params_mapping)
  • TP sharding logic
  • Image preprocessing pipeline
  • Text encoder integration

Required:

# tests/diffusion/models/flux2/test_flux2_transformer.py
def test_weight_loading_tp2():
    """Verify weights load correctly with TP=2"""
    
def test_rope_position_embedding():
    """Verify RoPE produces correct embeddings for 4D coords"""

def test_packed_module_mapping():
    """Verify to_qkv packing matches HF checkpoint"""

2. Weight Loading Typo

# flux2_transformer.py:716
if "to_qkvkv_mlp_proj" in name:  # ❌ Typo: qkvkv
    name = name.replace("to_qkvkv_mlp_proj", "to_qkv_mlp_proj")

Questions:

  • What HF checkpoint has this typo?
  • Is this a diffusers bug or model-specific?

Fix: Add comment explaining the source, or fix upstream if possible.
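A checkpoint-name remap of this kind is usually a small substitution applied before parameter matching in `load_weights`; a generic, hypothetical sketch (not the actual FLUX.2 implementation):

```python
def remap_checkpoint_name(name: str) -> str:
    """Normalize known checkpoint-name quirks before matching against model params.

    Illustrative only: the real load_weights should carry a comment citing
    where the 'qkvkv' spelling originates (HF checkpoint vs. diffusers export).
    """
    fixes = {"to_qkvkv_mlp_proj": "to_qkv_mlp_proj"}
    for bad, good in fixes.items():
        if bad in name:
            name = name.replace(bad, good)
    return name
```

Keeping the quirks in one mapping makes each entry a natural place for the attribution comment the review asks for.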

3. TP=1 OOM Without Explanation

| Model/TP | TP=1 | TP=2 | TP=4 |
|---|---|---|---|
| Flux.2-dev | OOM | ✅ | ✅ |

Missing:

  • Memory requirement estimate
  • Minimum GPU memory for each TP config
  • gpu_memory_utilization tuning guidance

🟡 Significant Concerns

4. Code Attribution from diffusers

# pipeline_flux2.py:27-33
from diffusers import AutoencoderKLFlux2, FlowMatchEulerDiscreteScheduler
from diffusers.pipelines.flux2.pipeline_flux2 import UPSAMPLING_MAX_IMAGE_SIZE
from diffusers.pipelines.flux2.system_messages import SYSTEM_MESSAGE, ...

Large sections appear copied from diffusers:

  • retrieve_timesteps (67 lines) - copied with "Copied from diffusers"
  • retrieve_latents (12 lines) - copied with "Copied from diffusers"
  • _validate_and_process_images (33 lines) - "Adapted from diffusers"

Concerns:

  • Are all copied sections properly attributed?
  • License compatibility (Apache 2.0 vs diffusers license)?

Recommendation: Audit all copied code for proper attribution headers.

5. Hardcoded Magic Values

# pipeline_flux2.py:52-55
max_aspect_ratio: int = 8,
min_side_length: int = 64,
max_area: int = 1024 * 1024,
# pipeline_flux2.py:304
max_length=2048,  # ❌ Why 2048?
# pipeline_flux2.py:454
scale: int = 10,  # ❌ Why 10?

Fix: Document each constant or make configurable.
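As a sketch of the suggested fix, the values above could become documented module-level constants. The values are copied from the snippet; the names are illustrative, not the real API:

```python
# Named, documented equivalents of the magic values flagged above (hypothetical names).
MAX_ASPECT_RATIO = 8        # reject inputs wider/taller than 8:1
MIN_SIDE_LENGTH = 64        # smallest allowed image edge, in pixels
MAX_AREA = 1024 * 1024      # cap total pixel count at 1 MP
MAX_TEXT_TOKENS = 2048      # text-encoder sequence-length cap (pipeline_flux2.py:304)
GUIDANCE_EMBED_SCALE = 10   # scale flagged at pipeline_flux2.py:454

print(MAX_AREA, MAX_TEXT_TOKENS)
```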

6. Inconsistent Error Handling

# pipeline_flux2.py:560
if latents.dtype != self.vae.dtype:
    latents = latents.to(self.vae.dtype)

vs.

# pipeline_flux2.py:563
image = self.vae.decode(latents, return_dict=False)[0]  # No dtype check

🟢 Minor Suggestions

7. Missing Type Hints

# flux2_transformer.py:714
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:

vs.

# pipeline_flux2.py:733
def load_weights(self, weights):  # ❌ No type hints

8. Docstring Gaps

def _prepare_latent_ids(self, latents):
    """Missing docstring for complex coordinate generation"""

🏗️ Architecture Impact Analysis

Registry Integration:

# registry.py - Correct pattern
_DIFFUSION_PIPELINES["Flux2Pipeline"] = ("flux2", "pipeline_flux2", "Flux2Pipeline")
_POST_PROCESS_FUNCS["Flux2Pipeline"] = "get_flux2_post_process_func"

TP Sharding:

# Correct use of vLLM parallel layers
QKVParallelLinear, MergedColumnParallelLinear, RowParallelLinear

Attention Backend:

  • Uses vllm_omni.diffusion.attention.layer.Attention
  • Properly integrates RoPE

📝 Required Changes

| Priority | Item |
|---|---|
| BLOCKER | Add unit tests (weight loading, TP sharding, preprocessing) |
| BLOCKER | Document/fix `to_qkvkv_mlp_proj` typo source |
| IMPORTANT | Add memory requirements documentation |
| IMPORTANT | Audit diffusers code attribution |
| SUGGESTED | Document magic constants |

Verdict

| Rating | Notes |
|---|---|
| CHANGES_REQUESTED ⚠️ | Solid implementation, but zero tests is unacceptable |

Rationale:

  • Good architecture patterns and performance
  • Complete model integration
  • But 1871 lines with 0 tests is a maintenance liability

Post-fix: Once tests are added, this is an APPROVE.


Reviewed by: vllm-omni-reviewer MCP tool 🦐

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Additional Feedback: Memory Profiling Required

Good point from review — this PR should include memory profiling for different TP configurations.

Why This Matters

  1. TP=1 OOM is unexplained — Users need to know minimum GPU memory requirements
  2. Capacity planning — Users need to choose the right GPU/TP config
  3. gpu_memory_utilization tuning — Users need guidance on memory fraction settings

Suggested Memory Report Format

## Memory Profiling (FLUX.2-dev, 1024x1024, 50 steps)

| Config | GPU Memory | Peak Memory | Status |
|--------|------------|-------------|--------|
| TP=1, 1x A100 80GB | OOM | - | ❌ Insufficient |
| TP=2, 2x A100 80GB | ~45 GB | ~52 GB | ✅ Works |
| TP=4, 4x A100 80GB | ~25 GB | ~30 GB | ✅ Works |

**Minimum Requirements:**
- TP=2: 2× A100 80GB or equivalent
- TP=4: 4× A100 40GB or equivalent

How to Profile

# Enable memory tracking
export VLLM_ATTENTION_BACKEND=FLASHINFER
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1 > memory.log &

# Run inference
python examples/offline_inference/text_to_image/text_to_image.py \
  --model black-forest-labs/FLUX.2-dev \
  --tensor-parallel-size 2 \
  --height 1024 --width 1024 \
  --num-inference-steps 50

# Analyze peak
python -c "
import pandas as pd
df = pd.read_csv('memory.log')
used = df.iloc[:,0].str.replace(' MiB', '').astype(float)
print(f'Peak memory: {used.max()} MiB')
"

Additional Metrics to Include

  • Model weights memory (fixed overhead)
  • Activation memory (depends on batch size, resolution)
  • KV cache memory (if applicable)
  • VAE encoder/decoder memory

This information is essential for users to decide if their hardware can run the model.
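If the weights-vs-activation split has to be approximated from the same `nvidia-smi` log, one rough heuristic is to treat the first sample after model load as the fixed weight overhead and the maximum as the peak. A hedged sketch (`split_memory_profile` is illustrative; it assumes monitoring starts after weights are loaded):

```python
def split_memory_profile(rows):
    """Split nvidia-smi samples ('NNN MiB, MMM MiB') into (baseline, peak) MiB.

    Heuristic only: first sample ~ fixed weight overhead, peak - baseline
    ~ activation/VAE transients. Skips non-numeric rows such as the CSV header.
    """
    used = []
    for r in rows:
        head = r.split(",")[0].strip()
        if head.endswith("MiB"):          # data rows look like '66696 MiB'
            used.append(float(head[:-3]))
    return used[0], max(used)

samples = [
    "memory.used [MiB], memory.total [MiB]",  # header row is skipped
    "66000 MiB, 81920 MiB",
    "66696 MiB, 81920 MiB",
    "67352 MiB, 81920 MiB",
]
base, peak = split_memory_profile(samples)
print(f"weights ~{base:.0f} MiB, peak {peak:.0f} MiB, transient ~{peak - base:.0f} MiB")
```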


🦐 vllm-omni-reviewer

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@nuclearwu
Contributor Author

why oom but diffusers works fine?

@hsliuustc0106 The OOM on a single A800 (80GB) at TP=1 is inevitable because the total size of the FLUX.2-dev weights is approximately 112.6 GB (including ~64.3 GB for the Transformer and ~48.0 GB for the T5-XXL text encoder), which significantly exceeds the 80 GB VRAM capacity. diffusers works fine, however, because it saves VRAM by offloading the model to CPU.

import torch
import time
from modelscope import Flux2Pipeline

device = "cuda"
dtype = torch.bfloat16

pipe = Flux2Pipeline.from_pretrained("/workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/", torch_dtype=dtype)
pipe.enable_model_cpu_offload()  # save some VRAM by offloading the model to CPU

prompt = "a lovely bunny holding a sign that says 'vllm-omni'"
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=4.0,
    num_inference_steps=3,
    max_sequence_length=512,
    generator=torch.Generator(device=device).manual_seed(42)
).images[0]
generation_start = time.perf_counter()
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=4.0,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator(device=device).manual_seed(42)
).images[0]
generation_end = time.perf_counter()
generation_time = generation_end - generation_start
print(f"Total generation time: {generation_time:.4f} seconds ({generation_time * 1000:.2f} ms)")
image.save("outputs/flux2-dev-diffusers.png")

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@nuclearwu nuclearwu requested a review from hsliuustc0106 March 4, 2026 09:03
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Collaborator

@lishunyang12 lishunyang12 left a comment


nit inline

Comment thread vllm_omni/diffusion/models/flux2/pipeline_flux2.py
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@nuclearwu nuclearwu requested a review from lishunyang12 March 5, 2026 01:30
Collaborator

@lishunyang12 lishunyang12 left a comment


left a couple of comments

Comment thread vllm_omni/diffusion/models/flux2/flux2_transformer.py
Comment thread vllm_omni/diffusion/models/flux2/pipeline_flux2.py
@lishunyang12
Collaborator

The PR is in good shape overall. Fix those and I will leave the remaining items to the maintainers.

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@nuclearwu nuclearwu mentioned this pull request Mar 10, 2026
63 tasks
@nuclearwu
Contributor Author

cc @hsliuustc0106

Comment thread tests/diffusion/models/flux2/test_flux2_transformer_tp.py
@wtomin
Copy link
Copy Markdown
Collaborator

wtomin commented Mar 10, 2026

Overall, it's good:

  • Comprehensive PR body — Best-in-class documentation with memory profiling, benchmarks, and clear minimum requirements
  • Clean implementation — No diffusers Mixin, pure vLLM-Omni abstractions
  • Tensor Parallel support — Properly implemented with QKVParallelLinear
  • CPU Offload support — Enables single-GPU deployment (80GB+)
  • Unit test coverage — focused TP unit tests

⚠️ Minor Suggestions:

  • Would be better if you can edit examples/offline_inference/text_to_image/README.md and add a CLI inference example for the Flux.2-dev model. In particular, mention the memory constraint (>80GB); CPU offloading and other memory-optimization methods are therefore highly recommended.
  • Do you consider supporting other memory optimization methods (such as quantization) for FLUX.2-dev model?
  • Would be great if you can test its online serving functionality. Just to double check it's working.

@wtomin wtomin added the new model add new model label Mar 10, 2026
Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@hsliuustc0106
Collaborator

fix dco and solve @wtomin's comments

@nuclearwu
Contributor Author

Overall, it's good:

  • Comprehensive PR body — Best-in-class documentation with memory profiling, benchmarks, and clear minimum requirements
  • Clean implementation — No diffusers Mixin,pure vLLM-Omni abstractions
  • Tensor Parallel support — Properly implemented with QKVParallelLinear
  • CPU Offload support — Enables single-GPU deployment (80GB+)
  • Unit test coverage — focused TP unit tests

⚠️ Minor Suggestions:

  • Would be better if you can edit examples/offline_inference/text_to_image/README.md and add a CLI inference example of Flux.2-dev model. Especially mention about the memory constraint (>80GB), therefore cpu-offloading and other memory optimization methods are highly recommended.
  • Do you consider supporting other memory optimization methods (such as quantization) for FLUX.2-dev model?
  • Would be great if you can test its online serving functionality. Just to double check it's working.


@wtomin Thank you for your review, I will consider supporting quantization in the future. The rest have all been revised.

@nuclearwu
Contributor Author

fix dco and solve @wtomin's comments

@hsliuustc0106 done

@nuclearwu nuclearwu requested a review from wtomin March 11, 2026 01:52
Collaborator

@wtomin wtomin left a comment


LGTM.

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Mar 11, 2026
# Conflicts:
#	docs/user_guide/diffusion/parallelism_acceleration.md
@wtomin wtomin requested review from ZJY0516 and removed request for lishunyang12 March 11, 2026 02:54
@wtomin
Collaborator

wtomin commented Mar 11, 2026

Solve the conflicts please. @nuclearwu

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@nuclearwu
Contributor Author

Solve the conflicts please. @nuclearwu

@wtomin done

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
@hsliuustc0106 hsliuustc0106 merged commit 4d89eba into vllm-project:main Mar 11, 2026
6 of 7 checks passed
@wtomin wtomin mentioned this pull request Mar 12, 2026
1 task
@jannikstdl

Does VLLM Omni Support Flux2-dev Image to Image in the API Server?

@wtomin
Collaborator

wtomin commented Mar 13, 2026

@jannikstdl Please take this image-to-image tutorial as a reference, changing the model name to Flux2-dev. If you encounter any errors, feel free to raise an issue.

@jannikstdl

Update: it already supports image editing with FLUX.2-dev in the API server, running vLLM-Omni 0.16.0.
Thanks!

