
[Feature]: support Flux.2-dev CFG-Parallel#2010

Merged
hsliuustc0106 merged 4 commits into vllm-project:main from nuclearwu:cfg-parallel
Apr 10, 2026

Conversation

@nuclearwu
Contributor

@nuclearwu nuclearwu commented Mar 19, 2026

Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com


Purpose

support Flux.2-dev CFG-Parallel
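For context (a minimal sketch, not this PR's implementation): classifier-free guidance runs the transformer twice per denoising step — once on the conditional prompt and once on the negative/unconditional prompt — and blends the two noise predictions. CFG-Parallel places those two forward passes on separate device groups so they run concurrently; only the cheap blend below needs both results on one rank. The function name and toy shapes here are hypothetical:

```python
import numpy as np

def cfg_combine(noise_cond, noise_uncond, guidance_scale):
    """Classifier-free guidance blend of the two noise predictions.

    With --cfg-parallel-size 2, the expensive forward passes that
    produce noise_cond and noise_uncond run on different device
    groups; only this elementwise combine needs both tensors.
    """
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy latents standing in for the two transformer outputs (hypothetical shapes).
cond = np.full((1, 4), 2.0)
uncond = np.full((1, 4), 1.0)
out = cfg_combine(cond, uncond, guidance_scale=4.0)
print(out)  # → [[5. 5. 5. 5.]]
```

The blend is purely elementwise, which is why splitting the two forwards across ranks needs only a single gather of the unconditional prediction per step.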

Test Plan

Reference #851 image generation:
The bash script to run all t2i tasks

#!/bin/bash

# Script to run text-to-image inference for all supported models
# Comparing with and without CFG parallel
# Logs are saved to individual log files for each experiment
# If one task fails, other tasks will continue to run

# Propagate python's exit status through the `| tee` pipelines below;
# without this, tee's status would mask failures and every task would
# be counted as successful.
set -o pipefail

PROMPT="a lovely bunny holding a sign that says 'vllm-omni'"
NEGATIVE_PROMPT="ugly, unclear, blurry, gray"

# Arrays to track success and failure
declare -a SUCCESS_TASKS
declare -a FAILED_TASKS

# Define models and their parameters
# Format: "model_name|model_path|scale_arg|scale_value"
declare -a MODELS=(
  "Flux.2-dev|/workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/|guidance-scale|4.0"
  # "Flux.2-klein-4B|/workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-klein-4B/|guidance-scale|4.0"
)

# Eager mode configurations
declare -a EAGER_CONFIGS=(
  "no_eager|"
  "with_eager|--enforce-eager"
)

# CFG parallel configurations
declare -a CFG_CONFIGS=(
  "no_cfg_parallel|"
  "with_cfg_parallel|--cfg-parallel-size 2"
)

echo "=========================================="
echo "Starting text-to-image inference tests"
echo "Testing combinations of eager mode and CFG parallel"
echo "4 test cases per model:"
echo "  1. no_eager + no_cfg_parallel"
echo "  2. no_eager + with_cfg_parallel"
echo "  3. with_eager + no_cfg_parallel"
echo "  4. with_eager + with_cfg_parallel"
echo "Each model's outputs saved in its own directory"
echo "Note: If one task fails, others will continue"
echo "=========================================="
echo ""

TASK_NUM=0
TOTAL_TASKS=$((${#MODELS[@]} * ${#EAGER_CONFIGS[@]} * ${#CFG_CONFIGS[@]}))

# Run experiments for each model and configuration
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name model_path scale_arg scale_value <<< "$model_info"
  
  # Create directory for this model
  model_dir="${model_name// /_}"
  mkdir -p "$model_dir"
  
  for eager_info in "${EAGER_CONFIGS[@]}"; do
    IFS='|' read -r eager_label eager_args <<< "$eager_info"
    
    for cfg_info in "${CFG_CONFIGS[@]}"; do
      IFS='|' read -r cfg_label cfg_args <<< "$cfg_info"
      TASK_NUM=$((TASK_NUM + 1))
      
      # Generate filenames inside model directory
      base_name="${model_name,,}"
      base_name="${base_name// /_}"
      output_file="$model_dir/${base_name}_output_${eager_label}_${cfg_label}.png"
      log_file="$model_dir/${base_name}_${eager_label}_${cfg_label}.log"
      task_label="$model_name ($eager_label + $cfg_label)"
      
      echo "=========================================="
      echo "$TASK_NUM/$TOTAL_TASKS: Running $task_label..."
      echo "=========================================="
      
      # Build and execute command
      if python examples/offline_inference/text_to_image/text_to_image.py \
        --model "$model_path" \
        --${scale_arg} "$scale_value" \
        --prompt "$PROMPT" \
        --tensor-parallel-size 4 \
        --negative-prompt "$NEGATIVE_PROMPT" \
        --output "$output_file" \
        $eager_args \
        $cfg_args \
        2>&1 | tee "$log_file"; then
        echo "✓ $task_label completed."
        SUCCESS_TASKS+=("$task_label")
      else
        echo "✗ $task_label FAILED."
        FAILED_TASKS+=("$task_label")
      fi
      echo ""
    done
  done
done

echo "=========================================="
echo "All tasks completed!"
echo "=========================================="
echo "Summary: ${#SUCCESS_TASKS[@]}/$TOTAL_TASKS successful, ${#FAILED_TASKS[@]}/$TOTAL_TASKS failed"
echo ""

if [ ${#SUCCESS_TASKS[@]} -gt 0 ]; then
  echo "✓ Successful tasks:"
  for task in "${SUCCESS_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
fi

if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  echo "✗ Failed tasks:"
  for task in "${FAILED_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
  echo "Check model directories for error logs."
  echo ""
fi

echo "Output directories:"
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name _ _ _ <<< "$model_info"
  model_dir="${model_name// /_}"
  echo "  - $model_dir/ (images and logs for $model_name)"
done

# Exit with error code if any tasks failed
if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  exit 1
fi

Memory Profile:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1 > memory.log &
NVIDIA_SMI_PID=$!

echo "Memory monitoring started with PID: $NVIDIA_SMI_PID"

python examples/offline_inference/text_to_image/text_to_image.py \
  --model /workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/ \
  --guidance-scale 4.0 \
  --prompt "a lovely bunny holding a sign that says 'vllm-omni'" \
  --tensor-parallel-size 4 \
  --negative-prompt "ugly, unclear, blurry, gray" \
  --output "Flux_2_dev/flux_2_dev_output_no_eager_no_cfg_parallel.png"

kill $NVIDIA_SMI_PID  # SIGTERM is enough to stop the nvidia-smi polling loop
echo "Memory monitoring stopped"

# Analyze peak usage (memory.log columns: 'memory.used [MiB]', ' memory.total [MiB]')
python -c "
import pandas as pd
df = pd.read_csv('memory.log')
df.iloc[:, 0] = df.iloc[:, 0].str.replace(' MiB', '').astype(float)
print(f'Peak memory: {df.iloc[:, 0].max()} MiB')
print(f'Total samples: {len(df)}')
"

Test Result

Reproduced with 4xA800.
Text-To-Image:

| model | tp | cfg_parallel_size | time (torch.compile) | time (eager) | generated image |
|---|---|---|---|---|---|
| Flux.2-dev | 4 | 1 | 57.7082 | 76.0480 | flux 2-dev_output_no_eager_with_cfg_parallel |
| Flux.2-dev | 4 | 2 | 29.2653 | 38.3901 | flux 2-dev_output_no_eager_with_cfg_parallel |
| Flux.2-dev (#2063) | 4 | 2 | 28.9744 | 38.0465 | flux 2-dev_output_no_eager_with_cfg_parallel |

Memory Profiling (FLUX.2-dev, 1024x1024, 50 steps):

| Config | GPU Memory | Peak Memory | Status |
|---|---|---|---|
| TP=4, 4x A800 80GB & torch.compile & cfg-parallel=1 | 68563MiB | 69078MiB | ✅ Works |
| TP=4, 8x A800 80GB & torch.compile & cfg-parallel=2 | 68565MiB | 69296MiB | ✅ Works |
| TP=4, 4x A800 80GB & eager & cfg-parallel=1 | 68563MiB | 69002MiB | ✅ Works |
| TP=4, 8x A800 80GB & eager & cfg-parallel=2 | 68565MiB | 69372MiB | ✅ Works |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@nuclearwu nuclearwu changed the title [feature]: support Flux.2-dev CFG-Parallel [Feature]: support Flux.2-dev CFG-Parallel Mar 19, 2026
@nuclearwu
Contributor Author

cc @wtomin @hsliuustc0106

@hsliuustc0106
Collaborator

does it apply to all flux.2 family models? what's the recommended parallel strategy if we have 2/4 devices?


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 395e644c12


@wtomin
Collaborator

wtomin commented Mar 20, 2026

Does the CFG-parallel speed problem still persist, as you mentioned in the wechat group?

Can you show me the speed comparison of cfg-sequential plan before and after this PR? @nuclearwu

Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments. Main concern is that do_true_cfg activates unconditionally with the default guidance_scale=4.0, which means every request now pays for 2x transformer forward passes even when the user doesn't intend to use CFG.
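The gating this review asks for could look like the following sketch (hypothetical names, not the PR's actual code): true CFG should engage only when a negative prompt is actually supplied and the scale is above 1.0, so the default `guidance_scale=4.0` by itself never triggers the second transformer forward.

```python
def should_run_true_cfg(negative_prompt, guidance_scale: float) -> bool:
    """Hypothetical gate: run the second (unconditional) transformer
    forward only when the request really uses classifier-free guidance."""
    return negative_prompt is not None and guidance_scale > 1.0

# Default guidance_scale=4.0 with no negative prompt: no extra forward pass.
print(should_run_true_cfg(None, 4.0))        # → False
print(should_run_true_cfg("blurry", 4.0))    # → True
print(should_run_true_cfg("blurry", 1.0))    # → False: scale 1.0 makes CFG a no-op
```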

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Summary

BLOCKER scan:

Category Result
Correctness PASS
Reliability/Safety PASS
Breaking Changes PASS
Test Coverage PASS - Comprehensive test results in PR body with generated images
Documentation PASS - Tables updated for CFG parallel
Security PASS

OVERALL: 1 ISSUE (merge conflicts, need to be resolved by author)

VERDICT: COMMENT

Issues

  1. [Gate Failure] Merge conflicts - PR cannot be merged. Please rebase on main and resolve conflicts.

Non-blocking observations

  • The PR provides comprehensive testing evidence with generated images across multiple configurations (eager mode, CFG parallel combinations)
  • Code follows the established patterns for CFGParallelMixin (similar to other models like FLUX.1-dev)
  • Documentation tables correctly updated to reflect CFG parallel support
  • No MRO issues: CFGParallelMixin uses metaclass=ABCMeta without __init__

@nuclearwu
Contributor Author

Left a few comments. Main concern is that do_true_cfg activates unconditionally with the default guidance_scale=4.0, which means every request now pays for 2x transformer forward passes even when the user doesn't intend to use CFG.

@lishunyang12 Thank you for your review. I have made the necessary revisions based on the feedback provided.

@nuclearwu
Contributor Author

Does the CFG-parallel speed problem still persist, as you mentioned in the wechat group?

Can you show me the speed comparison of cfg-sequential plan before and after this PR? @nuclearwu

@wtomin You can compare this to #1629, which does not use CFG-Parallel. The generation time for the same prompt is inconsistent; I suspect this is due to the negative prompt.

@nuclearwu
Contributor Author

does it apply to all flux.2 family models? what's the recommended parallel strategy if we have 2/4 devices?

@hsliuustc0106 Machine resources are currently insufficient; only a single machine with multiple cards is available for testing.

@wtomin
Collaborator

wtomin commented Mar 31, 2026

Recently, a refactoring PR related CFG-Parallel got merged #2063. Please double check if this PR affects yours.

The online serving test script test_flux_2_dev_expansion.py does not cover the test case of cfg-parallel. Please add it. For the L4 test, please refer to #1832.

@nuclearwu
Contributor Author

Recently, a refactoring PR related CFG-Parallel got merged #2063. Please double check if this PR affects yours.

The online serving test script test_flux_2_dev_expansion.py does not cover the test case of cfg-parallel. Please add it. For the L4 test, please refer to #1832.

@wtomin I pulled the main branch and tested again; the performance of cfg-parallel has declined, as shown in the table above.

Collaborator

@lishunyang12 lishunyang12 left a comment


Core CFG logic looks correct

@wtomin
Collaborator

wtomin commented Apr 2, 2026

I tried to test the performance before and after #2063 PR got merged, with tests/dfx/perf/scripts/run_diffusion_benchmark.py. The model I used for testing is Qwen-Image. Here is the result:

| commit id | torch.compile | cfg-parallel-size | resolution | num-inference-steps | latency (mean) |
|---|---|---|---|---|---|
| ebc9a8d (latest main branch, after #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.776495 |
| ebc9a8d (latest main branch, after #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.493865 |
| 1ca9429 (before #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.81318 |
| 1ca9429 (before #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.69232 |

This table indicates that, at least for Qwen-Image, #2063 reduces the e2e latency by a small margin, instead of slowing it down. This is in line with the claims in #2063.

Regarding the problems you identified here, I have some suggestions:

  1. Could you evaluate the performance using tests/dfx/perf/scripts/run_diffusion_benchmark.py? All you need to do is to create a config yaml file, and it automatically computes the e2e latency and per-stage durations over 10 runs (diffuse, vae decode, text encoding, etc.)
  2. Could you verify if anything is special about Flux.2-dev CFG-Parallel, making it sensitive to [Diffusion] Refactor CFG parallel for extensibility and performance #2063?

Also cc @TKONIY for some suggestions.

@TKONIY
Contributor

TKONIY commented Apr 2, 2026

I tried to test the performance before and after #2063 PR got merged, with tests/dfx/perf/scripts/run_diffusion_benchmark.py. The model I used for testing is Qwen-Image. Here is the result:

| commit id | torch.compile | cfg-parallel-size | resolution | num-inference-steps | latency (mean) |
|---|---|---|---|---|---|
| ebc9a8d (latest main branch, after #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.776495 |
| ebc9a8d (latest main branch, after #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.493865 |
| 1ca9429 (before #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.81318 |
| 1ca9429 (before #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.69232 |

This table indicates that, at least for Qwen-Image, #2063 reduces the e2e latency by a small margin, instead of slowing it down. This is in line with the claims in #2063.

Regarding the problems you identified here, I have some suggestions:

  1. Could you evaluate the performance using tests/dfx/perf/scripts/run_diffusion_benchmark.py? All you need to do is to create a config yaml file, and it automatically computes the e2e latency and per-stage durations over 10 runs (diffuse, vae decode, text encoding, etc.)
  2. Could you verify if anything is special about Flux.2-dev CFG-Parallel, making it sensitive to [Diffusion] Refactor CFG parallel for extensibility and performance #2063?

Also cc @TKONIY for some suggestions.

I will check it this week. It looks like a compatibility problem between #2063 and torch.compile.

@nuclearwu
Contributor Author

Core CFG logic looks correct

@lishunyang12 Done, please review again.

@nuclearwu
Contributor Author

I tried to test the performance before and after #2063 PR got merged, with tests/dfx/perf/scripts/run_diffusion_benchmark.py. The model I used for testing is Qwen-Image. Here is the result:

| commit id | torch.compile | cfg-parallel-size | resolution | num-inference-steps | latency (mean) |
|---|---|---|---|---|---|
| ebc9a8d (latest main branch, after #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.776495 |
| ebc9a8d (latest main branch, after #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.493865 |
| 1ca9429 (before #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.81318 |
| 1ca9429 (before #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.69232 |

This table indicates that, at least for Qwen-Image, #2063 reduces the e2e latency by a small margin, instead of slowing it down. This is in line with the claims in #2063.

Regarding the problems you identified here, I have some suggestions:

  1. Could you evaluate the performance using tests/dfx/perf/scripts/run_diffusion_benchmark.py? All you need to do is to create a config yaml file, and it automatically computes the e2e latency and per-stage durations over 10 runs (diffuse, vae decode, text encoding, etc.)
  2. Could you verify if anything is special about Flux.2-dev CFG-Parallel, making it sensitive to [Diffusion] Refactor CFG parallel for extensibility and performance #2063?

Also cc @TKONIY for some suggestions.

@wtomin Sorry, I tried again. #2063 reduces the e2e latency by a small margin; I have updated the results in the table above.

@wtomin
Collaborator

wtomin commented Apr 2, 2026

Missing e2e test for CFG parallelism. Please add a test that covers --cfg-parallel-size=2. For the L4 test, please refer to #1832.

Feature support table not updated for Flux.2-dev CFG parallel in docs/user_guide/diffusion_features.md. BTW, in examples/offline_inference/text_to_image/README.md, there is a deprecated hyperlink ../../../docs/user_guide/diffusion_acceleration.md#using-cfg-parallel. Can you help to update it to the correct path to docs/user_guide/diffusion/parallelism/cfg_parallel.md?

Can you also report the peak VRAM usage in your PR body?

@nuclearwu
Contributor Author

Missing e2e test for CFG parallelism. Please add a test that covers --cfg-parallel-size=2. For L4 test, please refer to #1832 .

Feature support table not updated for Flux.2-dev CFG parallel in docs/user_guide/diffusion_features.md. BTW, in examples/offline_inference/text_to_image/README.md, there is a deprecated hyperlink ../../../docs/user_guide/diffusion_acceleration.md#using-cfg-parallel. Can you help to update it to the correct path to docs/user_guide/diffusion/parallelism/cfg_parallel.md?

Can you also report the peak VRAM usage in your PR body?

@wtomin Done. The peak VRAM usage is shown in the tables above, PTAL.

@nuclearwu
Contributor Author

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Collaborator

@wtomin wtomin left a comment


LGTM.

@wtomin wtomin added the ready label to trigger buildkite CI label Apr 8, 2026
@nuclearwu
Contributor Author

cc @hsliuustc0106

@hsliuustc0106 hsliuustc0106 merged commit f3f2dc5 into vllm-project:main Apr 10, 2026
8 checks passed
Sy0307 pushed a commit to Sy0307/vllm-omni that referenced this pull request Apr 10, 2026
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026