
[Feature]: support Flux.2-dev CFG-Parallel#2010

Merged
hsliuustc0106 merged 4 commits into vllm-project:main from nuclearwu:cfg-parallel
Apr 10, 2026

Conversation

@nuclearwu
Contributor

@nuclearwu nuclearwu commented Mar 19, 2026

Signed-off-by: wuzhongjian wuzhongjian_yewu@cmss.chinamobile.com


Purpose

support Flux.2-dev CFG-Parallel
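For context (a minimal sketch, not this PR's implementation): classifier-free guidance runs the transformer twice per denoising step — once on the conditional prompt and once on the negative/unconditional prompt — and blends the two noise predictions. CFG-Parallel places those two forward passes on separate device groups so they run concurrently; only the cheap blend below needs both results on one rank. The function name and toy shapes here are hypothetical:

```python
import numpy as np

def cfg_combine(noise_cond, noise_uncond, guidance_scale):
    """Classifier-free guidance blend of the two noise predictions.

    With --cfg-parallel-size 2, the expensive forward passes that
    produce noise_cond and noise_uncond run on different device
    groups; only this elementwise combine needs both tensors.
    """
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy latents standing in for the two transformer outputs (hypothetical shapes).
cond = np.full((1, 4), 2.0)
uncond = np.full((1, 4), 1.0)
out = cfg_combine(cond, uncond, guidance_scale=4.0)
print(out)  # → [[5. 5. 5. 5.]]
```

The blend is purely elementwise, which is why splitting the two forwards across ranks needs only a single gather of the unconditional prediction per step.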

Test Plan

Reference #851 image generation:
The bash script to run all t2i tasks

#!/bin/bash

# Script to run text-to-image inference for all supported models
# Comparing with and without CFG parallel
# Logs are saved to individual log files for each experiment
# If one task fails, other tasks will continue to run

# Propagate python's exit status through the `| tee` pipelines below;
# without this, tee's status would mask failures and every task would
# be counted as successful.
set -o pipefail

PROMPT="a lovely bunny holding a sign that says 'vllm-omni'"
NEGATIVE_PROMPT="ugly, unclear, blurry, gray"

# Arrays to track success and failure
declare -a SUCCESS_TASKS
declare -a FAILED_TASKS

# Define models and their parameters
# Format: "model_name|model_path|scale_arg|scale_value"
declare -a MODELS=(
  "Flux.2-dev|/workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/|guidance-scale|4.0"
  # "Flux.2-klein-4B|/workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-klein-4B/|guidance-scale|4.0"
)

# Eager mode configurations
declare -a EAGER_CONFIGS=(
  "no_eager|"
  "with_eager|--enforce-eager"
)

# CFG parallel configurations
declare -a CFG_CONFIGS=(
  "no_cfg_parallel|"
  "with_cfg_parallel|--cfg-parallel-size 2"
)

echo "=========================================="
echo "Starting text-to-image inference tests"
echo "Testing combinations of eager mode and CFG parallel"
echo "4 test cases per model:"
echo "  1. no_eager + no_cfg_parallel"
echo "  2. no_eager + with_cfg_parallel"
echo "  3. with_eager + no_cfg_parallel"
echo "  4. with_eager + with_cfg_parallel"
echo "Each model's outputs saved in its own directory"
echo "Note: If one task fails, others will continue"
echo "=========================================="
echo ""

TASK_NUM=0
TOTAL_TASKS=$((${#MODELS[@]} * ${#EAGER_CONFIGS[@]} * ${#CFG_CONFIGS[@]}))

# Run experiments for each model and configuration
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name model_path scale_arg scale_value <<< "$model_info"
  
  # Create directory for this model
  model_dir="${model_name// /_}"
  mkdir -p "$model_dir"
  
  for eager_info in "${EAGER_CONFIGS[@]}"; do
    IFS='|' read -r eager_label eager_args <<< "$eager_info"
    
    for cfg_info in "${CFG_CONFIGS[@]}"; do
      IFS='|' read -r cfg_label cfg_args <<< "$cfg_info"
      TASK_NUM=$((TASK_NUM + 1))
      
      # Generate filenames inside model directory
      base_name="${model_name,,}"
      base_name="${base_name// /_}"
      output_file="$model_dir/${base_name}_output_${eager_label}_${cfg_label}.png"
      log_file="$model_dir/${base_name}_${eager_label}_${cfg_label}.log"
      task_label="$model_name ($eager_label + $cfg_label)"
      
      echo "=========================================="
      echo "$TASK_NUM/$TOTAL_TASKS: Running $task_label..."
      echo "=========================================="
      
      # Build and execute command
      if python examples/offline_inference/text_to_image/text_to_image.py \
        --model "$model_path" \
        --${scale_arg} "$scale_value" \
        --prompt "$PROMPT" \
        --tensor-parallel-size 4 \
        --negative-prompt "$NEGATIVE_PROMPT" \
        --output "$output_file" \
        $eager_args \
        $cfg_args \
        2>&1 | tee "$log_file"; then
        echo "✓ $task_label completed."
        SUCCESS_TASKS+=("$task_label")
      else
        echo "✗ $task_label FAILED."
        FAILED_TASKS+=("$task_label")
      fi
      echo ""
    done
  done
done

echo "=========================================="
echo "All tasks completed!"
echo "=========================================="
echo "Summary: ${#SUCCESS_TASKS[@]}/$TOTAL_TASKS successful, ${#FAILED_TASKS[@]}/$TOTAL_TASKS failed"
echo ""

if [ ${#SUCCESS_TASKS[@]} -gt 0 ]; then
  echo "✓ Successful tasks:"
  for task in "${SUCCESS_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
fi

if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  echo "✗ Failed tasks:"
  for task in "${FAILED_TASKS[@]}"; do
    echo "  - $task"
  done
  echo ""
  echo "Check model directories for error logs."
  echo ""
fi

echo "Output directories:"
for model_info in "${MODELS[@]}"; do
  IFS='|' read -r model_name _ _ _ <<< "$model_info"
  model_dir="${model_name// /_}"
  echo "  - $model_dir/ (images and logs for $model_name)"
done

# Exit with error code if any tasks failed
if [ ${#FAILED_TASKS[@]} -gt 0 ]; then
  exit 1
fi

Memory Profile:

nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1 > memory.log &
NVIDIA_SMI_PID=$!

echo "Memory monitoring started with PID: $NVIDIA_SMI_PID"

python examples/offline_inference/text_to_image/text_to_image.py \
  --model /workspace/cache/ymttest/johnjan/models/black-forest-labs/FLUX___2-dev/ \
  --guidance-scale 4.0 \
  --prompt "a lovely bunny holding a sign that says 'vllm-omni'" \
  --tensor-parallel-size 4 \
  --negative-prompt "ugly, unclear, blurry, gray" \
  --output "Flux_2_dev/flux_2_dev_output_no_eager_no_cfg_parallel.png"

kill $NVIDIA_SMI_PID  # SIGTERM is enough to stop the nvidia-smi polling loop
echo "Memory monitoring stopped"

# Analyze peak usage (memory.log columns: 'memory.used [MiB]', ' memory.total [MiB]')
python -c "
import pandas as pd
df = pd.read_csv('memory.log')
df.iloc[:, 0] = df.iloc[:, 0].str.replace(' MiB', '').astype(float)
print(f'Peak memory: {df.iloc[:, 0].max()} MiB')
print(f'Total samples: {len(df)}')
"

Test Result

Reproduced with 4xA800.
Text-To-Image:

| model | tp | cfg_parallel_size | time (torch.compile) | time (eager) | generated image |
|---|---|---|---|---|---|
| Flux.2-dev | 4 | 1 | 57.7082 | 76.0480 | flux 2-dev_output_no_eager_with_cfg_parallel |
| Flux.2-dev | 4 | 2 | 29.2653 | 38.3901 | flux 2-dev_output_no_eager_with_cfg_parallel |
| Flux.2-dev (#2063) | 4 | 2 | 28.9744 | 38.0465 | flux 2-dev_output_no_eager_with_cfg_parallel |

Memory Profiling (FLUX.2-dev, 1024x1024, 50 steps):

| Config | GPU Memory | Peak Memory | Status |
|---|---|---|---|
| TP=4, 4x A800 80GB & torch.compile & cfg-parallel=1 | 68563MiB | 69078MiB | ✅ Works |
| TP=4, 8x A800 80GB & torch.compile & cfg-parallel=2 | 68565MiB | 69296MiB | ✅ Works |
| TP=4, 4x A800 80GB & eager & cfg-parallel=1 | 68563MiB | 69002MiB | ✅ Works |
| TP=4, 8x A800 80GB & eager & cfg-parallel=2 | 68565MiB | 69372MiB | ✅ Works |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@nuclearwu nuclearwu changed the title [feature]: support Flux.2-dev CFG-Parallel [Feature]: support Flux.2-dev CFG-Parallel Mar 19, 2026
@nuclearwu
Contributor Author

cc @wtomin @hsliuustc0106

@hsliuustc0106
Collaborator

does it apply to all flux.2 family models? what's the recommended parallel strategy if we have 2/4 devices?


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 395e644c12


@wtomin
Collaborator

wtomin commented Mar 20, 2026

Does the CFG-parallel speed problem still persist, as you mentioned in the wechat group?

Can you show me the speed comparison of cfg-sequential plan before and after this PR? @nuclearwu

Collaborator

@lishunyang12 lishunyang12 left a comment


Left a few comments. Main concern is that do_true_cfg activates unconditionally with the default guidance_scale=4.0, which means every request now pays for 2x transformer forward passes even when the user doesn't intend to use CFG.
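The gating this review asks for could look like the following sketch (hypothetical names, not the PR's actual code): true CFG should engage only when a negative prompt is actually supplied and the scale is above 1.0, so the default `guidance_scale=4.0` by itself never triggers the second transformer forward.

```python
def should_run_true_cfg(negative_prompt, guidance_scale: float) -> bool:
    """Hypothetical gate: run the second (unconditional) transformer
    forward only when the request really uses classifier-free guidance."""
    return negative_prompt is not None and guidance_scale > 1.0

# Default guidance_scale=4.0 with no negative prompt: no extra forward pass.
print(should_run_true_cfg(None, 4.0))        # → False
print(should_run_true_cfg("blurry", 4.0))    # → True
print(should_run_true_cfg("blurry", 1.0))    # → False: scale 1.0 makes CFG a no-op
```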

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review Summary

BLOCKER scan:

Category Result
Correctness PASS
Reliability/Safety PASS
Breaking Changes PASS
Test Coverage PASS - Comprehensive test results in PR body with generated images
Documentation PASS - Tables updated for CFG parallel
Security PASS

OVERALL: 1 ISSUE (merge conflicts, need to be resolved by author)

VERDICT: COMMENT

Issues

  1. [Gate Failure] Merge conflicts - PR cannot be merged. Please rebase on main and resolve conflicts.

Non-blocking observations

  • The PR provides comprehensive testing evidence with generated images across multiple configurations (eager mode, CFG parallel combinations)
  • Code follows the established patterns for CFGParallelMixin (similar to other models like FLUX.1-dev)
  • Documentation tables correctly updated to reflect CFG parallel support
  • No MRO issues: CFGParallelMixin uses metaclass=ABCMeta without __init__

@nuclearwu
Contributor Author

Left a few comments. Main concern is that do_true_cfg activates unconditionally with the default guidance_scale=4.0, which means every request now pays for 2x transformer forward passes even when the user doesn't intend to use CFG.

@lishunyang12 Thank you for your review. I have made the necessary revisions based on the feedback provided.

@nuclearwu
Contributor Author

Does the CFG-parallel speed problem still persist, as you mentioned in the wechat group?

Can you show me the speed comparison of cfg-sequential plan before and after this PR? @nuclearwu

@wtomin You can compare this to #1629, which does not use CFG-Parallel. The generation time for the same prompt is inconsistent; I suspect this is due to the negative prompt.

@nuclearwu
Contributor Author

does it apply to all flux.2 family models? what's the recommended parallel strategy if we have 2/4 devices?

@hsliuustc0106 Machine resources are currently insufficient; only a single machine with multiple cards is available for testing.

@wtomin
Collaborator

wtomin commented Mar 31, 2026

Recently, a refactoring PR related CFG-Parallel got merged #2063. Please double check if this PR affects yours.

The online serving test script test_flux_2_dev_expansion.py does not cover the test case of cfg-parallel. Please add it. For the L4 test, please refer to #1832.

@nuclearwu
Contributor Author

Recently, a refactoring PR related CFG-Parallel got merged #2063. Please double check if this PR affects yours.

The online serving test script test_flux_2_dev_expansion.py does not cover the test case of cfg-parallel. Please add it. For the L4 test, please refer to #1832.

@wtomin I pulled the main branch and tested again; the performance of cfg-parallel has declined, as shown in the table above.

Collaborator

@lishunyang12 lishunyang12 left a comment


Core CFG logic looks correct

@wtomin
Collaborator

wtomin commented Apr 2, 2026

I tried to test the performance before and after #2063 PR got merged, with tests/dfx/perf/scripts/run_diffusion_benchmark.py. The model I used for testing is Qwen-Image. Here is the result:

| commit id | torch.compile | cfg-parallel-size | resolution | num-inference-steps | latency (mean) |
|---|---|---|---|---|---|
| ebc9a8d (latest main branch, after #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.776495 |
| ebc9a8d (latest main branch, after #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.493865 |
| 1ca9429 (before #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.81318 |
| 1ca9429 (before #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.69232 |

This table indicates that, at least for Qwen-Image, #2063 reduces the e2e latency by a small margin, instead of slowing it down. This is in line with the claims in #2063.

Regarding the problems you identified here, I have some suggestions:

  1. Could you evaluate the performance using tests/dfx/perf/scripts/run_diffusion_benchmark.py? All you need to do is to create a config yaml file, and it automatically computes the e2e latency and per-stage durations over 10 runs (diffuse, vae decode, text encoding, etc.)
  2. Could you verify if anything is special about Flux.2-dev CFG-Parallel, making it sensitive to [Diffusion] Refactor CFG parallel for extensibility and performance #2063?

Also cc @TKONIY for some suggestions.

@TKONIY
Contributor

TKONIY commented Apr 2, 2026

I tried to test the performance before and after #2063 PR got merged, with tests/dfx/perf/scripts/run_diffusion_benchmark.py. The model I used for testing is Qwen-Image. Here is the result:

| commit id | torch.compile | cfg-parallel-size | resolution | num-inference-steps | latency (mean) |
|---|---|---|---|---|---|
| ebc9a8d (latest main branch, after #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.776495 |
| ebc9a8d (latest main branch, after #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.493865 |
| 1ca9429 (before #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.81318 |
| 1ca9429 (before #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.69232 |

This table indicates that, at least for Qwen-Image, #2063 reduces the e2e latency by a small margin, instead of slowing it down. This is in line with the claims in #2063.

Regarding the problems you identified here, I have some suggestions:

  1. Could you evaluate the performance using tests/dfx/perf/scripts/run_diffusion_benchmark.py? All you need to do is to create a config yaml file, and it automatically computes the e2e latency and per-stage durations over 10 runs (diffuse, vae decode, text encoding, etc.)
  2. Could you verify if anything is special about Flux.2-dev CFG-Parallel, making it sensitive to [Diffusion] Refactor CFG parallel for extensibility and performance #2063?

Also cc @TKONIY for some suggestions.

I will check it this week. It looks like a compatibility problem between #2063 and torch.compile.

@nuclearwu
Contributor Author

Core CFG logic looks correct

@lishunyang12 Done, please review again.

@nuclearwu
Contributor Author

I tried to test the performance before and after #2063 PR got merged, with tests/dfx/perf/scripts/run_diffusion_benchmark.py. The model I used for testing is Qwen-Image. Here is the result:

| commit id | torch.compile | cfg-parallel-size | resolution | num-inference-steps | latency (mean) |
|---|---|---|---|---|---|
| ebc9a8d (latest main branch, after #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.776495 |
| ebc9a8d (latest main branch, after #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.493865 |
| 1ca9429 (before #2063 got merged) | True | 2 | 1536x1536 | 35 | 12.81318 |
| 1ca9429 (before #2063 got merged) | False | 2 | 1536x1536 | 35 | 15.69232 |

This table indicates that, at least for Qwen-Image, #2063 reduces the e2e latency by a small margin, instead of slowing it down. This is in line with the claims in #2063.

Regarding the problems you identified here, I have some suggestions:

  1. Could you evaluate the performance using tests/dfx/perf/scripts/run_diffusion_benchmark.py? All you need to do is to create a config yaml file, and it automatically computes the e2e latency and per-stage durations over 10 runs (diffuse, vae decode, text encoding, etc.)
  2. Could you verify if anything is special about Flux.2-dev CFG-Parallel, making it sensitive to [Diffusion] Refactor CFG parallel for extensibility and performance #2063?

Also cc @TKONIY for some suggestions.

@wtomin Sorry, I tried again. #2063 reduces the e2e latency by a small margin; I have updated the results in the table above.

@wtomin
Collaborator

wtomin commented Apr 2, 2026

Missing e2e test for CFG parallelism. Please add a test that covers --cfg-parallel-size=2. For the L4 test, please refer to #1832.

Feature support table not updated for Flux.2-dev CFG parallel in docs/user_guide/diffusion_features.md. BTW, in examples/offline_inference/text_to_image/README.md, there is a deprecated hyperlink ../../../docs/user_guide/diffusion_acceleration.md#using-cfg-parallel. Can you help to update it to the correct path to docs/user_guide/diffusion/parallelism/cfg_parallel.md?

Can you also report the peak VRAM usage in your PR body?

@nuclearwu
Contributor Author

Missing e2e test for CFG parallelism. Please add a test that covers --cfg-parallel-size=2. For L4 test, please refer to #1832 .

Feature support table not updated for Flux.2-dev CFG parallel in docs/user_guide/diffusion_features.md. BTW, in examples/offline_inference/text_to_image/README.md, there is a deprecated hyperlink ../../../docs/user_guide/diffusion_acceleration.md#using-cfg-parallel. Can you help to update it to the correct path to docs/user_guide/diffusion/parallelism/cfg_parallel.md?

Can you also report the peak VRAM usage in your PR body?

@wtomin Done. The peak VRAM usage is shown in the tables above, PTAL.

@nuclearwu
Contributor Author

Signed-off-by: wuzhongjian <wuzhongjian_yewu@cmss.chinamobile.com>
Collaborator

@wtomin wtomin left a comment


LGTM.

@wtomin wtomin added the ready label to trigger buildkite CI label Apr 8, 2026
@nuclearwu
Contributor Author

cc @hsliuustc0106

@hsliuustc0106 hsliuustc0106 merged commit f3f2dc5 into vllm-project:main Apr 10, 2026
8 checks passed
Sy0307 pushed a commit to Sy0307/vllm-omni that referenced this pull request Apr 10, 2026
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026