
[Bugfix]pass TP size to diffusion config#2867

Open
natureofnature wants to merge 2 commits into vllm-project:main from natureofnature:bugfix/cli_diffusion_args

Conversation

@natureofnature
Contributor

@natureofnature natureofnature commented Apr 17, 2026


Purpose

  1. Solve issue [CI Failure]: Diffusion X2I(&A&T) · Function Test with H100, test_bagel_expansion.py, openai.InternalServerError: Error code: 500 #2862.
  2. When TP size > 1, the CI still used a single GPU device, which caused a device-usage error. This PR allocates TP_size GPUs for the test.

Reason:

Fix CLI --tensor-parallel-size being silently dropped by the diffusion engine, causing a shape mismatch error during KV cache transfer when running with TP > 1.

Root cause:

OmniDiffusionConfig.from_kwargs() filters kwargs to only include fields defined directly on OmniDiffusionConfig. Since tensor_parallel_size is a field of the nested DiffusionParallelConfig (not a top-level field), it was silently discarded. This meant the DiT stage always ran with TP=1 regardless of the CLI argument.
After PR #2705 introduced _inject_inferred_kv_tp_topology, the KV transfer manager correctly inferred a heterogeneous topology (e.g. from_tp=1, to_tp=2) from the configured TP sizes. However, because the parameter was dropped, the DiT stage was actually running with TP=1, so it expected full KV heads (e.g. 4) while receiving sliced heads (e.g. 2), resulting in:
shape mismatch: value tensor of shape [47, 2, 128] cannot be broadcast to indexing result of shape [47, 4, 128]
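To make the dropping behavior concrete, here is a minimal, self-contained sketch of the field filtering described above. The classes are hypothetical reduced stand-ins for OmniDiffusionConfig and DiffusionParallelConfig (the real classes have many more fields); it only illustrates why a nested field's name never survives a top-level filter.

```python
from dataclasses import dataclass, field, fields

@dataclass
class DiffusionParallelConfig:  # reduced stand-in for the nested config
    tensor_parallel_size: int = 1

@dataclass
class OmniDiffusionConfig:  # reduced stand-in for the top-level config
    model: str = ""
    parallel_config: DiffusionParallelConfig = field(
        default_factory=DiffusionParallelConfig
    )

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Filtering to top-level field names silently drops nested keys:
        # tensor_parallel_size belongs to DiffusionParallelConfig, not here.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in known})

cfg = OmniDiffusionConfig.from_kwargs(model="bagel", tensor_parallel_size=2)
print(cfg.parallel_config.tensor_parallel_size)  # 1 — the CLI value was lost
```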

Fix

In OmniDiffusionConfig.from_kwargs(), forward the top-level tensor_parallel_size into parallel_config before field filtering, so CLI arguments propagate correctly to the diffusion engine. If parallel_config already explicitly sets tensor_parallel_size (e.g. from YAML), the existing value is preserved.
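The forwarding logic can be sketched as follows. This is a hypothetical reduced version, not the actual vllm-omni implementation: the class and field names mirror the PR description, but the real signatures and config plumbing differ.

```python
from dataclasses import dataclass, field, fields

@dataclass
class DiffusionParallelConfig:  # reduced stand-in
    tensor_parallel_size: int = 1

@dataclass
class OmniDiffusionConfig:  # reduced stand-in
    model: str = ""
    parallel_config: DiffusionParallelConfig = field(
        default_factory=DiffusionParallelConfig
    )

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Forward the top-level CLI value into parallel_config *before*
        # field filtering, so it is no longer discarded.
        tp = kwargs.pop("tensor_parallel_size", None)
        pc = kwargs.get("parallel_config", {})
        # An explicit value in parallel_config (e.g. from YAML) wins.
        if tp is not None and "tensor_parallel_size" not in pc:
            pc = {**pc, "tensor_parallel_size": tp}
        kwargs["parallel_config"] = DiffusionParallelConfig(**pc)
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in known})

cfg = OmniDiffusionConfig.from_kwargs(model="bagel", tensor_parallel_size=2)
print(cfg.parallel_config.tensor_parallel_size)  # 2 — CLI value propagated

cfg2 = OmniDiffusionConfig.from_kwargs(
    parallel_config={"tensor_parallel_size": 4}, tensor_parallel_size=2
)
print(cfg2.parallel_config.tensor_parallel_size)  # 4 — explicit value preserved
```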

Test Plan

In my local environment, I changed the DiT devices to 1,2 because the default YAML settings cause OOM on my GPU when stages 0 and 1 share the same device. The default settings set the starting offset of the DiT to GPU 0.

CUDA_VISIBLE_DEVICES=0,1,2,3 pytest tests/e2e/online_serving/test_bagel_expansion.py::test_bagel[parallel_tp_2]   --run-level advanced_model   -v -s   2>&1 | tee /tmp/test_bagel_tp2.log

Test Result

Screenshot from 2026-04-17 11-15-07

@yenuo26 @princepride


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@yenuo26 yenuo26 added the diffusion-x2iat-test label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI label Apr 17, 2026
@natureofnature natureofnature force-pushed the bugfix/cli_diffusion_args branch from e76b07a to f3f86b5 Compare April 17, 2026 03:30
@natureofnature
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep it up!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@natureofnature
Contributor Author

@NumberWan PTAL

@natureofnature
Contributor Author

natureofnature commented Apr 17, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 596da6a029


# DiT devices start from 0; due to the CI GPU usage constraint,
# for GPUs that encountered OOM, adjust the offset accordingly.
devices = ",".join(str(i + offset) for i in range(tp_size))


P2: Derive TP devices from visible GPU count

Hardcoding offset=1 makes the TP=2 case request runtime.devices as "1,2"; with exactly two visible GPUs, this resolves to only one mapped device and stage initialization then fails because TP still requires 2 ranks. Since this test is still marked for num_cards=2, it can run (not skip) on 2-GPU environments and fail for infrastructure reasons rather than model behavior.
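One way to address this suggestion is to clamp the offset to the number of visible GPUs. The helper below is a hypothetical sketch (dit_devices is not a function in the repo); it assumes the device list comes from CUDA_VISIBLE_DEVICES and falls back to offset 0 when the preferred offset would not leave tp_size distinct devices.

```python
import os

def dit_devices(tp_size: int, default_offset: int = 0) -> str:
    """Pick DiT device ids that fit within the visible GPUs.

    If the preferred offset would run past the visible GPU count
    (e.g. offset=1 with only two visible GPUs and tp_size=2), fall
    back to offset 0 so TP still gets tp_size distinct devices.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    num_visible = len(visible.split(",")) if visible else tp_size
    offset = default_offset if default_offset + tp_size <= num_visible else 0
    return ",".join(str(i + offset) for i in range(tp_size))

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(dit_devices(2, default_offset=1))  # falls back to "0,1", not "1,2"
```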


"stage_args": {
1: {
"runtime.devices": devices,
"engine_args.parallel_config.tensor_parallel_size": tp_size,


P2: Keep Bagel TP regression on CLI tensor-parallel path

This TP case now sets engine_args.parallel_config.tensor_parallel_size directly in YAML, which bypasses the --tensor-parallel-size CLI flow that this commit is intended to fix. That means CI can pass even if top-level CLI TP propagation regresses again, because the test no longer exercises the user-facing path where tensor_parallel_size was previously dropped.


Collaborator


Please solve it

@Gaohan123
Collaborator

Please resolve conflict and recover the skip test in PR #2883

Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
@natureofnature natureofnature force-pushed the bugfix/cli_diffusion_args branch from 596da6a to a41e8f0 Compare April 24, 2026 02:00

# This test uses the default Bagel YAML; the CLI does not control devices,
# so we modify the YAML file directly.
_BAGEL_DEFAULT_YAML = str(
Path(__file__).resolve().parents[3] / "vllm_omni" / "model_executor" / "stage_configs" / "bagel.yaml"
Collaborator


please use get_deploy_config_path()

Contributor Author


There's no bagel yaml in the deploy folder

@yenuo26 yenuo26 added ready label to trigger buildkite CI and removed diffusion-x2iat-test label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI labels Apr 24, 2026
@Gaohan123
Copy link
Copy Markdown
Collaborator

Please resolve some comments from the bot
