[Bugfix] pass TP size to diffusion config #2867
natureofnature wants to merge 2 commits into vllm-project:main
Conversation
Force-pushed from e76b07a to f3f86b5 (Compare)
@codex review
Codex Review: Didn't find any major issues. Keep it up!
@NumberWan PTAL
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 596da6a029
| """ | ||
| # Dit devices start from 0, due to CI GPU usage constraint, | ||
| # for those GPUs that encountered OOM, adjust the offset accordingly. | ||
| devices = ",".join(str(i + offset) for i in range(tp_size)) |
Derive TP devices from visible GPU count
Hardcoding offset=1 makes the TP=2 case request runtime.devices="1,2"; with exactly two visible GPUs, only one of those ids maps to a real device, and stage initialization then fails because TP still requires 2 ranks. Since this test is still marked num_cards=2, it runs (rather than skips) on 2-GPU environments and fails for infrastructure reasons rather than model behavior.
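A minimal sketch of the bot's suggestion, clamping the offset to the GPUs that are actually visible (`visible_tp_devices` is a hypothetical helper, not code from this PR):

```python
import torch

def visible_tp_devices(tp_size: int) -> str:
    # Hypothetical helper: derive TP device ids from the visible GPU count
    # instead of a hardcoded offset, so a 2-GPU runner never requests "1,2".
    num_visible = torch.cuda.device_count()
    if num_visible < tp_size:
        raise RuntimeError(f"TP={tp_size} needs {tp_size} GPUs, got {num_visible}")
    # Use the highest-indexed GPUs, leaving lower ids free for other stages.
    offset = num_visible - tp_size
    return ",".join(str(offset + i) for i in range(tp_size))
```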
| "stage_args": { | ||
| 1: { | ||
| "runtime.devices": devices, | ||
| "engine_args.parallel_config.tensor_parallel_size": tp_size, |
Keep Bagel TP regression on CLI tensor-parallel path
This TP case now sets engine_args.parallel_config.tensor_parallel_size directly in YAML, which bypasses the --tensor-parallel-size CLI flow that this commit is intended to fix. That means CI can pass even if top-level CLI TP propagation regresses again, because the test no longer exercises the user-facing path where tensor_parallel_size was previously dropped.
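One hedged way to keep that coverage on the CLI path (the flag name comes from this PR's description; the exact test-harness argument list is assumed):

```python
# Sketch: drive TP through the user-facing CLI flag instead of a nested
# YAML override, so a regression in CLI propagation fails this test again.
tp_size = 2
serve_args = [
    "--tensor-parallel-size", str(tp_size),  # the flag this PR fixes
    # ...other serve args unchanged; no engine_args.parallel_config override
]
```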
Please resolve the conflict and restore the test skipped in PR #2883.
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Force-pushed from 596da6a to a41e8f0 (Compare)
from pathlib import Path

# This test uses the default Bagel YAML; the CLI does not control devices, so we modify the YAML file directly.
_BAGEL_DEFAULT_YAML = str(
    Path(__file__).resolve().parents[3] / "vllm_omni" / "model_executor" / "stage_configs" / "bagel.yaml"
)
please use get_deploy_config_path()
There's no Bagel YAML in the deploy folder.
Please resolve some comments from the bot.
Purpose
Reason:
Fix CLI --tensor-parallel-size being silently dropped by the diffusion engine, causing a shape mismatch error during KV cache transfer when running with TP > 1.
Root cause:
OmniDiffusionConfig.from_kwargs() filters kwargs to only include fields defined directly on OmniDiffusionConfig. Since tensor_parallel_size is a field of the nested DiffusionParallelConfig (not a top-level field), it was silently discarded. This meant the DiT stage always ran with TP=1 regardless of the CLI argument.
After PR #2705 introduced _inject_inferred_kv_tp_topology, the KV transfer manager correctly inferred a heterogeneous topology (e.g. from_tp=1, to_tp=2) based on the configured TP sizes. However, the DiT stage was actually running with TP=1 due to the parameter dropping, so it expected full KV heads (e.g. 4) while receiving sliced heads (e.g. 2), resulting in:
shape mismatch: value tensor of shape [47, 2, 128] cannot be broadcast to indexing result of shape [47, 4, 128]
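A toy reproduction of that broadcast failure, with shapes lifted from the error above (the real KV-transfer code is more involved):

```python
import torch

kv_cache = torch.zeros(47, 4, 128)  # DiT actually running TP=1: expects all 4 KV heads
incoming = torch.randn(47, 2, 128)  # sender sliced for the inferred to_tp=2: 2 heads
slots = torch.arange(47)
kv_cache[slots] = incoming          # RuntimeError: shape mismatch: value tensor of
                                    # shape [47, 2, 128] cannot be broadcast to
                                    # indexing result of shape [47, 4, 128]
```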
Fix
In OmniDiffusionConfig.from_kwargs(), forward the top-level tensor_parallel_size into parallel_config before field filtering, so CLI arguments propagate correctly to the diffusion engine. If parallel_config already explicitly sets tensor_parallel_size (e.g. from YAML), the existing value is preserved.
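A minimal sketch of that fix, using simplified stand-in dataclasses (the real OmniDiffusionConfig has many more fields and may construct parallel_config differently):

```python
import dataclasses

@dataclasses.dataclass
class DiffusionParallelConfig:
    tensor_parallel_size: int = 1

@dataclasses.dataclass
class OmniDiffusionConfig:
    parallel_config: DiffusionParallelConfig = dataclasses.field(
        default_factory=DiffusionParallelConfig
    )

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Forward the top-level CLI value into the nested parallel config
        # *before* field filtering, so it is no longer silently dropped.
        tp = kwargs.pop("tensor_parallel_size", None)
        pc = dict(kwargs.get("parallel_config") or {})
        if tp is not None and "tensor_parallel_size" not in pc:
            pc["tensor_parallel_size"] = tp  # an explicit YAML value wins
        kwargs["parallel_config"] = DiffusionParallelConfig(**pc)
        known = {f.name for f in dataclasses.fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in known})

cfg = OmniDiffusionConfig.from_kwargs(tensor_parallel_size=2, unknown_cli_arg=True)
assert cfg.parallel_config.tensor_parallel_size == 2  # was dropped before the fix
```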
Test Plan
In my local environment I changed the DiT devices to 1,2, because the default YAML settings cause OOM on my GPU when stages 0 and 1 share a device (the defaults set the DiT starting offset to GPU 0).
Test Result
@yenuo26 @princepride
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.