[Bugfix] pass TP size to diffusion config #2867
natureofnature wants to merge 2 commits into vllm-project:main
Conversation
Force-pushed from e76b07a to f3f86b5 (Compare)
@codex review
Codex Review: Didn't find any major issues. Keep it up!
@NumberWan PTAL
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 596da6a029
| """ | ||
| # Dit devices start from 0, due to CI GPU usage constraint, | ||
| # for those GPUs that encountered OOM, adjust the offset accordingly. | ||
| devices = ",".join(str(i + offset) for i in range(tp_size)) |
Derive TP devices from visible GPU count
Hardcoding offset=1 makes the TP=2 case request runtime.devices="1,2"; with exactly two visible GPUs, only one of those ids maps to a real device, and stage initialization then fails because TP still requires 2 ranks. Since this test is still marked num_cards=2, it runs (rather than skips) on 2-GPU environments and fails for infrastructure reasons rather than model behavior.
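A minimal sketch of the bot's suggestion, clamping the offset to the GPUs that are actually visible (`visible_tp_devices` is a hypothetical helper, not code from this PR):

```python
import torch

def visible_tp_devices(tp_size: int) -> str:
    # Hypothetical helper: derive TP device ids from the visible GPU count
    # instead of a hardcoded offset, so a 2-GPU runner never requests "1,2".
    num_visible = torch.cuda.device_count()
    if num_visible < tp_size:
        raise RuntimeError(f"TP={tp_size} needs {tp_size} GPUs, got {num_visible}")
    # Use the highest-indexed GPUs, leaving lower ids free for other stages.
    offset = num_visible - tp_size
    return ",".join(str(offset + i) for i in range(tp_size))
```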
| "stage_args": { | ||
| 1: { | ||
| "runtime.devices": devices, | ||
| "engine_args.parallel_config.tensor_parallel_size": tp_size, |
Keep Bagel TP regression on CLI tensor-parallel path
This TP case now sets engine_args.parallel_config.tensor_parallel_size directly in YAML, which bypasses the --tensor-parallel-size CLI flow that this commit is intended to fix. That means CI can pass even if top-level CLI TP propagation regresses again, because the test no longer exercises the user-facing path where tensor_parallel_size was previously dropped.
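One hedged way to keep that coverage on the CLI path (the flag name comes from this PR's description; the exact test-harness argument list is assumed):

```python
# Sketch: drive TP through the user-facing CLI flag instead of a nested
# YAML override, so a regression in CLI propagation fails this test again.
tp_size = 2
serve_args = [
    "--tensor-parallel-size", str(tp_size),  # the flag this PR fixes
    # ...other serve args unchanged; no engine_args.parallel_config override
]
```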
Please resolve the conflict and restore the test skipped in PR #2883.
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Force-pushed from 596da6a to a41e8f0 (Compare)
from pathlib import Path

# This test uses the default Bagel YAML; the CLI does not control devices, so we modify the YAML file directly.
_BAGEL_DEFAULT_YAML = str(
    Path(__file__).resolve().parents[3] / "vllm_omni" / "model_executor" / "stage_configs" / "bagel.yaml"
)
please use get_deploy_config_path()
There's no Bagel YAML in the deploy folder.
Please resolve some comments from the bot.
Purpose
Reason:
Fix CLI --tensor-parallel-size being silently dropped by the diffusion engine, causing a shape mismatch error during KV cache transfer when running with TP > 1.
Root cause:
OmniDiffusionConfig.from_kwargs() filters kwargs to only include fields defined directly on OmniDiffusionConfig. Since tensor_parallel_size is a field of the nested DiffusionParallelConfig (not a top-level field), it was silently discarded. This meant the DiT stage always ran with TP=1 regardless of the CLI argument.
After PR #2705 introduced _inject_inferred_kv_tp_topology, the KV transfer manager correctly inferred a heterogeneous topology (e.g. from_tp=1, to_tp=2) based on the configured TP sizes. However, the DiT stage was actually running with TP=1 due to the parameter dropping, so it expected full KV heads (e.g. 4) while receiving sliced heads (e.g. 2), resulting in:
shape mismatch: value tensor of shape [47, 2, 128] cannot be broadcast to indexing result of shape [47, 4, 128]
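A toy reproduction of that broadcast failure, with shapes lifted from the error above (the real KV-transfer code is more involved):

```python
import torch

kv_cache = torch.zeros(47, 4, 128)  # DiT actually running TP=1: expects all 4 KV heads
incoming = torch.randn(47, 2, 128)  # sender sliced for the inferred to_tp=2: 2 heads
slots = torch.arange(47)
kv_cache[slots] = incoming          # RuntimeError: shape mismatch: value tensor of
                                    # shape [47, 2, 128] cannot be broadcast to
                                    # indexing result of shape [47, 4, 128]
```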
Fix
In OmniDiffusionConfig.from_kwargs(), forward the top-level tensor_parallel_size into parallel_config before field filtering, so CLI arguments propagate correctly to the diffusion engine. If parallel_config already explicitly sets tensor_parallel_size (e.g. from YAML), the existing value is preserved.
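A minimal sketch of that fix, using simplified stand-in dataclasses (the real OmniDiffusionConfig has many more fields and may construct parallel_config differently):

```python
import dataclasses

@dataclasses.dataclass
class DiffusionParallelConfig:
    tensor_parallel_size: int = 1

@dataclasses.dataclass
class OmniDiffusionConfig:
    parallel_config: DiffusionParallelConfig = dataclasses.field(
        default_factory=DiffusionParallelConfig
    )

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Forward the top-level CLI value into the nested parallel config
        # *before* field filtering, so it is no longer silently dropped.
        tp = kwargs.pop("tensor_parallel_size", None)
        pc = dict(kwargs.get("parallel_config") or {})
        if tp is not None and "tensor_parallel_size" not in pc:
            pc["tensor_parallel_size"] = tp  # an explicit YAML value wins
        kwargs["parallel_config"] = DiffusionParallelConfig(**pc)
        known = {f.name for f in dataclasses.fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in known})

cfg = OmniDiffusionConfig.from_kwargs(tensor_parallel_size=2, unknown_cli_arg=True)
assert cfg.parallel_config.tensor_parallel_size == 2  # was dropped before the fix
```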
Test Plan
In my local environment I changed the DiT devices to 1,2, because the default YAML settings cause OOM on my GPU when stages 0 and 1 share a device (the defaults set the DiT starting offset to GPU 0).
Test Result
@yenuo26 @princepride
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.