[Bugfix] align Bagel diffusion parallel config docs and stage YAMLs #2636
xiaohajiayou wants to merge 2 commits into vllm-project:main
Conversation
Force-pushed from 04fb23d to be1c649.
Signed-off-by: xiaohajiayou <923390377@qq.com>
Force-pushed from be1c649 to e861d03.
ianliuy left a comment:
LGTM overall. The YAML fixes are correct, and the code path confirms that the top-level `tensor_parallel_size` was silently dropped by `OmniDiffusionConfig.from_kwargs()`. One minor nit below.
@@ -35,6 +35,25 @@ For larger models or multi-GPU environments, you can enable Tensor Parallelism (
1. **Modify Stage Config**: Create or modify a stage configuration yaml (e.g., [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)). Set `tensor_parallel_size` to `2` (or more) and update `devices` to include multiple GPU IDs (e.g., `"0,1"`).
Nit: This line still says "Set `tensor_parallel_size` to `2` (or more)" without distinguishing the LLM vs diffusion stage, which is the whole point of this PR. Consider updating to:

> Set the appropriate TP config field for your stage type (see details below) and update `devices` to include multiple GPU IDs.
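For illustration, the diffusion-stage version of that step could look like the following sketch (the placement of `devices` next to `engine_args` is an assumption for illustration; only the `parallel_config.tensor_parallel_size` path is from this PR):

```yaml
# Hypothetical diffusion (DiT) stage snippet -- surrounding keys are illustrative.
devices: "0,1"                  # two GPU IDs, as the doc step suggests
engine_args:
  parallel_config:
    tensor_parallel_size: 2     # diffusion stages read TP from parallel_config
```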
lishunyang12 left a comment:
Review: [Bugfix] align Bagel diffusion parallel config docs and stage YAMLs
YAML changes (stage configs) -- looks good
The YAML changes across all four files are correct and consistent:
- `bagel.yaml`, `bagel_multiconnector.yaml`, and `xpu/bagel.yaml` all move `tensor_parallel_size` from a top-level `engine_args` field into `engine_args.parallel_config.tensor_parallel_size` for the diffusion (DiT) stage.
- `bagel_usp2.yaml` correctly moves `tensor_parallel_size: 1` inside the existing `parallel_config` block alongside `ulysses_degree`.
- LLM stage configs are left unchanged, which is the correct behavior.
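For reference, a minimal before/after sketch of that move; only the `engine_args` and `parallel_config` field paths are taken from this PR, and everything else is elided:

```yaml
# Before: top-level key, silently dropped by OmniDiffusionConfig.from_kwargs()
engine_args:
  tensor_parallel_size: 2

# After: nested under parallel_config, the path the diffusion runtime actually reads
engine_args:
  parallel_config:
    tensor_parallel_size: 2
```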
Documentation -- needs a fix in online_serving/bagel.md
Issue: stale text in online_serving/bagel.md (line 36)
The PR inserts the new LLM-vs-diffusion explanation block after the existing step-1 text, but that existing text was not updated. Currently line 36 still reads:
> **Modify Stage Config**: ... Set `tensor_parallel_size` to `2` (or more) and update `devices` to include multiple GPU IDs (e.g., `"0,1"`).
This gives the old undifferentiated advice (`tensor_parallel_size` at top level) and contradicts the new distinction introduced immediately below it. The code block on lines 38-44 (the pre-existing LLM-style snippet) also lacks any label like "Example for the LLM stage" to match the structure of the newly inserted diffusion example.
Suggestion: Either (a) rewrite step 1 to be a generic intro (e.g., "Modify Stage Config: Create or modify a stage configuration yaml ... See below for TP config details for each stage type.") and remove the now-unlabeled code block, or (b) replace the existing step 1 + code block entirely with the new structured text -- the same way the offline doc was cleaned up. The offline doc (offline_inference/bagel.md) handles this cleanly; the online doc should match.
Minor: offline_inference/bagel.md
The new block starting with "In multi-stage omni models..." is inserted right after the intro paragraph with no transition. Consider adding a brief connecting sentence or a blank line + heading to improve readability. This is a nit, not a blocker.
Summary
YAML changes are correct. The offline doc update is clean. The online serving doc has a leftover stale instruction that should be updated for consistency. Requesting a small fix there before merge.
What is the relationship with #2936?
Closing this since #2936 already completed this refactor. |
Purpose
Fix #2635
Bagel docs and several Bagel stage YAMLs were still using or describing diffusion-stage tensor parallelism with a top-level `tensor_parallel_size`, which is inconsistent with the current diffusion runtime path. This PR aligns Bagel multi-stage docs and stage configs with the current diffusion parallel config path.
In multi-stage omni models:

- LLM stages configure TP with the top-level `engine_args.tensor_parallel_size`.
- Diffusion stages configure parallelism under `engine_args.parallel_config.*`.
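A hedged sketch of the two paths side by side; the stage wrapper keys below are illustrative, not copied from the actual Bagel YAMLs:

```yaml
stages:
  - name: llm                     # hypothetical stage key, for illustration only
    engine_args:
      tensor_parallel_size: 2     # LLM stage: top-level field
  - name: dit
    engine_args:
      parallel_config:
        tensor_parallel_size: 2   # diffusion stage: nested under parallel_config
        # (bagel_usp2.yaml also keeps ulysses_degree in this block)
```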
Changes

- Move the diffusion-stage TP setting into `engine_args.parallel_config.tensor_parallel_size` (instead of the top-level `engine_args.tensor_parallel_size`).
- Update the docs to describe the `parallel_config` path for diffusion stages.

Files changed
- `vllm_omni/model_executor/stage_configs/bagel.yaml`
- `vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml`
- `vllm_omni/model_executor/stage_configs/bagel_usp2.yaml`
- `vllm_omni/platforms/xpu/stage_configs/bagel.yaml`
- `docs/user_guide/examples/online_serving/bagel.md`
- `docs/user_guide/examples/offline_inference/bagel.md`