
[Bugfix] align Bagel diffusion parallel config docs and stage YAMLs #2636

Closed
xiaohajiayou wants to merge 2 commits into vllm-project:main from xiaohajiayou:docs-bagel-parallel-config

Conversation

@xiaohajiayou (Contributor) commented on Apr 9, 2026


Purpose

Fix #2635
Bagel docs and several Bagel stage YAMLs still configured (or described) diffusion-stage tensor parallelism via a top-level tensor_parallel_size, which is inconsistent with the current diffusion runtime path.
This PR aligns Bagel multi-stage docs and stage configs with the current diffusion parallel config path.

In multi-stage omni models, the TP setting lives in a different place depending on the stage type (see the sketch after this list):

  • LLM stages use top-level engine args such as engine_args.tensor_parallel_size
  • Diffusion stages use engine_args.parallel_config.*
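
A minimal two-stage sketch of the distinction (stage names and the surrounding layout are illustrative, not copied from the actual Bagel YAMLs; only the placement of the TP keys reflects this PR):

```yaml
stages:
  # LLM stage: TP is a top-level engine arg
  - name: llm
    engine_args:
      tensor_parallel_size: 2
  # Diffusion (DiT) stage: TP lives under parallel_config
  - name: dit
    engine_args:
      parallel_config:
        tensor_parallel_size: 2
```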

Changes

  • Update Bagel diffusion stage YAMLs to use engine_args.parallel_config.tensor_parallel_size
  • Keep Bagel LLM stage TP config unchanged (engine_args.tensor_parallel_size)
  • Update Bagel online/offline docs to explicitly distinguish:
    • LLM stage TP config
    • diffusion stage TP config
  • Normalize the Bagel Ulysses stage config to keep diffusion parallel settings under parallel_config (before/after sketch below)
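
A hedged before/after sketch of that normalization, based on the review's description of bagel_usp2.yaml (the ulysses_degree value is illustrative, not taken from the file):

```yaml
# Before (hypothetical): TP key sits at the top level of engine_args,
# outside the existing parallel_config block
engine_args:
  tensor_parallel_size: 1
  parallel_config:
    ulysses_degree: 2
---
# After: all diffusion parallel settings grouped under parallel_config
engine_args:
  parallel_config:
    tensor_parallel_size: 1
    ulysses_degree: 2
```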

Files changed

  • vllm_omni/model_executor/stage_configs/bagel.yaml
  • vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
  • vllm_omni/model_executor/stage_configs/bagel_usp2.yaml
  • vllm_omni/platforms/xpu/stage_configs/bagel.yaml
  • docs/user_guide/examples/online_serving/bagel.md
  • docs/user_guide/examples/offline_inference/bagel.md

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code does not require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation edits to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.



@xiaohajiayou changed the title from "Docs bagel parallel config" to "[Bugfix] align Bagel diffusion parallel config docs and stage YAMLs" on Apr 9, 2026
@xiaohajiayou force-pushed the docs-bagel-parallel-config branch from 04fb23d to be1c649 on April 9, 2026 08:28
Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou force-pushed the docs-bagel-parallel-config branch from be1c649 to e861d03 on April 9, 2026 08:32
@princepride enabled auto-merge (squash) on April 9, 2026 08:43
@princepride (Collaborator) left a comment:


LGTM

@lishunyang12 disabled auto-merge on April 9, 2026 13:16
@ianliuy (Contributor) left a comment:


LGTM overall. The YAML fixes are correct, and the code path confirms that the top-level tensor_parallel_size was silently dropped by OmniDiffusionConfig.from_kwargs(). One minor nit below.

@@ -35,6 +35,25 @@ For larger models or multi-GPU environments, you can enable Tensor Parallelism (

1. **Modify Stage Config**: Create or modify a stage configuration yaml (e.g., [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)). Set `tensor_parallel_size` to `2` (or more) and update `devices` to include multiple GPU IDs (e.g., `"0,1"`).

Nit: This line still says "Set tensor_parallel_size to 2 (or more)" without distinguishing LLM vs diffusion stage, which is the whole point of this PR. Consider updating to:

Set the appropriate TP config field for your stage type (see details below) and update devices to include multiple GPU IDs.

@lishunyang12 (Collaborator) left a comment:


Review: [Bugfix] align Bagel diffusion parallel config docs and stage YAMLs

YAML changes (stage configs) -- looks good

The YAML changes across all four files are correct and consistent (see the sketch after this list):

  • bagel.yaml, bagel_multiconnector.yaml, and xpu/bagel.yaml all move tensor_parallel_size from a top-level engine_args field into engine_args.parallel_config.tensor_parallel_size for the diffusion (DiT) stage.
  • bagel_usp2.yaml correctly moves tensor_parallel_size: 1 inside the existing parallel_config block alongside ulysses_degree.
  • LLM stage configs are left unchanged, which is the correct behavior.
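
A hedged sketch of the move described above for a plain diffusion (DiT) stage (surrounding keys omitted; the TP value is illustrative):

```yaml
# Before: TP key at the top level of engine_args, where the
# diffusion path silently dropped it
engine_args:
  tensor_parallel_size: 2
---
# After: TP key nested under parallel_config, the path the
# diffusion runtime actually reads
engine_args:
  parallel_config:
    tensor_parallel_size: 2
```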

Documentation -- needs a fix in online_serving/bagel.md

Issue: stale text in online_serving/bagel.md (line 36)

The PR inserts the new LLM-vs-diffusion explanation block after the existing step-1 text, but that existing text was not updated. Currently line 36 still reads:

  1. Modify Stage Config: ... Set tensor_parallel_size to 2 (or more) and update devices to include multiple GPU IDs (e.g., "0,1").

This gives the old undifferentiated advice (tensor_parallel_size at top level) and contradicts the new distinction introduced immediately below it. The code block on lines 38-44 (the pre-existing LLM-style snippet) also lacks any label like "Example for the LLM stage" to match the structure of the newly inserted diffusion example.

Suggestion: Either (a) rewrite step 1 to be a generic intro (e.g., "Modify Stage Config: Create or modify a stage configuration yaml ... See below for TP config details for each stage type.") and remove the now-unlabeled code block, or (b) replace the existing step 1 + code block entirely with the new structured text -- the same way the offline doc was cleaned up. The offline doc (offline_inference/bagel.md) handles this cleanly; the online doc should match.

Minor: offline_inference/bagel.md

The new block starting with "In multi-stage omni models..." is inserted right after the intro paragraph with no transition. Consider adding a brief connecting sentence or a blank line + heading to improve readability. This is a nit, not a blocker.

Summary

YAML changes are correct. The offline doc update is clean. The online serving doc has a leftover stale instruction that should be updated for consistency. Requesting a small fix there before merge.

@lishunyang12 (Collaborator) commented:

What is the relationship between this PR and #2936?

@xiaohajiayou (Contributor, Author) commented:

Closing this since #2936 already completed this refactor.

Linked issue: [Bug]: bagel model still use inconsistent parallel config fields in docs/YAML