[BugFix] Fix diffusion parallel_config YAML override and add deploy config field allowlist #3483

Merged
lishunyang12 merged 9 commits into vllm-project:main from xiaohajiayou:fix/diffusion-parallel-overrides-v2 on May 27, 2026

@xiaohajiayou
Contributor

@xiaohajiayou xiaohajiayou commented May 10, 2026

Purpose

Diffusion stages in omni multi-stage configs consume parallel settings from engine_args.parallel_config, but CLI overrides were being applied as flat top-level engine_args fields.

This could produce a resolved stage config where:

  • diffusion parallel_config still kept the old values
  • CLI overrides appeared only at the top level
  • LLM and diffusion stages did not handle the same override fields consistently

This change fixes that at stage-config materialization time:

  • for diffusion stages, fields defined by DiffusionParallelConfig are normalized into engine_args.parallel_config
  • those fields are no longer duplicated at the top level for diffusion stages
  • for LLM stages, the existing top-level override behavior is preserved
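
A minimal sketch of the normalization described above, using plain dicts. The field set `DIFFUSION_PARALLEL_FIELDS`, the function `apply_runtime_overrides`, and the `stage_type` strings are hypothetical stand-ins for illustration; the actual change works from the fields defined by `DiffusionParallelConfig` inside the stage-config materialization code.

```python
from copy import deepcopy

# Hypothetical subset of DiffusionParallelConfig field names (illustration only).
DIFFUSION_PARALLEL_FIELDS = {"tensor_parallel_size", "ulysses_degree", "ring_degree"}


def apply_runtime_overrides(stage_type, engine_args, runtime_overrides):
    """Return new engine_args with CLI overrides applied for one stage."""
    resolved = deepcopy(engine_args)
    overrides = dict(runtime_overrides)
    if stage_type == "diffusion":
        nested = dict(resolved.get("parallel_config") or {})
        for key in list(overrides):
            if key in DIFFUSION_PARALLEL_FIELDS:
                # Move the field into the nested config; keep no top-level copy.
                nested[key] = overrides.pop(key)
        if nested:
            resolved["parallel_config"] = nested
    # Everything left (and every override for an LLM stage) stays top-level.
    resolved.update(overrides)
    return resolved


diffusion = apply_runtime_overrides(
    "diffusion",
    {"parallel_config": {"ulysses_degree": 1, "ring_degree": 1}},
    {"ulysses_degree": 2, "max_num_seqs": 4},
)
llm = apply_runtime_overrides("llm", {}, {"tensor_parallel_size": 2})
print(diffusion)
print(llm)
```

In this sketch the diffusion stage ends up with `ulysses_degree` only inside `parallel_config`, while the LLM stage keeps `tensor_parallel_size` at the top level and gets no `parallel_config` at all.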

Test Plan

  • Add a unit test covering diffusion stages with an existing parallel_config, verifying that CLI overrides replace the nested values
  • Add a unit test covering diffusion stages without an existing parallel_config, verifying that the nested config is created from CLI overrides
  • Add a unit test covering LLM stages, verifying that shared parallel fields remain top-level and do not create parallel_config
  • Command:
    pytest -q \
      tests/entrypoints/test_async_omni_diffusion_config.py \
      tests/entrypoints/test_omni_entrypoints.py \
      tests/entrypoints/test_serve.py \
      tests/entrypoints/test_utils.py \
      -m "core_model and (cpu or cuda)"

Test Result

............................................................................................
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
92 passed, 21 warnings in 7.07s

@xiaohajiayou
Contributor Author

A bit more context on why this fix is structured this way:

  1. PR #2264 ([Fix] Bridge flat CLI parallel args into DiffusionParallelConfig before YAML stage-config merge) addressed the issue on the legacy stage-config path.

  2. However, I think the better fix for the current codebase is to handle diffusion parallel override fields at the stage-config materialization layer itself, specifically when applying runtime_overrides for diffusion stages.

The core issue is that these override fields were still being carried as flat top-level engine args, while diffusion stages actually consume them from parallel_config. With the current config pipeline, normalizing those fields during StageConfig.to_omegaconf() keeps the resolved stage config consistent before it reaches downstream diffusion config consumers.

Could you take another look? Thanks! @lishunyang12 @hsliuustc0106

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c1573f9e0

        continue
    if parallel_config_dict is None:
        parallel_config_dict = {}
    parallel_config_dict[key] = runtime_overrides.pop(key)

[P2] Recompute sequence size when moving degree overrides

When a diffusion stage already has a YAML parallel_config with sequence_parallel_size: 1 (for example hunyuan_image3_moe.yaml) and the user only overrides --ulysses-degree or --ring-degree, this writes the new degree into the nested config but leaves the stale sequence_parallel_size. The resulting DiffusionParallelConfig then fails its sequence_parallel_size == ulysses_degree * ring_degree validation unless users also know to pass --sequence-parallel-size; the flat normalization path computes that default automatically, so the nested override path should clear or recompute it when the degrees change.
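
The failure mode described here can be reproduced with a small sketch. `ParallelConfigSketch` is a hypothetical three-field stand-in for `DiffusionParallelConfig`; only the `sequence_parallel_size == ulysses_degree * ring_degree` check is taken from the comment above, and the defaults of 1 are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ParallelConfigSketch:
    """Hypothetical stand-in for DiffusionParallelConfig (three fields only)."""

    ulysses_degree: int = 1
    ring_degree: int = 1
    sequence_parallel_size: int = 1

    def __post_init__(self):
        expected = self.ulysses_degree * self.ring_degree
        if self.sequence_parallel_size != expected:
            raise ValueError(
                f"sequence_parallel_size={self.sequence_parallel_size} "
                f"!= ulysses_degree * ring_degree = {expected}"
            )


# The YAML pins sequence_parallel_size; the CLI overrides only the degree.
nested = {"sequence_parallel_size": 1}
nested["ulysses_degree"] = 2

try:
    ParallelConfigSketch(**nested)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```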

Contributor Author

I think if users explicitly override --ulysses-degree or --ring-degree, they should also make sure the related fields are set correctly.

Collaborator

@lishunyang12 lishunyang12 May 17, 2026

Fair, but the flat path in _create_default_diffusion_stage_cfg auto-derives sequence_parallel_size = ulysses_degree * ring_degree when it's unset, so today overriding only --ulysses-degree/--ring-degree works without touching SP. With this change that silently breaks for existing configs like hunyuan_image3_moe.yaml (YAML pins sequence_parallel_size: 1, the override bumps the degree, and validation then fails). Can we mirror that derivation here when a degree is overridden and SP wasn't explicitly set, so the two paths stay consistent?

Contributor Author

@xiaohajiayou xiaohajiayou May 26, 2026

Fixed in d06c7f0:

  • if ulysses_degree or ring_degree is overridden and sequence_parallel_size is not explicitly set, we recompute sequence_parallel_size = ulysses_degree * ring_degree in the nested parallel_config override path.
  • Also added regression tests for both the recompute case and the explicit-override case.
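
A minimal sketch of the recompute rule described in the first bullet, on plain dicts. `move_parallel_overrides` is a hypothetical helper name, and treating a missing degree as 1 is an assumption; the actual implementation is in commit d06c7f0.

```python
def move_parallel_overrides(parallel_config, overrides):
    """Merge overrides into a nested parallel_config dict, keeping SP consistent."""
    merged = {**parallel_config, **overrides}
    degree_changed = "ulysses_degree" in overrides or "ring_degree" in overrides
    if degree_changed and "sequence_parallel_size" not in overrides:
        # SP was not set explicitly, so derive it from the (possibly new) degrees.
        merged["sequence_parallel_size"] = (
            merged.get("ulysses_degree", 1) * merged.get("ring_degree", 1)
        )
    return merged


yaml_cfg = {"ulysses_degree": 1, "ring_degree": 1, "sequence_parallel_size": 1}

# Only the degree is overridden: SP is recomputed.
recomputed = move_parallel_overrides(yaml_cfg, {"ulysses_degree": 2})

# SP is passed explicitly: the explicit value is kept as given.
explicit = move_parallel_overrides(
    yaml_cfg, {"ulysses_degree": 2, "ring_degree": 2, "sequence_parallel_size": 4}
)
print(recomputed)
print(explicit)
```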

@hsliuustc0106
Collaborator

this PR looks better than the previous one :)

@xiaohajiayou
Contributor Author

> this PR looks better than the previous one :)

Haha sorry, ran into some local branch issues on the previous one

@xiaohajiayou xiaohajiayou changed the title from "Fix diffusion parallel_config YAML override and add deploy config field allowlist" to "[BugFix] Fix diffusion parallel_config YAML override and add deploy config field allowlist" May 10, 2026
@xiaohajiayou xiaohajiayou requested a review from tzhouam as a code owner May 10, 2026 17:22
@xiaohajiayou
Contributor Author

xiaohajiayou commented May 10, 2026

Added two follow-up fixes in 3a5d176 and 9fc0ba3:

  1. Parser-nullified HSDP override fields now fall back to diffusion defaults, so None no longer propagates into DiffusionParallelConfig.
  2. Updated the related entrypoint tests to match the new nullify semantics, including replacing the old hsdp_shard_size control field in the legacy nullify checks.
  • Test Result
............................................................................................
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
92 passed, 21 warnings in 7.07s
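
The first follow-up fix above (parser-nullified fields falling back to diffusion defaults) can be sketched roughly as follows. `HSDP_DEFAULTS_SKETCH`, its values, the `hsdp_replicate_size` key, and `fill_nullified` are hypothetical; only `hsdp_shard_size` and the "None must not reach `DiffusionParallelConfig`" behaviour come from the comment.

```python
# Hypothetical defaults; the real ones are defined by DiffusionParallelConfig.
HSDP_DEFAULTS_SKETCH = {"hsdp_shard_size": -1, "hsdp_replicate_size": 1}


def fill_nullified(overrides, defaults):
    """Replace None-valued overrides for known fields with their defaults."""
    return {
        key: defaults[key] if value is None and key in defaults else value
        for key, value in overrides.items()
    }


# The parser left hsdp_shard_size as None; the other field was set by the user.
parsed = {"hsdp_shard_size": None, "hsdp_replicate_size": 2}
resolved = fill_nullified(parsed, HSDP_DEFAULTS_SKETCH)
print(resolved)
```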

@hsliuustc0106
Collaborator

LGTM. This is a solid bugfix with comprehensive test coverage.

The fix correctly normalizes diffusion parallel overrides into the nested parallel_config while preserving LLM stage behavior. The test cases are thorough and verify both scenarios.

@lishunyang12 lishunyang12 added the ready label to trigger buildkite CI label May 17, 2026
xiaohajiayou and others added 6 commits May 26, 2026 19:26
Co-authored-by: zzhuoxin1508 <zzhuoxin1508@users.noreply.github.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou xiaohajiayou force-pushed the fix/diffusion-parallel-overrides-v2 branch from b324593 to 583af4c May 26, 2026 11:37
Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou xiaohajiayou force-pushed the fix/diffusion-parallel-overrides-v2 branch from 583af4c to d06c7f0 May 26, 2026 11:38
@xiaohajiayou
Contributor Author

xiaohajiayou commented May 26, 2026

Updated this PR since the earlier review. Main changes are:

  • Normalize diffusion parallel deploy/runtime overrides into nested engine_args.parallel_config, while keeping LLM stages on the existing top-level path. This also fixes the diffusion-stage override issue mentioned in [RFC]: Model-Aware Argument Default Resolution #3735.
  • Add the diffusion parallel override fields to the deploy schema. This is still needed even if [Config Refactor] Introduce unified VllmOmniConfig and consolidate OmniEngineArgs #3672 later moves fields into VllmOmniConfig, since the YAML/default override mechanism depends on these fields being present.
  • Preserve the legacy SP behavior called out in review:
    • if ulysses_degree or ring_degree is overridden and sequence_parallel_size is not explicitly set, recompute sequence_parallel_size = ulysses_degree * ring_degree.
  • Add regression tests for the nested override path and the SP recompute/preserve-explicit cases.
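
To illustrate why the override fields need to be present in the deploy schema, here is a rough sketch of a field allowlist derived from a schema dataclass. `DeploySchemaSketch`, `ALLOWED_FIELDS`, and `split_overrides` are hypothetical names, and whether the real allowlist drops or rejects unknown keys is not stated in this thread; the sketch simply reports them.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DeploySchemaSketch:
    """Hypothetical deploy schema; real field names and defaults may differ."""

    model: str = ""
    ulysses_degree: Optional[int] = None
    ring_degree: Optional[int] = None


# A key can only be overridden if the schema declares it.
ALLOWED_FIELDS = {f.name for f in fields(DeploySchemaSketch)}


def split_overrides(raw):
    """Separate overrides the schema knows about from ones it does not."""
    accepted = {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}
    rejected = sorted(set(raw) - ALLOWED_FIELDS)
    return accepted, rejected


accepted, rejected = split_overrides({"ulysses_degree": 2, "not_a_field": 1})
print(accepted, rejected)
```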

Could you take another look when you have time? A review here would help unblock the follow-up cleanup on the same config path.

@hsliuustc0106
Collaborator

@alex-jw-brooks @wuhang2014 PTAL for the final check

@xiaohajiayou
Contributor Author

xiaohajiayou commented May 27, 2026

Considering the need for #3819 and since the CI has already passed, could you please take a look and see if this PR can be merged when you have time?
@lishunyang12 @hsliuustc0106

@lishunyang12
Collaborator

Pulled this onto a local multi-GPU box and gave it a proper run-through, both the static config path and a real generation.

On the config side I went through all 13 DiffusionParallelConfig fields one by one and across the diffusion deploy YAMLs — each override lands in engine_args.parallel_config instead of leaking at top level, the LLM stages stay untouched, and sequence_parallel_size recomputes as ulysses × ring (with an explicit value still winning when set). The --usp / --ring aliases parse through vllm serve --omni fine too.

Then ran Wan2.2-TI2V-5B for real on 4 GPUs with ulysses_degree=2 + cfg_parallel_size=2 + HSDP, passing the flags individually so your normalization actually gets exercised. Worker logs match the flags:

  • Building SP subgroups (sp_size=2, ulysses=2, ring=1)
  • CFG splits into [0,1] / [2,3]
  • HSDP Inference: replicate_size=1, shard_size=4, world_size=4
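
The numbers in these log lines follow from simple arithmetic on the flags, which the sketch below checks. The contiguous rank layout in `contiguous_groups` is an assumption chosen to match the logged CFG split, not something read from the worker code, and both function names are hypothetical.

```python
def sequence_parallel_size(ulysses_degree, ring_degree):
    """SP size as the product of the two degrees."""
    return ulysses_degree * ring_degree


def contiguous_groups(world_size, num_groups):
    """Split ranks 0..world_size-1 into num_groups contiguous, equal-sized groups."""
    if world_size % num_groups:
        raise ValueError("world_size must be divisible by num_groups")
    size = world_size // num_groups
    return [list(range(start, start + size)) for start in range(0, world_size, size)]


world_size = 4
sp_size = sequence_parallel_size(ulysses_degree=2, ring_degree=1)
cfg_groups = contiguous_groups(world_size, num_groups=2)  # cfg_parallel_size=2
hsdp_world = 1 * 4  # replicate_size * shard_size from the HSDP log line

print(sp_size, cfg_groups, hsdp_world)
```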

Generation finished without issues, so the override path does what it says. LGTM from me.

One unrelated thing I ran into: with vae_patch_parallel_size < world_size (I had 2 < 4) I hit an IndexError at distributed_vae_executor.py:128. Your change carries the value through correctly — the executor just doesn't handle the asymmetric layout. Probably worth a separate issue. I reran with vae_patch_parallel_size=1 to confirm everything else passes.

@lishunyang12 lishunyang12 merged commit 37eebff into vllm-project:main May 27, 2026
8 checks passed
@xiaohajiayou xiaohajiayou mentioned this pull request May 28, 2026
5 tasks
david6666666 pushed a commit to lishunyang12/vllm-omni that referenced this pull request May 31, 2026
…onfig field allowlist (vllm-project#3483)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: zzhuoxin1508 <zzhuoxin1508@users.noreply.github.com>

Signed-off-by: WeiQing Chen <david6666666@users.noreply.github.com>
zengchuang-hw pushed a commit to zengchuang-hw/vllm-omni that referenced this pull request Jun 1, 2026
…onfig field allowlist (vllm-project#3483)

Signed-off-by: xiaohajiayou <923390377@qq.com>
Co-authored-by: zzhuoxin1508 <zzhuoxin1508@users.noreply.github.com>