[Config] Remove invalid LLM-only engine_args from diffusion stage configs #2622

Merged: hsliuustc0106 merged 1 commit into vllm-project:main from ianliuy:fix/cleanup-diffusion-stage-configs on Apr 10, 2026

@ianliuy
Contributor

@ianliuy ianliuy commented Apr 9, 2026

Purpose

Fix for: #2563

Remove dead engine_args fields from diffusion stage configs (stage_type: diffusion). These fields were copy-pasted from LLM stage configs and are silently dropped by OmniDiffusionConfig.from_kwargs().

Note: Generation/AR stages (worker_type: generation, worker_type: ar) use OmniEngineArgs, where these fields are actively consumed; they are intentionally left unchanged.

Changes

Diffusion config cleanup (11 YAMLs, 51 lines)

Main configs (9 files in vllm_omni/model_executor/stage_configs/):

| File | Fields removed |
| --- | --- |
| bagel.yaml | gpu_memory_utilization, engine_output_type, enable_prefix_caching, max_num_batched_tokens, tensor_parallel_size |
| bagel_multiconnector.yaml | same 5 |
| bagel_think.yaml | same 5 |
| bagel_single_stage.yaml | same 5 |
| bagel_usp2.yaml | same 5 |
| hunyuan_image_3_moe.yaml | 4 (no top-level tensor_parallel_size) |
| hunyuan_image3_moe_dit.yaml | 4 |
| hunyuan_image3_moe_dit_2gpu_fp8.yaml | 4 |
| omnivoice.yaml | 2 (gpu_memory_utilization, engine_output_type) |

Test configs (2 files in tests/e2e/offline_inference/stage_configs/):

| File | Fields removed |
| --- | --- |
| bagel_mooncake_ci.yaml | same 5 + load_format (OmniDiffusionConfig uses diffusion_load_format) |
| bagel_sharedmemory_ci.yaml | same 5 + load_format |

Regression test (new file)

tests/test_diffusion_config_fields.py scans all YAML configs (main + test dirs) and asserts diffusion stage engine_args only contain valid OmniDiffusionConfig fields.

Allowlisted fields consumed outside the dataclass:

  • model_stage: consumed by the stage init layer
  • model_arch: used for diffusion model class resolution
  • quantization: mapped to quantization_config via the backwards-compat path in from_kwargs()
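
The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual test file: `ToyDiffusionConfig` is a hypothetical stand-in for OmniDiffusionConfig, the `stage_args` top-level key is an assumption about the YAML layout, and the config is passed in as an already-parsed dict rather than loaded from disk.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class ToyDiffusionConfig:
    """Hypothetical stand-in for OmniDiffusionConfig (field names are illustrative)."""

    model: str = ""
    enforce_eager: bool = False


# Fields consumed outside the dataclass, per the allowlist described above.
ALLOWLIST = {"model_stage", "model_arch", "quantization"}


def invalid_diffusion_fields(config: dict[str, Any]) -> dict[int, list[str]]:
    """Map stage index -> unexpected engine_args keys, for diffusion stages only."""
    valid = {f.name for f in fields(ToyDiffusionConfig)} | ALLOWLIST
    problems: dict[int, list[str]] = {}
    # "stage_args" as the top-level key is an assumption about the YAML layout.
    for index, stage in enumerate(config.get("stage_args") or []):
        if stage.get("stage_type") != "diffusion":
            continue  # non-diffusion stages are out of scope for this check
        extra = sorted(set(stage.get("engine_args") or {}) - valid)
        if extra:
            problems[index] = extra
    return problems


parsed = {
    "stage_args": [
        {"stage_type": "llm", "engine_args": {"gpu_memory_utilization": 0.8}},
        {
            "stage_type": "diffusion",
            "engine_args": {"model": "m", "model_arch": "X", "engine_output_type": "image"},
        },
    ]
}
print(invalid_diffusion_fields(parsed))  # {1: ['engine_output_type']}
```

Only the diffusion stage is flagged here; the first stage keeps gpu_memory_utilization because it is not a diffusion stage, and model_arch passes because it is allowlisted.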

Why these fields are safe to remove

OmniDiffusionConfig.from_kwargs() explicitly filters unknown fields:

```python
valid_fields = {f.name for f in fields(cls)}
filtered_kwargs = {k: v for k, v in kwargs.items() if k in valid_fields}
return cls(**filtered_kwargs)
```

No behavioral change: all removed fields were already silently dropped at runtime.
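
To see the silent drop in isolation, here is a small self-contained sketch of the same filtering pattern. `ToyDiffusionConfig` and its two fields are hypothetical stand-ins chosen for illustration; only the three-line filter mirrors the snippet quoted above.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class ToyDiffusionConfig:
    """Hypothetical stand-in for OmniDiffusionConfig (field names are illustrative)."""

    model: str = ""
    enforce_eager: bool = False

    @classmethod
    def from_kwargs(cls, **kwargs: Any) -> "ToyDiffusionConfig":
        # Same filtering pattern as quoted above: keep only declared fields.
        valid_fields = {f.name for f in fields(cls)}
        filtered_kwargs = {k: v for k, v in kwargs.items() if k in valid_fields}
        return cls(**filtered_kwargs)


engine_args = {
    "model": "some/model",
    "enforce_eager": True,
    "gpu_memory_utilization": 0.5,   # not a declared field: silently dropped
    "enable_prefix_caching": False,  # not a declared field: silently dropped
}

config = ToyDiffusionConfig.from_kwargs(**engine_args)
dropped = sorted(set(engine_args) - {f.name for f in fields(ToyDiffusionConfig)})
print(config)
print("dropped:", dropped)
```

No exception or warning is raised for the two extra keys, which is why such fields can linger in a YAML file without anyone noticing.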

Test Plan

  • Regression test passes
  • Existing CI tests pass (no behavioral change)

@ianliuy ianliuy requested a review from hsliuustc0106 as a code owner April 9, 2026 05:28
@ianliuy ianliuy force-pushed the fix/cleanup-diffusion-stage-configs branch 2 times, most recently from de91632 to f567554 Compare April 9, 2026 05:34
@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 9, 2026
@hsliuustc0106
Collaborator

let me help run the ci tests

@hsliuustc0106
Collaborator

I think you also need to make changes to the yamls under tests

@ianliuy ianliuy force-pushed the fix/cleanup-diffusion-stage-configs branch from f567554 to d9ad8a7 Compare April 10, 2026 01:53
@ianliuy
Contributor Author

ianliuy commented Apr 10, 2026

Thanks for running the CI and the feedback @hsliuustc0106!

Updated. Here's what changed:

1. Fixed the CI failure (regression test)

The test was flagging model_arch and quantization as invalid. Both are actually consumed:

  • quantization: mapped to quantization_config via the backwards-compat path in from_kwargs() (data.py L679-682)
  • model_arch: consumed by the stage init layer for model class resolution

Added both to the test's allowlist.

2. Cleaned test YAMLs (14 files, 65 lines)

Removed the same dead fields from diffusion/generation stages in:

| Directory | Files |
| --- | --- |
| tests/e2e/stage_configs/ | dynin_omni_ci, mimo_audio_ci, qwen2_5_omni_ci, qwen3_omni_ci |
| tests/e2e/stage_configs/rocm/ | qwen2_5_omni_ci, qwen3_omni_ci |
| tests/e2e/stage_configs/xpu/ | qwen2_5_omni_ci, qwen3_omni_ci |
| tests/e2e/offline_inference/stage_configs/ | bagel_mooncake_ci, bagel_sharedmemory_ci |
| tests/e2e/offline_inference/stage_configs/npu/ | qwen2_5_omni_ci |
| tests/dfx/perf/stage_configs/ | qwen3_omni, qwen3_tts |
| tests/dfx/stability/stage_configs/ | qwen3_omni |

3. Rebased on latest main

Total: 24 files changed, +56 -104.

@hsliuustc0106
Collaborator

please check whether the ci failure is related to this PR

@ianliuy ianliuy force-pushed the fix/cleanup-diffusion-stage-configs branch from d9ad8a7 to 6470bf4 Compare April 10, 2026 02:43
Remove fields not part of OmniDiffusionConfig from diffusion stages:
- gpu_memory_utilization (9 files)
- enable_prefix_caching (8 files)
- engine_output_type (9 files)
- max_num_batched_tokens (8 files)
- tensor_parallel_size at top-level (5 bagel files)

These fields were copy-pasted from LLM stage configs and silently
dropped by OmniDiffusionConfig.from_kwargs(). Removing them for clarity.

Also adds a regression test to prevent future copy-paste of invalid
fields into diffusion stage configs.

Fixes vllm-project#2563

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com>
@ianliuy ianliuy force-pushed the fix/cleanup-diffusion-stage-configs branch from 6470bf4 to be25a9c Compare April 10, 2026 02:51
@ianliuy
Contributor Author

ianliuy commented Apr 10, 2026

Pushed another update that narrows the scope after investigating more carefully:

Only diffusion-stage configs are cleaned. The test YAMLs with worker_type: generation (qwen3, qwen2.5, dynin, mimo, etc.) use OmniEngineArgs where these fields are actively consumed, so those are left unchanged.

Changes in this push:

  • Removed dead fields from 2 diffusion test YAMLs (bagel_mooncake_ci, bagel_sharedmemory_ci)
  • Also removed load_format: dummy from their diffusion stages (also dead: OmniDiffusionConfig uses diffusion_load_format instead)
  • Extended the regression test to scan tests/**/*.yaml recursively, still only flagging stage_type: diffusion stages

Total: 12 files, +68 -51.
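
As a rough illustration of the recursive scan mentioned in this update, the sketch below uses `pathlib.Path.rglob` to collect YAML files at any depth. The helper name `find_yaml_configs` and the temporary directory layout are hypothetical; the real test may discover files differently.

```python
import tempfile
from pathlib import Path


def find_yaml_configs(root: Path) -> list[Path]:
    """Return every *.yaml file under root, at any depth, in sorted order."""
    return sorted(root.rglob("*.yaml"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout mirroring nested test config directories.
    nested = root / "e2e" / "offline_inference" / "stage_configs"
    nested.mkdir(parents=True)
    (nested / "bagel_mooncake_ci.yaml").write_text("stage_args: []\n")
    (root / "notes.txt").write_text("not a config\n")
    found = [p.relative_to(root).as_posix() for p in find_yaml_configs(root)]

print(found)  # ['e2e/offline_inference/stage_configs/bagel_mooncake_ci.yaml']
```

Discovery stays broad while the per-stage filter on stage_type: diffusion keeps the assertion narrow, so generation and AR configs in the same directories are parsed but never flagged.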

distributed_executor_backend: mp
enable_prefix_caching: false
max_num_batched_tokens: 32768
tensor_parallel_size: 1
Collaborator

Why are we removing tensor_parallel_size? It seems someone just moved where tensor_parallel_size is set, right?

Contributor Author

Good observation! Yes, tensor_parallel_size was effectively "moved" to parallel_config.tensor_parallel_size when DiffusionParallelConfig was introduced in PR #189. Since then, OmniDiffusionConfig no longer has a top-level tensor_parallel_size field, so from_kwargs() silently drops it (data.py L692-694). The value here is also 1, which matches the DiffusionParallelConfig default. Issue #2635 also tracks this inconsistency.
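
The effect of the move can be sketched with two toy dataclasses. `ToyParallelConfig` and `ToyDiffusionConfig` are hypothetical stand-ins for DiffusionParallelConfig and OmniDiffusionConfig; passing the nested config as an object is a simplification, and how the real from_kwargs() builds parallel_config from YAML is not shown here.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ToyParallelConfig:
    """Hypothetical stand-in for DiffusionParallelConfig."""

    tensor_parallel_size: int = 1


@dataclass
class ToyDiffusionConfig:
    """Hypothetical stand-in for OmniDiffusionConfig: no top-level tensor_parallel_size."""

    parallel_config: ToyParallelConfig = field(default_factory=ToyParallelConfig)

    @classmethod
    def from_kwargs(cls, **kwargs: Any) -> "ToyDiffusionConfig":
        valid_fields = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in valid_fields})


# Top-level key: filtered out, so the nested default (1) applies.
ignored = ToyDiffusionConfig.from_kwargs(tensor_parallel_size=4)
# Nested key: actually takes effect.
honored = ToyDiffusionConfig.from_kwargs(
    parallel_config=ToyParallelConfig(tensor_parallel_size=4)
)
print(ignored.parallel_config.tensor_parallel_size)  # 1
print(honored.parallel_config.tensor_parallel_size)  # 4
```

Under these assumptions, a top-level value other than 1 would be misleading rather than harmless, which is the stronger argument for deleting the key instead of leaving it in place.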

distributed_executor_backend: "mp"
enable_prefix_caching: false
max_num_batched_tokens: 32768
tensor_parallel_size: 1
Collaborator

Same here

Contributor Author

Same reasoning as above.

Collaborator

If future users add new special fields, how should we maintain this UT, considering these new fields might only apply to a specific model?

Contributor Author

Good question. Two cases:

  • If the new field is added to the OmniDiffusionConfig dataclass, the test picks it up automatically via fields(OmniDiffusionConfig); no changes are needed.
  • If it's consumed outside the dataclass (like model_stage by the stage init layer, or quantization via from_kwargs() backwards-compat), add it to the allowlist in this test.

I'll add a comment in the test to document this, but before I do, I'd love to hear your thoughts on whether this approach works for you.
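
The two maintenance cases can be illustrated with a short sketch. The class names, the `new_special_field` field, and the `valid_engine_args` helper are all hypothetical; only the allowlist entries come from this PR.

```python
from dataclasses import dataclass, fields


# Fields consumed outside the dataclass; extended by hand (second case).
ALLOWLIST = {"model_stage", "model_arch", "quantization"}


def valid_engine_args(config_cls: type) -> set[str]:
    """Declared dataclass fields plus the hand-maintained allowlist."""
    return {f.name for f in fields(config_cls)} | ALLOWLIST


@dataclass
class ConfigBefore:
    model: str = ""


@dataclass
class ConfigAfter(ConfigBefore):
    # Hypothetical new model-specific field added to the dataclass (first case).
    new_special_field: int = 0


print("new_special_field" in valid_engine_args(ConfigBefore))  # False
print("new_special_field" in valid_engine_args(ConfigAfter))   # True
print("quantization" in valid_engine_args(ConfigBefore))       # True
```

Because the valid set is derived from the dataclass at test time, the only manual upkeep is the allowlist, which should stay small if it is reserved for fields handled outside the config class.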

gpu_memory_utilization: 0.5
enforce_eager: true
trust_remote_code: true
engine_output_type: audio
Collaborator

I think we should keep it; correct me if I am wrong.

Contributor Author

These are safe to remove: this is a stage_type: diffusion stage, and OmniDiffusionConfig has neither field. from_kwargs() silently drops them (data.py L692-694). For engine_output_type specifically, extract_stage_metadata() also hardcodes it to None for all diffusion stages (stage_init_utils.py L171). Audio routing is handled by the stage-level final_output_type: audio (which is preserved) and the SupportAudioOutput interface on OmniVoicePipeline.

Collaborator

@lishunyang12 lishunyang12 left a comment

LGTM

Collaborator

@lishunyang12 lishunyang12 left a comment

LGTM, verified the dead-field claims against data.py / stage_init_utils.py. cc @princepride for re-review.

# model_arch is consumed by the stage init layer for diffusion model class resolution
valid_fields.add("model_arch")
# "quantization" is mapped to "quantization_config" by from_kwargs() backwards-compat
valid_fields.add("quantization")
Collaborator

nit: worth a short block comment above the allowlist explaining the maintenance policy — entries here are fields consumed outside OmniDiffusionConfig (e.g. by extract_stage_metadata or the backwards-compat path in from_kwargs). Saves the next maintainer a git blame.

@hsliuustc0106 hsliuustc0106 merged commit 687405c into vllm-project:main Apr 10, 2026
8 checks passed
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request Apr 13, 2026
…figs (vllm-project#2622)

Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com>
Co-authored-by: Yiyang Liu <yiyangliu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…figs (vllm-project#2622)

Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com>
Co-authored-by: Yiyang Liu <yiyangliu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…figs (vllm-project#2622)

Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com>
Co-authored-by: Yiyang Liu <yiyangliu@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Labels

ready label to trigger buildkite CI

4 participants