[Config] Remove invalid LLM-only engine_args from diffusion stage configs#2622
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
de91632 to
f567554
Compare
|
let me help run the ci tests |
|
I think you also need to make changes to the yamls under tests |
f567554 to
d9ad8a7
Compare
|
Thanks for running the CI and the feedback @hsliuustc0106! Updated here's what changed: 1. Fixed the CI failure (regression test) The test was flagging
Added both to the test's allowlist. 2. Cleaned test YAMLs (14 files, 65 lines) Removed the same dead fields from diffusion/generation stages in:
3. Rebased on latest main Total: 24 files changed, +56 104. |
|
please check whether the ci failure is related to this PR |
d9ad8a7 to
6470bf4
Compare
Remove fields not part of OmniDiffusionConfig from diffusion stages: - gpu_memory_utilization (9 files) - enable_prefix_caching (8 files) - engine_output_type (9 files) - max_num_batched_tokens (8 files) - tensor_parallel_size at top-level (5 bagel files) These fields were copy-pasted from LLM stage configs and silently dropped by OmniDiffusionConfig.from_kwargs(). Removing them for clarity. Also adds a regression test to prevent future copy-paste of invalid fields into diffusion stage configs. Fixes vllm-project#2563 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com>
6470bf4 to
be25a9c
Compare
|
Pushed another update narrowed the scope after investigating more carefully: Only diffusion-stage configs are cleaned. The test YAMLs with Changes in this push:
Total: 12 files, +68 51. |
| distributed_executor_backend: mp | ||
| enable_prefix_caching: false | ||
| max_num_batched_tokens: 32768 | ||
| tensor_parallel_size: 1 |
There was a problem hiding this comment.
Why we remove tensor_parallel_size? Seems someone just edit tensor_parallel_size place, right?
There was a problem hiding this comment.
Good observation! Yes tensor_parallel_size was effectively "moved" to parallel_config.tensor_parallel_size when DiffusionParallelConfig was introduced in PR #189. Since then, OmniDiffusionConfig no longer has a top-level tensor_parallel_size field, so from_kwargs() silently drops it (data.py L692-694). The value here is also 1, which matches the DiffusionParallelConfig default. Issue #2635 also tracks this inconsistency.
| distributed_executor_backend: "mp" | ||
| enable_prefix_caching: false | ||
| max_num_batched_tokens: 32768 | ||
| tensor_parallel_size: 1 |
There was a problem hiding this comment.
Same reasoning as above.
There was a problem hiding this comment.
If future users add new special fields, how should we maintain this UT, considering these new fields might only apply to a specific model?
There was a problem hiding this comment.
Good question. Two cases:
- If the new field is added to the
OmniDiffusionConfigdataclass, the test picks it up automatically viafields(OmniDiffusionConfig)no changes needed. - If it's consumed outside the dataclass (like
model_stageby the stage init layer, orquantizationviafrom_kwargs()backwards-compat), add it to the allowlist in this test.
I'll add a comment in the test to document this but before I do, I'd love to hear your thoughts on whether this approach works for you.
| gpu_memory_utilization: 0.5 | ||
| enforce_eager: true | ||
| trust_remote_code: true | ||
| engine_output_type: audio |
There was a problem hiding this comment.
I think we should maintain it, correct me if I am wrong
There was a problem hiding this comment.
These are safe to remove this is a stage_type: diffusion stage, and OmniDiffusionConfig has neither field. from_kwargs() silently drops them (data.py L692-694). For engine_output_type specifically, extract_stage_metadata() also hardcodes it to None for all diffusion stages (stage_init_utils.py L171). Audio routing is handled by the stage-level final_output_type: audio (which is preserved) and the SupportAudioOutput interface on OmniVoicePipeline.
lishunyang12
left a comment
There was a problem hiding this comment.
LGTM, verified the dead-field claims against data.py / stage_init_utils.py. cc @princepride for re-review.
| # model_arch is consumed by the stage init layer for diffusion model class resolution | ||
| valid_fields.add("model_arch") | ||
| # "quantization" is mapped to "quantization_config" by from_kwargs() backwards-compat | ||
| valid_fields.add("quantization") |
There was a problem hiding this comment.
nit: worth a short block comment above the allowlist explaining the maintenance policy — entries here are fields consumed outside OmniDiffusionConfig (e.g. by extract_stage_metadata or the backwards-compat path in from_kwargs). Saves the next maintainer a git blame.
…figs (vllm-project#2622) Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com> Co-authored-by: Yiyang Liu <yiyangliu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…figs (vllm-project#2622) Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com> Co-authored-by: Yiyang Liu <yiyangliu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…figs (vllm-project#2622) Signed-off-by: Yiyang Liu <yiyangliu@microsoft.com> Co-authored-by: Yiyang Liu <yiyangliu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Purpose
Fix for: #2563
Remove dead
engine_argsfields from diffusion stage configs (stage_type: diffusion). These fields were copy-pasted from LLM stage configs and are silently dropped byOmniDiffusionConfig.from_kwargs().Note: Generation/AR stages (
worker_type: generation,worker_type: ar) useOmniEngineArgswhere these fields are actively consumed they are intentionally left unchanged.Changes
Diffusion config cleanup (11 YAMLs, 51 lines)
Main configs (9 files in
vllm_omni/model_executor/stage_configs/):bagel.yamlgpu_memory_utilization,engine_output_type,enable_prefix_caching,max_num_batched_tokens,tensor_parallel_sizebagel_multiconnector.yamlbagel_think.yamlbagel_single_stage.yamlbagel_usp2.yamlhunyuan_image_3_moe.yamltensor_parallel_size)hunyuan_image3_moe_dit.yamlhunyuan_image3_moe_dit_2gpu_fp8.yamlomnivoice.yamlgpu_memory_utilization,engine_output_type)Test configs (2 files in
tests/e2e/offline_inference/stage_configs/):bagel_mooncake_ci.yamlload_format(OmniDiffusionConfigusesdiffusion_load_format)bagel_sharedmemory_ci.yamlload_formatRegression test (new file)
tests/test_diffusion_config_fields.pyscans all YAML configs (main + test dirs) and asserts diffusion stageengine_argsonly contain validOmniDiffusionConfigfields.Allowlisted fields consumed outside the dataclass:
model_stagestage init layermodel_archdiffusion model class resolutionquantizationmapped toquantization_configvia backwards-compat infrom_kwargs()Why these fields are safe to remove
OmniDiffusionConfig.from_kwargs()explicitly filters unknown fields:\\python
valid_fields = {f.name for f in fields(cls)}
filtered_kwargs = {k: v for k, v in kwargs.items() if k in valid_fields}
return cls(**filtered_kwargs)
\\
No behavioral change all removed fields were already silently dropped at runtime.
Test Plan