[Bugfix] Fix default diffusion stage config generator drops runtime engine args#2559
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Signed-off-by: xiaohajiayou <923390377@qq.com>
942ecd7 to
da80d5d
Compare
lishunyang12
left a comment
There was a problem hiding this comment.
Review: [Bugfix] Fix default diffusion stage config generator drops runtime engine args
Thanks for the PR and the clear description of the problem. The fix for trust_remote_code is correct and necessary -- it was genuinely missing from the dict literal. However, I have concerns about the other four fields.
Issues
1. boundary_ratio and flow_shift are already propagated (redundant overwrites)
Lines 1216-1217 of the existing code already include these in the stage_engine_args dict:
"boundary_ratio": kwargs.get("boundary_ratio", None),
"flow_shift": kwargs.get("flow_shift", None),The conditional blocks added after the dict literal will overwrite these keys with the exact same values. This is dead code. If the intent was to avoid setting them when they are None, note that OmniDiffusionConfig.from_kwargs already filters kwargs to valid dataclass fields and constructs the config -- having None values for optional fields is the normal path and matches the dataclass defaults.
Please either:
- Remove lines 1216-1217 from the dict literal and keep only the conditional blocks (if you want to avoid passing
Noneexplicitly), or - Remove the conditional blocks for these two fields (since they are already handled).
2. num_gpus is overwritten downstream and the addition has no effect
In stage_init_utils.py line 536:
od_config.num_gpus = num_devices_per_stageThis unconditionally overwrites num_gpus after OmniDiffusionConfig.from_kwargs(), deriving it from parallel_config.world_size. So even if you inject num_gpus into stage_engine_args, it will be overridden. Adding it here gives a false sense that the CLI value is being respected, when in reality it is not. This should either be removed, or the downstream override should be fixed if the intent is to let users control num_gpus directly.
3. distributed_executor_backend -- the conditional style is inconsistent
distributed_executor_backend is a legitimate fix: it was missing from the dict literal. But the conditional if key in kwargs and kwargs[key] is not None pattern is inconsistent with how every other field is handled in this function (using kwargs.get(key, default) inside the dict literal). Using kwargs.get("distributed_executor_backend", "mp") inline would be simpler and consistent -- the dataclass default is "mp", so that aligns.
Suggestion
A cleaner approach would be to add trust_remote_code and distributed_executor_backend directly in the dict literal (like all the other fields), remove the redundant conditional blocks for boundary_ratio/flow_shift, and drop num_gpus since it is overridden downstream. Something like:
stage_engine_args = {
...
"trust_remote_code": kwargs.get("trust_remote_code", False),
"distributed_executor_backend": kwargs.get("distributed_executor_backend", "mp"),
...
}No conditional blocks needed.
Test
The test is well-written and covers the right fields. It will need minor adjustment once the redundant parts are removed.
Overall this is a real bug fix for trust_remote_code and distributed_executor_backend, but needs cleanup to avoid redundancy and misleading num_gpus propagation.
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
As #2539 #2544, and discussed in #2076, when loading a model without a default stage config and no stage config YAML is explicitly provided via CLI arguments, the current
AsyncOmniEngineconstructsself.stage_configsthroughself._create_default_diffusion_stage_cfg. The resulting config is then passed into:which is later consumed by the diffusion components, such as
StageDiffusionClient.However, the current implementation of
self._create_default_diffusion_stage_cfgdoes not fully propagate CLI arguments into the constructed config.The affected arguments include:
distributed_executor_backendvllm-omni/vllm_omni/diffusion/executor/abstract.py
Lines 22 to 24 in 408365f
boundary_ratiovllm-omni/vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py
Line 236 in 408365f
flow_shiftvllm-omni/vllm_omni/diffusion/models/wan2_2/pipeline_wan2_2.py
Line 300 in 408365f
trust_remote_codevllm-omni/vllm_omni/config/stage_config.py
Lines 270 to 271 in 408365f
num_gpusvllm-omni/vllm_omni/diffusion/executor/multiproc_executor.py
Lines 67 to 70 in 408365f
Although these fields have default values defined in
OmniDiffusionConfig, the CLI-provided values are not injected intood_config, resulting in the user-specified arguments being ignored and default values always being used.Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)