
[Bugfix] Fix default diffusion stage config generator drops runtime engine args#2559

Open
xiaohajiayou wants to merge 1 commit into vllm-project:main from xiaohajiayou:fix/diffusion-default-factory-engine-args

Conversation

Contributor

@xiaohajiayou xiaohajiayou commented Apr 7, 2026


Purpose

As reported in #2539 and #2544, and discussed in #2076, when a model is loaded without a default stage config and no stage config YAML is explicitly provided via CLI arguments, the current AsyncOmniEngine constructs self.stage_configs through self._create_default_diffusion_stage_cfg. The resulting config is then passed into:

od_config = OmniDiffusionConfig.from_kwargs(
    model=model,
    **_to_dict(stage_cfg.engine_args),
)

which is later consumed by the diffusion components, such as StageDiffusionClient.

However, the current implementation of self._create_default_diffusion_stage_cfg does not fully propagate CLI arguments into the constructed config.
The affected arguments include:

  • trust_remote_code
  • distributed_executor_backend
  • boundary_ratio
  • flow_shift
  • num_gpus

Although these fields have default values defined in OmniDiffusionConfig, the CLI-provided values are not injected into od_config, so the user-specified arguments are ignored and the defaults are always used.
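To make the failure mode concrete, here is a minimal, self-contained sketch of the bug class described above. The names (`EngineConfig`, `create_default_stage_cfg`) are hypothetical stand-ins for `OmniDiffusionConfig` and `_create_default_diffusion_stage_cfg`; the point is that a factory building its engine-args dict from an explicit literal silently drops any CLI kwarg it forgets to list:

```python
# Hypothetical stand-ins, not the real vllm-omni classes.
from dataclasses import dataclass, fields


@dataclass
class EngineConfig:  # stand-in for OmniDiffusionConfig
    model: str = ""
    trust_remote_code: bool = False  # dataclass default, used if not injected

    @classmethod
    def from_kwargs(cls, **kwargs):
        # Mirrors OmniDiffusionConfig.from_kwargs: keep only valid fields.
        valid = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in kwargs.items() if k in valid})


def create_default_stage_cfg(**cli_kwargs):
    # Bug: the dict literal only forwards the keys it explicitly names,
    # so trust_remote_code from the CLI is never copied in.
    return {"model": cli_kwargs.get("model", "")}


cfg = EngineConfig.from_kwargs(
    **create_default_stage_cfg(model="wan", trust_remote_code=True)
)
print(cfg.trust_remote_code)  # False: the CLI-provided True was dropped
```

The fix in this PR adds the missing keys to the factory so the CLI values reach the constructed config.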

Test Plan

python -m pytest tests/entrypoints/test_async_omni_diffusion_config.py

Test Result

(vllm-omni) root@autodl-container-2201459dc8-f8f44d5f:~/vllm-omni# python -m pytest  tests/entrypoints/test_async_omni_diffusion_config.py
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /root/vllm-omni
configfile: pyproject.toml
plugins: anyio-4.13.0, asyncio-1.3.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 6 items                                                                                                                                                                                                                                                       

tests/entrypoints/test_async_omni_diffusion_config.py ......                                                                                                                                                                                                      [100%]

======================== 6 passed, 19 warnings in 0.70s ========================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands, or state the reasons if your code does not require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Signed-off-by: xiaohajiayou <923390377@qq.com>
@xiaohajiayou xiaohajiayou force-pushed the fix/diffusion-default-factory-engine-args branch from 942ecd7 to da80d5d on April 7, 2026 at 14:34
@xiaohajiayou xiaohajiayou changed the title [Bugfix] Propagate diffusion fallback engine args [Bugfix] Fix default diffusion stage config generator drops runtime engine args Apr 7, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment


Review: [Bugfix] Fix default diffusion stage config generator drops runtime engine args

Thanks for the PR and the clear description of the problem. The fix for trust_remote_code is correct and necessary -- it was genuinely missing from the dict literal. However, I have concerns about the other four fields.

Issues

1. boundary_ratio and flow_shift are already propagated (redundant overwrites)

Lines 1216-1217 of the existing code already include these in the stage_engine_args dict:

"boundary_ratio": kwargs.get("boundary_ratio", None),
"flow_shift": kwargs.get("flow_shift", None),

The conditional blocks added after the dict literal will overwrite these keys with the exact same values. This is dead code. If the intent was to avoid setting them when they are None, note that OmniDiffusionConfig.from_kwargs already filters kwargs to valid dataclass fields and constructs the config -- having None values for optional fields is the normal path and matches the dataclass defaults.

Please either:

  • Remove lines 1216-1217 from the dict literal and keep only the conditional blocks (if you want to avoid passing None explicitly), or
  • Remove the conditional blocks for these two fields (since they are already handled).
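The equivalence claimed above can be checked with a small sketch. `Cfg` is a hypothetical stand-in for `OmniDiffusionConfig`, assuming (as the review states) that the dataclass defaults for these optional fields are `None` and that `from_kwargs` filters to valid fields:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Cfg:  # stand-in; assumes the dataclass default is None
    boundary_ratio: Optional[float] = None


def from_kwargs(**kwargs):
    # Mirrors the filtering behavior of OmniDiffusionConfig.from_kwargs.
    valid = {f.name for f in fields(Cfg)}
    return Cfg(**{k: v for k, v in kwargs.items() if k in valid})


# Path A: the dict literal always passes the key, possibly as None.
a = from_kwargs(boundary_ratio=None)
# Path B: a conditional block omits the key when the value is None.
b = from_kwargs()
print(a == b)  # True: the conditional block changes nothing
```

Since both paths yield identical configs, the added conditional blocks for these two fields are dead code.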

2. num_gpus is overwritten downstream and the addition has no effect

In stage_init_utils.py line 536:

od_config.num_gpus = num_devices_per_stage

This unconditionally overwrites num_gpus after OmniDiffusionConfig.from_kwargs(), deriving it from parallel_config.world_size. So even if you inject num_gpus into stage_engine_args, it will be overridden. Adding it here gives a false sense that the CLI value is being respected, when in reality it is not. This should either be removed, or the downstream override should be fixed if the intent is to let users control num_gpus directly.
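A sketch of why the upstream injection is a no-op (class and variable names are hypothetical; only the unconditional assignment from stage_init_utils.py is taken from the source):

```python
# Hypothetical minimal model of the downstream overwrite.
class ODConfig:  # stand-in for OmniDiffusionConfig
    def __init__(self, num_gpus=1):
        self.num_gpus = num_gpus


num_devices_per_stage = 4  # derived from parallel_config.world_size

od_config = ODConfig(num_gpus=8)            # CLI value injected upstream
od_config.num_gpus = num_devices_per_stage  # unconditional override downstream
print(od_config.num_gpus)  # 4: the injected CLI value is lost
```

Whatever `num_gpus` the default factory injects, the reader observes only the derived value, which is why the injection should be dropped or the downstream override reconsidered.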

3. distributed_executor_backend -- the conditional style is inconsistent

distributed_executor_backend is a legitimate fix: it was missing from the dict literal. But the conditional if key in kwargs and kwargs[key] is not None pattern is inconsistent with how every other field is handled in this function (using kwargs.get(key, default) inside the dict literal). Using kwargs.get("distributed_executor_backend", "mp") inline would be simpler and consistent -- the dataclass default is "mp", so that aligns.

Suggestion

A cleaner approach would be to add trust_remote_code and distributed_executor_backend directly in the dict literal (like all the other fields), remove the redundant conditional blocks for boundary_ratio/flow_shift, and drop num_gpus since it is overridden downstream. Something like:

stage_engine_args = {
    ...
    "trust_remote_code": kwargs.get("trust_remote_code", False),
    "distributed_executor_backend": kwargs.get("distributed_executor_backend", "mp"),
    ...
}

No conditional blocks needed.

Test

The test is well-written and covers the right fields. It will need minor adjustment once the redundant parts are removed.

Overall this is a real bug fix for trust_remote_code and distributed_executor_backend, but needs cleanup to avoid redundancy and misleading num_gpus propagation.
