[Refactor] Replace args whitelist with explicit CLI key detection#3160

Closed
gcanlin wants to merge 3 commits into vllm-project:main from gcanlin:args-clean

Collaborator

@gcanlin gcanlin commented Apr 26, 2026

Summary

  • Remove nullify_stage_engine_defaults and its hardcoded whitelist — the old design
    required manually maintaining _DEPLOY_ENGINE_ARG_OVERRIDE_FIELDS for every vLLM parameter
    that should be YAML-configurable. Unlisted fields (like profiler_config) silently broke.
  • Introduce a detect_explicit_cli_keys mechanism that parses sys.argv to determine which
    flags the user actually typed, so that only those become overrides and argparse
    defaults never shadow deploy YAML values.
  • Add dict→ProfilerConfig conversion in OmniEngineArgs.__post_init__ so that
    profiler_config: {profiler: torch, ...} from YAML is correctly deserialized.
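
A minimal sketch of how explicit-key detection could work. This is not the PR's actual implementation: it assumes flags are spelled in full (no argparse prefix abbreviations), it relies on `parser._option_string_actions`, which is a private CPython attribute, and the three flags are illustrative only.

```python
import argparse


def detect_explicit_cli_keys(argv, parser):
    """Return the argparse dest names whose flags literally appear in argv.

    Sketch only: uses the private ``_option_string_actions`` mapping and
    does not resolve abbreviated flags such as ``--max-num``.
    """
    option_actions = parser._option_string_actions  # private CPython attribute
    explicit = set()
    for token in argv:
        if token == "--":  # everything after "--" is positional
            break
        if not token.startswith("-"):
            continue
        flag = token.split("=", 1)[0]  # handle the "--flag=value" form
        action = option_actions.get(flag)
        if action is not None:
            explicit.add(action.dest)
    return explicit


parser = argparse.ArgumentParser()
parser.add_argument("--max-num-seqs", type=int, default=256)
parser.add_argument("--dtype", default="auto")
parser.add_argument("--seed", type=int, default=0)

argv = ["--max-num-seqs", "8", "--dtype=bfloat16"]
explicit_keys = detect_explicit_cli_keys(argv, parser)
```

Here `explicit_keys` contains `max_num_seqs` and `dtype` but not `seed`, even though `parse_args` would report a value for all three.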

Config merge chain (after this PR)

                    ┌─────────────────────┐
                    │   sys.argv[1:]      │
                    └────────┬────────────┘
                             │
                  detect_explicit_cli_keys(argv, parser)
                             │
                             ▼
                    ┌─────────────────────┐
                    │ _explicit_cli_keys  │  ← set of dest names user actually typed
                    │  e.g. {"max_num_seqs", "dtype"}
                    └────────┬────────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │ serve.py         │ api_server.py     │ omni_base.py
          │ (vllm omni serve)│ (__main__)        │ (from_cli_args)
          └──────┬───────────┴──────┬────────────┴──────┐
                 │                  │                    │
                 ▼                  ▼                    ▼
          omni_run_server    omni_run_server       OmniBase.__init__
          (hasattr guard)    (fallback compute)         │
                 │                  │                    │
                 └──────────┬───────┘                    │
                            ▼                            ▼
                  build_async_omni_from_stage_config     │
                  kwargs.setdefault("_explicit_cli_keys")│
                            │                            │
                            └────────────┬───────────────┘
                                         ▼
                              AsyncOmniEngine.__init__
                                         │
                                         ▼
                              _resolve_stage_configs
                              (keeps _explicit_cli_keys in kwargs)
                                         │
                                         ▼
                           load_stage_configs_from_model
                                         │
                    ┌────────────────────┼────────────────────┐
                    │                    │                    │
                    ▼                    ▼                    ▼
             pop _explicit_cli_keys   cli_overrides      deploy YAML
                    │              = all argparse args    (per-stage)
                    │                    │                    │
                    ▼                    ▼                    │
              filter cli_overrides:     │                    │
              keep only keys in         │                    │
              _explicit_cli_keys        │                    │
                    │                    │                    │
                    └────────┬───────────┘                    │
                             ▼                               │
                     filtered overrides                      │
                     (only user-typed)                       │
                             │                               │
                             └───────────┬───────────────────┘
                                         ▼
                              build_stage_runtime_overrides
                              YAML ← merge ← filtered CLI overrides
                                         │
                                         ▼
                              OmniEngineArgs(**merged)
                              __post_init__: dict → ProfilerConfig
                                         │
                                         ▼
                              create_engine_config() → VllmConfig → Worker
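
The last three steps of the chain (filter, merge, deserialize) can be sketched as follows. `StageEngineArgs` and this `ProfilerConfig` are simplified stand-ins for `OmniEngineArgs` and vLLM's `ProfilerConfig`, and `merge_stage_args` is a hypothetical helper, not a function from the PR.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ProfilerConfig:  # stand-in; the real class has more fields
    profiler: str = "torch"


@dataclass
class StageEngineArgs:  # stand-in for OmniEngineArgs
    dtype: str = "auto"
    max_num_seqs: int = 256
    profiler_config: Any = None

    def __post_init__(self):
        # YAML yields a plain dict; convert it to the typed config object.
        if isinstance(self.profiler_config, dict):
            self.profiler_config = ProfilerConfig(**self.profiler_config)


def merge_stage_args(yaml_stage, cli_args, explicit_keys):
    """Overlay only the user-typed CLI values on top of the deploy YAML."""
    filtered = {k: v for k, v in cli_args.items() if k in explicit_keys}
    return {**yaml_stage, **filtered}


yaml_stage = {
    "dtype": "bfloat16",
    "max_num_seqs": 4,
    "profiler_config": {"profiler": "torch"},
}
cli_args = {"dtype": "auto", "max_num_seqs": 8}  # "auto" is an argparse default

merged = merge_stage_args(yaml_stage, cli_args, explicit_keys={"max_num_seqs"})
engine_args = StageEngineArgs(**merged)
```

With these inputs the YAML `dtype` survives because `dtype` was not typed on the command line, while `max_num_seqs` takes the CLI value.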

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin requested a review from hsliuustc0106 as a code owner April 26, 2026 14:38
Collaborator Author

gcanlin commented Apr 26, 2026

@amy-why-3459 Do you find that profiler_config can't be passed by deploy config? This PR should fix it.

@gcanlin gcanlin requested a review from lishunyang12 April 26, 2026 14:42
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 95ae0221a0

Comment on lines 541 to 544
    args_dict = vars(args).copy()
    args_dict.pop("_cli_explicit_keys", None)
    config_path, stage_configs = load_and_resolve_stage_configs(
        model,
        args_dict.get("stage_configs_path"),
[P1] Propagate explicit CLI keys in headless mode

The headless path now passes raw vars(args) into load_and_resolve_stage_configs without attaching _explicit_cli_keys, so load_stage_configs_from_model() does not run the new explicit-key filter and treats all non-None argparse defaults as real CLI overrides. In headless runs this silently overwrites deploy YAML values (for example dtype, parallelism, batching knobs) even when the user did not provide those flags, which is the exact precedence regression this refactor is meant to prevent.
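
The regression described here can be reproduced in miniature. `resolve_overrides` is a hypothetical stand-in for the filtering step inside `load_stage_configs_from_model`; the assumption is that the filter runs only when `_explicit_cli_keys` is present in the kwargs.

```python
def resolve_overrides(kwargs):
    """Return the CLI overrides to apply on top of the deploy YAML."""
    kwargs = dict(kwargs)
    explicit = kwargs.pop("_explicit_cli_keys", None)
    overrides = {k: v for k, v in kwargs.items() if v is not None}
    if explicit is not None:
        # The filter only runs when the caller attached the explicit-key set.
        overrides = {k: v for k, v in overrides.items() if k in explicit}
    return overrides


# The user typed no flags, so every value below is an argparse default.
parsed = {"dtype": "auto", "max_num_seqs": 256, "seed": None}
yaml_stage = {"dtype": "bfloat16", "max_num_seqs": 4}

# Headless path: raw vars(args) without the key set, so defaults win.
headless = {**yaml_stage, **resolve_overrides(parsed)}

# Server path: an (empty) explicit-key set is attached, so the YAML wins.
served = {**yaml_stage, **resolve_overrides({**parsed, "_explicit_cli_keys": set()})}
```

In this sketch `headless` ends up with `dtype="auto"` and `max_num_seqs=256`, whereas `served` keeps the YAML values.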

Comment on lines 477 to +487
 def deploy_override_field_names() -> frozenset[str]:
-    """Return kwargs whose parser defaults must not override deploy YAML."""
-    return _DEPLOY_ENGINE_ARG_OVERRIDE_FIELDS | _DEPLOY_RUNTIME_OVERRIDE_FIELDS
+    """Return kwargs whose parser defaults must not override deploy YAML.
+
+    Dynamically computed from all ``OmniEngineArgs`` fields so that newly
+    added engine args (e.g. ``profiler_config``) are automatically covered
+    without maintaining a manual whitelist.
+    """
+    engine_fields = frozenset(f.name for f in fields(OmniEngineArgs))
+    # Orchestrator-only keys are handled separately; shared keys need to
+    # flow to both orchestrator and engine, so exclude them here.
+    return (engine_fields - orchestrator_field_names()) | _DEPLOY_RUNTIME_OVERRIDE_FIELDS
[P2] Keep backward-compatible nullify helper export

This change deletes nullify_stage_engine_defaults from vllm_omni.engine.arg_utils without migrating remaining callers, but multiple shipped offline examples and tests still import that symbol (some at module import time), so those entrypoints now fail immediately with import-time errors. If this helper is being retired, the same commit should either update all in-repo call sites or provide a compatibility shim.

    engine_fields = frozenset(f.name for f in fields(OmniEngineArgs))
    # Orchestrator-only keys are handled separately; shared keys need to
    # flow to both orchestrator and engine, so exclude them here.
    return (engine_fields - orchestrator_field_names()) | _DEPLOY_RUNTIME_OVERRIDE_FIELDS
Collaborator

Tbh this scope feels too large for what was flagged on #3078. Why not just add profiler_config to _DEPLOY_ENGINE_ARG_OVERRIDE_FIELDS and a dict→ProfilerConfig conversion in OmniEngineArgs.__post_init__? Removing nullify_stage_engine_defaults also breaks ~12 example imports (examples/offline_inference/qwen3_omni/end2end.py:24, etc.) and 3 tests in tests/test_arg_utils.py.

Collaborator

Or if you want to drop the manual list entirely, invert the direction: a small _DIFFUSION_REQUIRED_PARSER_DEFAULTS denylist (the few fields pure-diffusion needs argparse defaults for) and nullify engine_fields - orchestrator_fields - denylist. Same size as today's whitelist but adding a new engine field auto-gets deploy-override semantics — only diffusion-only required fields need listing.
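
A rough sketch of the inverted (denylist) direction suggested above. `EngineArgs` is a stand-in for `OmniEngineArgs`, and the members of `ORCHESTRATOR_FIELDS` and `_DIFFUSION_REQUIRED_PARSER_DEFAULTS` (including `num_inference_steps`) are illustrative assumptions, not the project's real lists.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class EngineArgs:  # stand-in for OmniEngineArgs
    dtype: str = "auto"
    max_num_seqs: int = 256
    profiler_config: Any = None
    stage_configs_path: str = ""    # orchestrator-only in this sketch
    num_inference_steps: int = 50   # hypothetical diffusion-only field


ORCHESTRATOR_FIELDS = frozenset({"stage_configs_path"})
# Fields that pure-diffusion runs still need argparse defaults for.
_DIFFUSION_REQUIRED_PARSER_DEFAULTS = frozenset({"num_inference_steps"})


def nullable_field_names():
    """Every engine field gets deploy-override semantics unless denylisted."""
    engine_fields = frozenset(f.name for f in fields(EngineArgs))
    return engine_fields - ORCHESTRATOR_FIELDS - _DIFFUSION_REQUIRED_PARSER_DEFAULTS


names = nullable_field_names()
```

Adding a new field to the dataclass would place it in `names` automatically; only fields that must keep a parser default need a denylist entry.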

@@ -542,7 +539,6 @@ def run_headless(args: argparse.Namespace) -> None:
        raise ValueError("headless mode requires worker_backend=multi_process")

    args_dict = vars(args).copy()
Collaborator

Even keeping this approach, _explicit_cli_keys is never set on args for the headless path — omni_run_server sets it but run_headless doesn't, so load_stage_configs_from_model skips the filter and argparse defaults still override deploy YAML.

@@ -496,8 +475,16 @@ def orchestrator_field_names() -> frozenset[str]:


def deploy_override_field_names() -> frozenset[str]:
Collaborator

Seems unused after this PR — nullify_stage_engine_defaults and from_cli_args were the only callers. _DEPLOY_RUNTIME_OVERRIDE_FIELDS too.

@xiaohajiayou
Contributor

I think there is still a subtle difference here.

As discussed in #3078, the requirement is not only about CLI/argv-based paths. Users can provide explicit overrides through at least three entry paths:

  1. Omni(..., engine_args=...)
  2. parser / CLI-based paths
  3. direct Omni(**kwargs) paths

The explicit CLI-key approach works well for parser / argv-based paths, but it cannot distinguish user intent for the other two paths, because those values do not necessarily come from sys.argv.

That is why the nullification approach was introduced: before parse_args(), we set defaults for deploy-overridable fields to None. After parsing, any non-None value for those fields means it was explicitly provided by the user, regardless of whether the caller later constructs Omni via from_cli_args, direct Omni(**kwargs), or another kwargs-based path.

This gives the same precedence behavior across the three input scenarios:

explicit user override > deploy YAML > dataclass/default value
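
As an illustration of that precedence rule, here is a small sketch of the nullification idea. The flag names, `DATACLASS_DEFAULTS`, `DEPLOY_YAML`, and `resolve` are hypothetical; only the technique (reset parser defaults to None, then treat non-None as explicit) comes from the description above.

```python
import argparse

DATACLASS_DEFAULTS = {"dtype": "auto", "max_num_seqs": 256}
DEPLOY_YAML = {"dtype": "bfloat16", "max_num_seqs": 4}


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dtype", default="auto")
    parser.add_argument("--max-num-seqs", type=int, default=256)
    # Nullify before parse_args(): an untouched flag now parses to None.
    parser.set_defaults(dtype=None, max_num_seqs=None)
    return parser


def resolve(user_kwargs):
    """explicit user override > deploy YAML > dataclass default."""
    explicit = {k: v for k, v in user_kwargs.items() if v is not None}
    return {**DATACLASS_DEFAULTS, **DEPLOY_YAML, **explicit}


# Parser / CLI path: only --max-num-seqs was typed.
cli_kwargs = vars(build_parser().parse_args(["--max-num-seqs", "8"]))
from_cli = resolve(cli_kwargs)

# Direct kwargs path: no argv involved, yet the same rule applies.
from_kwargs = resolve({"dtype": "float16"})
```

Both paths go through the same `resolve` step, which is the consistency argument made above: no inspection of `sys.argv` is needed to tell an explicit value from a default.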

If the main concern of this PR is removing the duplicated hardcoded allowlist from arg_utils.py, #3162 addresses that by deriving the nullified field set from the already-maintained deploy schema in stage_config.py (DeployConfig, StageDeployConfig, and special deploy/runtime fields such as async_chunk and devices).

Could you take a look at #3162 and see whether that direction resolves the maintainability concern while preserving consistent override semantics for non-argv construction paths?

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Cross-PR Config Refactor Review

This is one of several coordinated config refactor PRs. Reviewed together with #3162, #3154, #3144, #3128, #3120, #3139.

What this PR does

  • Replaces the hardcoded engine arg override whitelist with dynamic field detection from OmniEngineArgs -- new engine args are automatically covered.
  • Removes nullify_stage_engine_defaults from production entrypoints, replacing it with explicit_cli_keys filtering in load_stage_configs_from_model.
  • Adds profiler_config dict to dataclass conversion in OmniEngineArgs.

One issue: merge order dependency with #3144

This PR removes nullify_stage_engine_defaults from arg_utils.py. But #3144 adds callers of that same function in 12+ offline example scripts, importing it from arg_utils. This creates a hard dependency:

  • If #3160 merges first, #3144's imports will fail.
  • If #3144 merges first, this PR will have a file-level conflict on arg_utils.py.

Recommendation: #3160 should merge last in the chain. Either keep nullify_stage_engine_defaults as a thin shim in arg_utils.py that warns and delegates, or coordinate the merge so all of #3144's callers land first.
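
One possible shape for such a shim, as a sketch only. The real signature of `nullify_stage_engine_defaults` is not shown in this thread, so the `(parser, field_names)` signature and the `nullify_parser_defaults` replacement helper are both assumptions.

```python
import argparse
import warnings


def nullify_parser_defaults(parser, field_names):
    """Hypothetical replacement helper: reset the given defaults to None."""
    parser.set_defaults(**{name: None for name in field_names})


def nullify_stage_engine_defaults(parser, field_names=("dtype", "max_num_seqs")):
    """Thin compatibility shim: warn, then delegate to the new helper."""
    warnings.warn(
        "nullify_stage_engine_defaults is deprecated; "
        "use nullify_parser_defaults instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not the shim
    )
    nullify_parser_defaults(parser, field_names)


parser = argparse.ArgumentParser()
parser.add_argument("--dtype", default="auto")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    nullify_stage_engine_defaults(parser, field_names=("dtype",))
```

Existing imports in the example scripts would keep working under this arrangement, and the warning gives callers a migration signal before the symbol is removed.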

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@zhumingjue138
Contributor

Please add a unit test case if necessary.

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin
Collaborator Author

gcanlin commented Apr 28, 2026

@hsliuustc0106 @lishunyang12 @xiaohajiayou I will close this PR because I have no more bandwidth to handle it. But please continue to fix the bug. Thanks

@gcanlin gcanlin closed this Apr 28, 2026
@lishunyang12
Collaborator

> @hsliuustc0106 @lishunyang12 @xiaohajiayou I will close this PR because I have no more bandwidth to handle it. But please continue to fix the bug. Thanks
