[Config Refactor] Derive deploy override fields from stage config#3162
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b54ccf2c68
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
b54ccf2 to
d63ea16
Compare
hsliuustc0106
left a comment
There was a problem hiding this comment.
Cross-PR Config Refactor Review
This is one of several coordinated config refactor PRs. Reviewed together with #3160, #3154, #3144, #3128, #3120, #3139.
What this PR does
Moves deploy_override_field_names() from arg_utils.py to stage_config.py, where the deploy schema fields actually live. The override field set is now computed from _STAGE_DEPLOY_FIELDS | _PIPELINE_WIDE_ENGINE_FIELDS | {"async_chunk", "devices"}.
Assessment
Clean, small, single-purpose. No issues.
Merge order note
This PR touches omni_base.py (the from_cli_args import path), which is also touched by #3160 (rewrites from_cli_args) and #3144 (deprecates it). These should merge in order: #3162 → #3144 → #3160, to minimize conflict surface. #3160 removes the section that #3162 modifies, so #3162 must merge first.
hsliuustc0106
left a comment
There was a problem hiding this comment.
Cross-PR Config Refactor Review
This is one of several coordinated config refactor PRs. Reviewed together with #3160, #3154, #3144, #3128, #3120, #3139.
What this PR does
Moves deploy_override_field_names() from arg_utils.py to stage_config.py, where the deploy schema fields actually live. The override field set is now computed from the stage deploy fields, pipeline-wide engine fields, plus async_chunk and devices.
Assessment
Clean, small, single-purpose. No issues.
Merge order note
This PR touches omni_base.py (the from_cli_args import path), which is also touched by #3160 (rewrites from_cli_args) and #3144 (deprecates it). These should merge in order: #3162 -> #3144 -> #3160, to minimize conflict surface. #3160 removes the section that #3162 modifies, so #3162 must merge first.
I’ve considered this issue as well, and it shouldn’t be a problem with the current implementation. For upstream-owned fields, the intended behavior is: The key detail is that upstream default consistency does not come from downstream treating For llm stage, the default-value / precedence flow (new deploy path) is like: |
|
It seems that there exists the same bug in this PR. I pull it and test the bad case: |
This ensures that only values explicitly provided in the deploy YAML are included in the resulting
This unifies the override behavior while delegating default resolution to the appropriate downstream layer. |
Related context
Follow-up questionThis leads to a design question:
Option 1: Strict schemaKeep OmniEngineArgs(**filtered_engine_args_dict)Implications:
Option 2: Allow YAML extensibilityAllow users to extend deploy YAML beyond the predefined schema. Current issue:
In this case, if we want to support extensible YAML with correct override semantics:
This approach is more flexible, but may be more complex to maintain. Could you please take a look when you have time and advise how to resolve these issues? |
| data_parallel_size: int | None = None | ||
| pipeline_parallel_size: int | None = None | ||
| config_format: str | None = None | ||
| load_format: str | None = None | ||
| tokenizer_mode: str | None = None |
There was a problem hiding this comment.
This change will also lead to the same issue that is setting the whitelist for the engine args of vllm.
There was a problem hiding this comment.
Yes, I agree this points to a more fundamental issue: how to determine which fields are truly intended to be user-overridable.
The complexity comes from the fact that we currently have three different input sources into Omni(), while also mixing LLM and diffusion configurations in the same flow.
To correctly distinguish explicit user inputs, we have two relatively straightforward options:
In the current override mechanism, any field with a non-None value when constructing StageConfig is treated as an explicit user input.
-
Maintain a complete field set in
StageDeployConfig/DeployConfig(i.e., a whitelist), and set all defaults toNone, so that only explicitly provided values participate in override -
Alternatively, we would need to maintain a diffusion-specific blocklist, and set all other fields to None.
Otherwise, we would need to introduce a more complex mechanism, such as:
- removing
make_arg_parser(serve_parser)in the online path - and applying default completion only after
stage_configconstruction viaEngineArgs - The offline path is still unclear under this approach and would require further design.
lishunyang12
left a comment
There was a problem hiding this comment.
Re: design question — I'd lean Option 1. The whitelist drift is what just bit us with profiler_config, and Option 2 has the same shape — every new vLLM EngineArg silently breaks override semantics until someone hits it. Strict schema forces the failure at PR time, which is cheaper than runtime. The sync tax per vLLM bump is real but small.
| gpu_memory_utilization: float | None = None | ||
| tensor_parallel_size: int | None = None | ||
| enforce_eager: bool | None = None | ||
| max_num_batched_tokens: int = 32768 |
There was a problem hiding this comment.
Intentional to keep max_num_seqs, max_num_batched_tokens, and trust_remote_code (L446) concrete? Same leakage path as the others if the user omits them — these would still override upstream defaults.
There was a problem hiding this comment.
I have not changed these defaults to None in this PR yet because their current values are not the same as vLLM defaults, and #3128's description explicitly listed some of them as retained vLLM-Omni-specific behavior, e.g. max_num_seqs (64 vs vLLM's 256).
That said, I noticed that #3128's actual diff also removes max_num_seqs. So if the intended direction is to delegate these defaults back to vLLM, then I agree these fields can likely be changed to None as well, I updated in 418c850
Tbh I'd go Option 1. The The maintenance tax is overstated IMO — this PR added 8 fields at once to clear backlog, but steady-state is more like 1-2 per vLLM bump.
Option 2 also has a hidden coupling: making it work cleanly requires either removing Concrete suggestion: land this PR as Option 1, then in a follow-up (a) make |
|
Option 1 looks like the most maintainable path for the current code structure. The main value is making override semantics fail early: new upstream engine fields have to be represented intentionally in the deploy schema, instead of silently flowing through or being masked by parser/dataclass defaults. That gives reviewers and tests a concrete place to catch drift, while keeping the current |
|
I think we need a few different pieces for long-term sustainability.
Setting all defaults to
I'll look into some of these threads as well @xiaohajiayou, but happy to collaborate however 🙂 |
|
I reviewed this against #3220 and the schema-derived override direction looks right. One small review-readiness gap: |
Based on the existing YAMLs, these fields are currently configured per stage rather than at the top level. So I think a better first step is to put This keeps the existing YAML syntax unchanged, while allowing this PR’s schema-derived whitelist to recognize these fields explicitly and preserve the correct override semantics., I fix this in 9ad134e. |
f437427 to
418c850
Compare
Thanks, I agree with this direction. I think there are two related but separate layers here:
So I see this PR as the smaller correctness step: keep the strict deploy schema as the override whitelist and make the current precedence behavior correct. Then we can follow up with a more general explicit-value tracking utility / CI alignment checks to make the mechanism more maintainable. |
|
Cool, I think we are on the same page. I agree, I think there is a lot of nuance to this, but it's best to decouple discussions about the implementation from the user experience, and it shouldn't block this PR, more just discussion for the future 🙂 After thinking about this a bit more, maybe it's confusing to have the schema set top-level keys that are applied to all stages without letting those stages also specify it themselves + just validating against bad behavior. As a minimal example with one stage: If you can do something like this: dtype: float32
stages:
- stage_id: 0
gpu_memory_utilization: 0.9
max_num_seqs: 32You should be able to do something like this too: stages:
- stage_id: 0
gpu_memory_utilization: 0.9
max_num_seqs: 32
dtype: float32But if that's true, this is natural feeling as well: stages:
- stage_id: 0
gpu_memory_utilization: 0.9
max_num_seqs: 32
dtype: float32
- stage_id: 0
gpu_memory_utilization: 0.9
max_num_seqs: 32
dtype: float16and you'd just expect it to throw an error since dtype needs consistent values for now. Having to know about which keys are allowed where and treating everything as separate adds another layer of complexity, when the resolved stage config should be directly translatable to engine args of that corresponding stage type (although Diffusion is kind of bundling everything into the OmniDiffusionConfig at the moment). Being able to essentially map it to an engine args like thing directly and push the validation further down after initial merging and resolution also makes more sense for extensibility. E.g., if we add reconstruction engine for world models later on, the stage config object will probably get even more confusing because it will need to be compatible with those stages too |
a8410ed to
b05cb02
Compare
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
Signed-off-by: xiaohajiayou <923390377@qq.com>
2ea08fc to
176246d
Compare
Signed-off-by: xiaohajiayou <923390377@qq.com>
176246d to
e2aa6aa
Compare
|
I was able to reproduce the reported stage-0 torch-profiler case locally using For reference, the stage-0 configuration I used is: stages:
- stage_id: 0
max_num_seqs: 10
gpu_memory_utilization: 0.3
async_scheduling: true
max_num_batched_tokens: 512
max_model_len: 4096
devices: "0"
output_connectors:
to_stage_1: connector_of_shared_memory
profiler_config:
profiler: torch
torch_profiler_dir: ./thinker-omniResults:
Based on this, the issue reported in #3220 appears to be resolved. Relevant logs: |
|
The current CI issues seem to be caused by missing fields in some model YAMLs. Previously, these were implicitly handled by default fallbacks. However, after setting these defaults to To address this, I’ve added the required default values back into the affected model YAMLs in e2aa6aa. |
|
Please leave a Todo issue for follow-ups. @xiaohajiayou |
…nfig (vllm-project#3162)" This reverts commit 01ebc0c. Co-authored-by: Gaohan123 <20148503+Gaohan123@users.noreply.github.com>
…lm-project#3162) Signed-off-by: xiaohajiayou <923390377@qq.com>
…lm-project#3162) Signed-off-by: xiaohajiayou <923390377@qq.com> Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
…lm-project#3162) Signed-off-by: xiaohajiayou <923390377@qq.com>
Purpose
Move deploy override field ownership from
arg_utils.pytostage_config.py.The nullify path now derives deploy-overridable fields from the deploy schema/merge logic instead of maintaining a duplicated manual allowlist in
arg_utils.py. This keeps the field set aligned withDeployConfig,StageDeployConfig, and special deploy/runtime fields such asasync_chunkanddevices.Test Plan
Omni(...)construction and deprecatedfrom_cli_args(...)compatibility.Test Result
.venv/bin/python -m py_compile ....venv/bin/pytest tests/entrypoints/test_omni_entrypoints.py -q39 passed2 passedEssential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)