fix --gpu-memory-utilization CLI override #2516
tarikcurto wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Tarik Curto <centro.tarik@live.com>
💡 Codex Review
Reviewed commit: 6db697568f
```python
if key in kwargs and kwargs[key] != getattr(_default, key, None):
    user_overrides[key] = kwargs[key]
```
Detect explicit CLI overrides independently of value equality
This override detection only includes keys whose value differs from `OmniEngineArgs` defaults, so explicitly provided flags that happen to equal the default are treated as "not user-set" and never reapplied after the YAML merge. In multi-stage YAMLs that set a different per-stage value, an explicit CLI flag like `--gpu-memory-utilization` set to the default still gets silently overridden by YAML, which contradicts the intended "explicit CLI wins" behavior.
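One way to address this (a hypothetical sketch, not code from this PR) is to detect explicitly passed flags with a sentinel default, so a flag set to its default value still counts as user-provided:

```python
import argparse

# Sentinel distinguishing "flag not passed" from "flag passed with the default value".
_UNSET = object()

parser = argparse.ArgumentParser()
parser.add_argument("--gpu-memory-utilization", type=float, default=_UNSET)
parser.add_argument("--max-model-len", type=int, default=_UNSET)

# Simulate a user explicitly passing a value that may equal the engine default.
args = parser.parse_args(["--gpu-memory-utilization", "0.9"])

# Every key present here was explicitly set on the command line,
# regardless of whether the value matches any default.
user_overrides = {k: v for k, v in vars(args).items() if v is not _UNSET}
print(user_overrides)  # {'gpu_memory_utilization': 0.9}
```

Real defaults can then be filled in after override detection, keeping "explicit CLI wins" semantics intact even when the explicit value equals the default.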
Hi, @tarikcurto. Thanks for your interest in vllm-omni. This is a long-standing issue; we are working on a large-scale config refactoring to tackle it. Could you take a look at our RFC #2072 and the preliminary PR #2383? Any feedback is appreciated.
Hello @lishunyang12, thank you for the clarification. In that case I'll proceed to close this PR.
I see that @lishunyang12 has closed (paused) their PR #2383.
Fixed in PR #2663 with a more minimal solution. I'll proceed to close this PR.
Purpose
Fix `--gpu-memory-utilization` (and other standard `EngineArgs` flags) being silently ignored when serving multi-stage Omni models with `--omni`.

Related issue:
- `--gpu-memory-utilization` CLI flag ignored in `--omni` mode (Voxtral-4B-TTS-2603 uses ~80% GPU despite `--gpu-memory-utilization .2`)

Root cause: `load_stage_configs_from_yaml` merges CLI kwargs with YAML per-stage `engine_args` via `OmegaConf.merge(cli_args, yaml_args)`. Since `OmegaConf.merge` is left-to-right (later wins), YAML values always override CLI values. `voxtral_tts.yaml` hardcodes `gpu_memory_utilization: 0.8` (stage 0) and `0.1` (stage 1), discarding any user-specified value.

Fix: Thread a `user_overrides` dict through the config-loading chain. It is computed in `AsyncOmniEngine._resolve_stage_configs` by comparing kwargs against `OmniEngineArgs` defaults; only values that differ from the dataclass default are treated as user-specified. After the YAML merge, `user_overrides` are re-applied as the highest-priority layer, ensuring:
- CLI flags (e.g. `--gpu-memory-utilization`, `--max-model-len`) are always respected.
- YAML per-stage settings not set on the CLI (e.g. `enforce_eager`, `scheduler_cls`, `worker_type`) are unaffected.

Changed files:
- `vllm_omni/engine/async_omni_engine.py` — compute `user_overrides` from kwargs vs. `OmniEngineArgs` defaults
- `vllm_omni/entrypoints/utils.py` — propagate `user_overrides` through `load_and_resolve_stage_configs` → `load_stage_configs_from_model` → `load_stage_configs_from_yaml` and re-apply post-merge

Test Plan
Manually verified with `mistralai/Voxtral-4B-TTS-2603` on an NVIDIA RTX 6000 (96 GiB): checked `nvidia-smi` and the `Available KV cache memory` log line for each stage to confirm the flag is respected.
No new test scripts are added, as this is a config-loading fix in the legacy OmegaConf path (marked deprecated and slated for removal in PR series [2/N]). Existing unit tests in `tests/` continue to pass.
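As a rough sanity check on the after-fix numbers in the test results (a back-of-the-envelope sketch; the non-KV overhead figure is inferred, not measured):

```python
# All inputs come from the PR description; the overhead is back-computed.
total_gib = 96.0                # NVIDIA RTX 6000: ~96 GiB of VRAM
util = 0.2                      # --gpu-memory-utilization .2
budget_gib = total_gib * util   # per-stage memory budget under the flag
kv_cache_gib = 11.7             # reported "Available KV cache memory" after the fix

# Implied memory for weights, activations, and other non-KV allocations.
overhead_gib = budget_gib - kv_cache_gib
print(f"budget: {budget_gib:.1f} GiB, implied non-KV usage: {overhead_gib:.1f} GiB")
```

The ~19.2 GiB budget is roughly consistent with the ~21.5 GiB per-stage `nvidia-smi` reading, allowing for CUDA context and allocator overhead outside the budget.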
Test Result
Before fix:
```
(EngineCore pid=3097) INFO [base.py:129] Available KV cache memory: 67.2 GiB (process-scoped)
nvidia-smi: 79831MiB / 97887MiB (~82% utilization)
```

After fix:
```
(EngineCore pid=XXXX) INFO [base.py:129] Available KV cache memory: ~11.7 GiB (process-scoped)
nvidia-smi: 21471MiB / 97887MiB (~20% utilization per stage)
```

Stage 0 and stage 1 each receive `gpu_memory_utilization=0.2` as specified, overriding the YAML defaults of 0.8 and 0.1.
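The precedence behavior verified above can be sketched with plain dicts standing in for the OmegaConf configs (illustrative names and values, not the PR's actual code):

```python
def resolve_stage_args(cli_args: dict, yaml_args: dict, user_overrides: dict) -> dict:
    # OmegaConf.merge(cli_args, yaml_args) is left-to-right (later wins),
    # so YAML stage values clobber CLI values. Plain-dict equivalent:
    merged = {**cli_args, **yaml_args}
    # The fix: re-apply explicit CLI overrides as the highest-priority layer.
    merged.update(user_overrides)
    return merged

stage0 = resolve_stage_args(
    cli_args={"gpu_memory_utilization": 0.2, "enforce_eager": False},
    yaml_args={"gpu_memory_utilization": 0.8, "enforce_eager": True},  # stage-0 YAML values
    user_overrides={"gpu_memory_utilization": 0.2},  # user explicitly passed the flag
)
print(stage0)  # {'gpu_memory_utilization': 0.2, 'enforce_eager': True}
```

Without the final `update`, stage 0 would silently run with the YAML's 0.8, which is exactly the bug this PR addresses.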