
fix --gpu-memory-utilization CLI override#2516

Closed
tarikcurto wants to merge 1 commit into vllm-project:main from tarikcurto:fix/gpu-memory-utilization-arg-override

Conversation


@tarikcurto tarikcurto commented Apr 6, 2026

Purpose

Fix --gpu-memory-utilization (and other standard EngineArgs flags) being silently ignored when serving multi-stage Omni models with --omni.

Related issue: --gpu-memory-utilization CLI flag ignored in --omni mode (Voxtral-4B-TTS-2603 uses ~80% GPU despite --gpu-memory-utilization .2)

Root cause: load_stage_configs_from_yaml merges CLI kwargs with YAML per-stage engine_args via OmegaConf.merge(cli_args, yaml_args). Since OmegaConf.merge is left-to-right (later wins), YAML values always override CLI values.
voxtral_tts.yaml hardcodes gpu_memory_utilization: 0.8 (stage 0) and 0.1 (stage 1), discarding any user-specified value.
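The merge-order bug can be illustrated with a minimal stdlib sketch. Plain dicts stand in for OmegaConf configs here (the real code uses OmegaConf.merge, which has the same later-argument-wins semantics); the values mirror the Voxtral stage-0 example above:

```python
# Minimal sketch of the merge-order bug, using plain dicts to mimic
# OmegaConf.merge's "later argument wins" behavior on key conflicts.

def merge(base: dict, override: dict) -> dict:
    """Later dict wins on conflicts, like OmegaConf.merge(base, override)."""
    return {**base, **override}

# User explicitly passed --gpu-memory-utilization .2 on the CLI.
cli_args = {"gpu_memory_utilization": 0.2, "max_model_len": 1000}

# voxtral_tts.yaml hardcodes a per-stage value for stage 0.
yaml_stage0 = {"gpu_memory_utilization": 0.8, "enforce_eager": True}

# Buggy order: the YAML dict is the later (winning) argument.
merged = merge(cli_args, yaml_stage0)
print(merged["gpu_memory_utilization"])  # 0.8 -- the CLI value is lost
```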

Fix: Thread a user_overrides dict through the config loading chain. It is computed in AsyncOmniEngine._resolve_stage_configs by comparing kwargs against OmniEngineArgs defaults — only values that differ from the dataclass
default are treated as user-specified. After the YAML merge, user_overrides are re-applied as the highest-priority layer, ensuring:

  • Explicit CLI flags (e.g. --gpu-memory-utilization, --max-model-len) are always respected.
  • YAML model-specific settings that the user did not override (e.g. enforce_eager, scheduler_cls, worker_type) are unaffected.
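A simplified sketch of the approach (the dataclass fields, defaults, and the flat stage-config shape are hypothetical stand-ins for the actual OmniEngineArgs and per-stage engine_args):

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class OmniEngineArgs:
    # Hypothetical, simplified defaults standing in for the real dataclass.
    gpu_memory_utilization: float = 0.9
    max_model_len: Optional[int] = None
    enforce_eager: bool = False

def compute_user_overrides(kwargs: dict) -> dict:
    """Treat any kwarg that differs from the dataclass default as user-set."""
    defaults = OmniEngineArgs()
    return {
        f.name: kwargs[f.name]
        for f in fields(OmniEngineArgs)
        if f.name in kwargs and kwargs[f.name] != getattr(defaults, f.name)
    }

def resolve_stage(cli_kwargs: dict, yaml_stage: dict) -> dict:
    user_overrides = compute_user_overrides(cli_kwargs)
    merged = {**cli_kwargs, **yaml_stage}  # YAML still wins in the merge...
    merged.update(user_overrides)          # ...but explicit CLI wins last
    return merged

stage = resolve_stage(
    {"gpu_memory_utilization": 0.2, "max_model_len": 1000},
    {"gpu_memory_utilization": 0.8, "enforce_eager": True},
)
print(stage["gpu_memory_utilization"])  # 0.2 -- CLI flag respected
print(stage["enforce_eager"])          # True -- YAML setting untouched
```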

Changed files:

  • vllm_omni/engine/async_omni_engine.py — compute user_overrides from kwargs vs. OmniEngineArgs defaults
  • vllm_omni/entrypoints/utils.py — propagate user_overrides through load_and_resolve_stage_configs → load_stage_configs_from_model → load_stage_configs_from_yaml and re-apply post-merge

Test Plan

Manually verified with mistralai/Voxtral-4B-TTS-2603 on an NVIDIA RTX 6000 (96 GiB):

vllm serve mistralai/Voxtral-4B-TTS-2603 --omni \
  --gpu-memory-utilization .2 \
  --max-model-len 1000 \
  --port 8001

Checked nvidia-smi and the Available KV cache memory log line for each stage to confirm the flag is respected.

No new test scripts are added as this is a config-loading fix in the legacy OmegaConf path (marked deprecated, slated for removal in PR series [2/N]). Existing unit tests in tests/ continue to pass.

Test Result

Before fix:
(EngineCore pid=3097) INFO [base.py:129] Available KV cache memory: 67.2 GiB (process-scoped)
nvidia-smi: 79831MiB / 97887MiB (~82% utilization)

After fix:
(EngineCore pid=XXXX) INFO [base.py:129] Available KV cache memory: ~11.7 GiB (process-scoped)
nvidia-smi: 21471MiB / 97887MiB (~20% utilization per stage)

Stage 0 and stage 1 each receive gpu_memory_utilization=0.2 as specified, overriding the YAML defaults of 0.8 and 0.1.

Signed-off-by: Tarik Curto <centro.tarik@live.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6db697568f


Comment on lines +954 to +955
if key in kwargs and kwargs[key] != getattr(_default, key, None):
    user_overrides[key] = kwargs[key]

P2: Detect explicit CLI overrides independently of value equality

This override detection only includes keys whose value differs from OmniEngineArgs defaults, so explicitly provided flags that happen to equal the default are treated as "not user-set" and never reapplied after YAML merge. In multi-stage YAMLs that set a different per-stage value, an explicit CLI flag like --gpu-memory-utilization set to the default still gets silently overridden by YAML, which contradicts the intended "explicit CLI wins" behavior.
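The flaw in the value-equality check quoted above can be reproduced with a small hypothetical sketch (the Defaults class and 0.9 default are illustrative, not the actual values):

```python
from dataclasses import dataclass

@dataclass
class Defaults:
    # Hypothetical: suppose the dataclass default happens to be 0.9.
    gpu_memory_utilization: float = 0.9

def detect_overrides(kwargs: dict) -> dict:
    """Value-equality detection: anything equal to the default is dropped."""
    d = Defaults()
    return {k: v for k, v in kwargs.items() if v != getattr(d, k, None)}

# User *explicitly* typed --gpu-memory-utilization 0.9, equal to the default.
print(detect_overrides({"gpu_memory_utilization": 0.9}))  # {} -- invisible,
# so a differing per-stage YAML value would still silently win.

# A non-default value is detected as expected.
print(detect_overrides({"gpu_memory_utilization": 0.2}))
```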


@lishunyang12
Collaborator

Hi, @tarikcurto. Thanks for your interest in vllm-omni. This is a long-standing issue; we are working on a large-scale config refactoring to tackle it. Could you take a look at our RFC #2072 and the preliminary PR #2383? Any feedback is appreciated.

@tarikcurto
Author

Hello @lishunyang12, thank you for the clarification. In that case I will proceed to close this PR.

@tarikcurto tarikcurto closed this Apr 6, 2026
@tarikcurto
Author

I have seen that @lishunyang12 closed (paused) their PR #2383.
In the meantime, I think the main branch should have a fix so that the --gpu-memory-utilization argument works.
For that reason, I have decided to reopen this small fix PR.

@tarikcurto tarikcurto reopened this Apr 8, 2026
@tarikcurto
Author

Fixed in PR #2663 with a more minimal solution.

I will proceed to close this PR.

@tarikcurto tarikcurto closed this Apr 10, 2026