StageConfigFactory.create_from_model() + to_omegaconf() is not ready for the entrypoint refactor — it belongs in config refactor [2/N]. The current code passes raw EngineArgs defaults through _merge_cli_overrides into OmegaConf.create(), which crashes on non-primitive types (Literal, set, Counter). Even after fixing serialization, EngineArgs defaults (gpu_memory_utilization=0.9) overwrite per-stage YAML values (0.3/0.2), causing OOM when two stages share one GPU. Switch to the existing load_and_resolve_stage_configs() path which correctly merges YAML stage configs with CLI overrides. Tested on H100 with Qwen3-TTS — server starts and benchmarks run.
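The override-order bug described above can be reproduced with a plain dict-merge sketch. The dicts and merge order here are illustrative stand-ins for what `_merge_cli_overrides` does, not the actual implementation:

```python
# Per-stage value as loaded from the stage YAML (illustrative).
yaml_stage_cfg = {"gpu_memory_utilization": 0.3}

# Raw EngineArgs default (gpu_memory_utilization defaults to 0.9 in vLLM).
engine_args_defaults = {"gpu_memory_utilization": 0.9}

# Buggy order: defaults are merged *after* the YAML stage config,
# so the default 0.9 clobbers the per-stage 0.3.
buggy = {**yaml_stage_cfg, **engine_args_defaults}
print(buggy["gpu_memory_utilization"])  # 0.9 -> two stages each claim 90% -> OOM

# Correct order: YAML stage values (and CLI overrides) win over defaults.
fixed = {**engine_args_defaults, **yaml_stage_cfg}
print(fixed["gpu_memory_utilization"])  # 0.3
```

This is why two stages sharing one GPU go OOM: with the buggy order both stages end up claiming 90% of the device instead of their configured 30%/20% shares.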
…onfig-path fix: use legacy config loading path instead of StageConfigFactory
CI runs on NVIDIA L4 (24GB). With gpu_memory_utilization=0.08, each stage only gets ~1.9GB which leaves no room for KV cache after model loading. Increase to 0.3 (7.2GB per stage, 14.4GB total) to fit both stages on L4.
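The memory budget above works out as follows (a quick arithmetic check; 24 GB is the L4's total device memory as stated):

```python
gpu_total_gb = 24.0  # NVIDIA L4

per_stage_old = round(gpu_total_gb * 0.08, 2)  # budget per stage at 0.08
per_stage_new = round(gpu_total_gb * 0.3, 1)   # budget per stage at 0.3

print(per_stage_old)      # 1.92 GB -- no headroom for KV cache after weights
print(per_stage_new)      # 7.2 GB per stage
print(per_stage_new * 2)  # 14.4 GB for both stages, fits in 24 GB
```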
Fix bagel bugs 2
…emory fix: increase gpu_memory_utilization for TTS CI on L4
Pull request overview
Refactors vLLM-Omni stage/serve initialization to support a new “single-stage” deployment flow, including a master registration server for engine cores and a revamped headless stage runner.
Changes:
- Add `OmniMasterServer` + omni-specific engine-core launch/handshake utilities for single-stage mode.
- Rework `AsyncOmniEngine` stage initialization to support single-stage filtering and remote-stage startup waiting.
- Update CLI `serve --headless` and OpenAI API server engine construction to use the new wiring / `OmniEngineArgs`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| vllm_omni/entrypoints/openai/api_server.py | Switches AsyncOmni construction to use OmniEngineArgs.from_cli_args. |
| vllm_omni/entrypoints/cli/serve.py | Replaces deprecated headless mode with new headless stage runner that registers with an Omni master. |
| vllm_omni/engine/stage_init_utils.py | Extends StartedLlmStage with a remote_client field (intended for remote stages). |
| vllm_omni/engine/remote_stage_client.py | Adds a RemoteStageClient marker dataclass (currently unused). |
| vllm_omni/engine/omni_core_engine.py | New module implementing OmniMasterServer, registration, and omni launch/remote handshake helpers. |
| vllm_omni/engine/async_omni_engine.py | Adds single-stage mode detection, master-server startup, and remote-stage initialization path. |
| vllm_omni/engine/arg_utils.py | Makes stage_id optional to enable single-stage mode; adds omni master address/port args. |
Comment on lines +144 to +146

```python
"""Register a CacheConfig whose num_gpu_blocks will be updated on READY."""
self._cache_configs[stage_id] = cache_config
```
Comment on lines +432 to +451

```python
if self.single_stage_mode:
    if not self._omni_master_address or not self._omni_master_port:
        raise ValueError(
            "AsyncOmniEngine single_stage_mode requires both "
            "omni_master_address and omni_master_port to be set."
        )
    # Collect all LLM stage IDs for pre-allocation.
    all_llm_stage_ids = [
        i for i, sc in enumerate(self.stage_configs)
        if getattr(sc, "stage_type", "llm") != "diffusion"
    ]
    self._omni_master_server = OmniMasterServer(
        master_address=self._omni_master_address,
        master_port=self._omni_master_port,
        stage_ids=all_llm_stage_ids,
    )
    self._omni_master_server.start()
    logger.info(
        "[AsyncOmniEngine] OmniMasterServer started for stages %s",
        all_llm_stage_ids,
    )
```
Signed-off-by: wuhang <wuhang6@huawei.com>