StageConfigFactory.create_from_model() + to_omegaconf() is not ready for the entrypoint refactor — it belongs in config refactor [2/N]. The current code passes raw EngineArgs defaults through _merge_cli_overrides into OmegaConf.create(), which crashes on non-primitive types (Literal, set, Counter). Even after fixing serialization, EngineArgs defaults (gpu_memory_utilization=0.9) overwrite per-stage YAML values (0.3/0.2), causing OOM when two stages share one GPU. Switch to the existing load_and_resolve_stage_configs() path which correctly merges YAML stage configs with CLI overrides. Tested on H100 with Qwen3-TTS — server starts and benchmarks run.
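The override-order bug described above can be reproduced with a plain dict-merge sketch. The dicts and merge order here are illustrative stand-ins for what `_merge_cli_overrides` does, not the actual implementation:

```python
# Per-stage value as loaded from the stage YAML (illustrative).
yaml_stage_cfg = {"gpu_memory_utilization": 0.3}

# Raw EngineArgs default (gpu_memory_utilization defaults to 0.9 in vLLM).
engine_args_defaults = {"gpu_memory_utilization": 0.9}

# Buggy order: defaults are merged *after* the YAML stage config,
# so the default 0.9 clobbers the per-stage 0.3.
buggy = {**yaml_stage_cfg, **engine_args_defaults}
print(buggy["gpu_memory_utilization"])  # 0.9 -> two stages each claim 90% -> OOM

# Correct order: YAML stage values (and CLI overrides) win over defaults.
fixed = {**engine_args_defaults, **yaml_stage_cfg}
print(fixed["gpu_memory_utilization"])  # 0.3
```

This is why two stages sharing one GPU go OOM: with the buggy order both stages end up claiming 90% of the device instead of their configured 30%/20% shares.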
…onfig-path fix: use legacy config loading path instead of StageConfigFactory
CI runs on NVIDIA L4 (24GB). With gpu_memory_utilization=0.08, each stage only gets ~1.9GB which leaves no room for KV cache after model loading. Increase to 0.3 (7.2GB per stage, 14.4GB total) to fit both stages on L4.
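The memory budget above works out as follows (a quick arithmetic check; 24 GB is the L4's total device memory as stated):

```python
gpu_total_gb = 24.0  # NVIDIA L4

per_stage_old = round(gpu_total_gb * 0.08, 2)  # budget per stage at 0.08
per_stage_new = round(gpu_total_gb * 0.3, 1)   # budget per stage at 0.3

print(per_stage_old)      # 1.92 GB -- no headroom for KV cache after weights
print(per_stage_new)      # 7.2 GB per stage
print(per_stage_new * 2)  # 14.4 GB for both stages, fits in 24 GB
```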
Fix bagel bugs 2
…emory fix: increase gpu_memory_utilization for TTS CI on L4
Pull request overview
Refactors vLLM-Omni stage/serve initialization to support a new “single-stage” deployment flow, including a master registration server for engine cores and a revamped headless stage runner.
Changes:
- Add `OmniMasterServer` + omni-specific engine-core launch/handshake utilities for single-stage mode.
- Rework `AsyncOmniEngine` stage initialization to support single-stage filtering and remote-stage startup waiting.
- Update CLI `serve --headless` and OpenAI API server engine construction to use the new wiring / `OmniEngineArgs`.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| vllm_omni/entrypoints/openai/api_server.py | Switches AsyncOmni construction to use OmniEngineArgs.from_cli_args. |
| vllm_omni/entrypoints/cli/serve.py | Replaces deprecated headless mode with new headless stage runner that registers with an Omni master. |
| vllm_omni/engine/stage_init_utils.py | Extends StartedLlmStage with a remote_client field (intended for remote stages). |
| vllm_omni/engine/remote_stage_client.py | Adds a RemoteStageClient marker dataclass (currently unused). |
| vllm_omni/engine/omni_core_engine.py | New module implementing OmniMasterServer, registration, and omni launch/remote handshake helpers. |
| vllm_omni/engine/async_omni_engine.py | Adds single-stage mode detection, master-server startup, and remote-stage initialization path. |
| vllm_omni/engine/arg_utils.py | Makes stage_id optional to enable single-stage mode; adds omni master address/port args. |
Comment on lines +144 to +146

```python
"""Register a CacheConfig whose num_gpu_blocks will be updated on READY."""
self._cache_configs[stage_id] = cache_config
```
Comment on lines +432 to +451

```python
if self.single_stage_mode:
    if not self._omni_master_address or not self._omni_master_port:
        raise ValueError(
            "AsyncOmniEngine single_stage_mode requires both "
            "omni_master_address and omni_master_port to be set."
        )
    # Collect all LLM stage IDs for pre-allocation.
    all_llm_stage_ids = [
        i for i, sc in enumerate(self.stage_configs)
        if getattr(sc, "stage_type", "llm") != "diffusion"
    ]
    self._omni_master_server = OmniMasterServer(
        master_address=self._omni_master_address,
        master_port=self._omni_master_port,
        stage_ids=all_llm_stage_ids,
    )
    self._omni_master_server.start()
    logger.info(
        "[AsyncOmniEngine] OmniMasterServer started for stages %s",
        all_llm_stage_ids,
    )
```
Signed-off-by: wuhang <wuhang6@huawei.com>