[Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring#1908
Merged
hsliuustc0106 merged 141 commits intovllm-project:mainfrom Mar 18, 2026
Merged
[Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring#1908hsliuustc0106 merged 141 commits intovllm-project:mainfrom
hsliuustc0106 merged 141 commits intovllm-project:mainfrom
Conversation
Signed-off-by: Peiqi Yin <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
Signed-off-by: yinpe <11810305@mail.sustech.edu.cn>
|
I noticed that --headless has been deprecated, and Ray is now the recommended backend. However, diffusion models do not currently support Ray. In that case, what is the recommended way to deploy omni models across multiple nodes? Also, what are the future plans for multi-node support? |
5 tasks
5 tasks
Contributor
Author
The functionality has already been implemented (fake0fan#24) and is expected to be merged in the near future. |
This was referenced Mar 19, 2026
1 task
5 tasks
fhfuih
added a commit
to fhfuih/vllm-omni
that referenced
this pull request
Mar 19, 2026
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
fhfuih
added a commit
to fhfuih/vllm-omni
that referenced
this pull request
Mar 19, 2026
Signed-off-by: Huang, Zeyu <11222265+fhfuih@users.noreply.github.com>
This was referenced Mar 19, 2026
38 tasks
yiliu30
pushed a commit
to yiliu30/vllm-omni-fork
that referenced
this pull request
Mar 20, 2026
…#1908) Signed-off-by: yinpe <11810305@mail.sustech.edu.cn> Signed-off-by: Peiqi Yin <yinpeiqi809@gmail.com> Signed-off-by: yinpeiqi <yinpeiqi809@gmail.com> Signed-off-by: wuhang <whlbx@hotmail.com> Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: princepride <wangzhipeng628@gmail.com> Signed-off-by: lishunyang <lishunyang12@163.com> Signed-off-by: Chenguang ZHENG <645327136@qq.com> Signed-off-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com> Co-authored-by: yinpe <11810305@mail.sustech.edu.cn> Co-authored-by: Peiqi Yin <yinpeiqi809@gmail.com> Co-authored-by: wuhang <whlbx@hotmail.com> Co-authored-by: Peiqi Yin <60515999+yinpeiqi@users.noreply.github.com> Co-authored-by: linyueqian <linyueqian@outlook.com> Co-authored-by: princepride <wangzhipeng628@gmail.com> Co-authored-by: lishunyang <lishunyang12@163.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>
3 tasks
yJader
added a commit
to omni-nicelab/vllm-omni-batching
that referenced
this pull request
Mar 21, 2026
- add notes in scheduler - align to vllm-project#1908, move "step_execution" into AsyncOmniEngine._create_default_diffusion_stage_cfg - note: due to 6bdb55a, tests can't pass Signed-off-by: jader <yjader@foxmail.com>
yJader
added a commit
to omni-nicelab/vllm-omni-batching
that referenced
this pull request
Mar 21, 2026
- Add notes to scheduler - Align with vllm-project#1908; move "step_execution" into `AsyncOmniEngine._create_default_diffusion_stage_cfg` - NOTE: Due to 6bdb55a, tests are currently failing and need to be fixed later Signed-off-by: jader <yjader@foxmail.com>
hsliuustc0106
added a commit
to hsliuustc0106/vllm-omni-skills
that referenced
this pull request
Mar 22, 2026
### vllm-omni-audio-tts - Source: [PR #2059](vllm-project/vllm-omni#2059) - [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool - Changes: - Bug fix: [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool ### vllm-omni-perf - Source: [PR #2059](vllm-project/vllm-omni#2059) - [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool - Changes: - Bug fix: [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool ### vllm-omni-api - Source: [PR #2058](vllm-project/vllm-omni#2058) - [Bugfix] Fix Fish Speech and CosyVoice3 online serving - missing is_comprehension and broken model detection - Changes: - Bug fix: [Bugfix] Fix Fish Speech and CosyVoice3 online serving - missing is_comprehension and broken model detection ### vllm-omni-contrib - Source: [PR #2045](vllm-project/vllm-omni#2045) - [Voxtral] Improve example ### vllm-omni-cicd - Source: [PR #2045](vllm-project/vllm-omni#2045) - [Voxtral] Improve example ### vllm-omni-api - Source: [PR #2042](vllm-project/vllm-omni#2042) - [bugfix] /chat/completion doesn't read extra_body for diffusion model - Changes: - Bug fix: [bugfix] /chat/completion doesn't read extra_body for diffusion model ### vllm-omni-perf - Source: [PR #2042](vllm-project/vllm-omni#2042) - [bugfix] /chat/completion doesn't read extra_body for diffusion model - Changes: - Bug fix: [bugfix] /chat/completion doesn't read extra_body for diffusion model ### vllm-omni-contrib - Source: [PR #2038](vllm-project/vllm-omni#2038) - [Doc] Update docs and dockerfiles for rebase of vllm v0.18.0 ### vllm-omni-serving - Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0 ### vllm-omni-contrib - Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0 ### vllm-omni-api - Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0 ### vllm-omni-cicd - Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0 ### vllm-omni-cicd - Source: [PR #2032](vllm-project/vllm-omni#2032) - [CI] Change Bagel online test environment variable `VLLM_TEST_CLEAN_GPU_MEMORY` to `0` ### vllm-omni-cicd - Source: [PR #2031](vllm-project/vllm-omni#2031) - [CI] Fix test. - Changes: - Bug fix: [CI] Fix test. ### vllm-omni-cicd - Source: [PR #2017](vllm-project/vllm-omni#2017) - [CI] [ROCm] Setup `test-ready.yml` and `test-merge.yml` ### vllm-omni-cicd - Source: [PR #2014](vllm-project/vllm-omni#2014) - [Test] Implement mock HTTP request handling in benchmark CLI tests ### vllm-omni-perf - Source: [PR #2014](vllm-project/vllm-omni#2014) - [Test] Implement mock HTTP request handling in benchmark CLI tests ### vllm-omni-serving - Source: [PR #2012](vllm-project/vllm-omni#2012) - [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips - Changes: - Bug fix: [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips ### vllm-omni-image-gen - Source: [PR #2012](vllm-project/vllm-omni#2012) - [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips - Changes: - Bug fix: [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips ### vllm-omni-perf - Source: [PR #2012](vllm-project/vllm-omni#2012) - [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips - Changes: - Bug fix: [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips ### vllm-omni-serving - Source: [PR #2009](vllm-project/vllm-omni#2009) - [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni - Changes: - Bug fix: [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni ### vllm-omni-image-gen - Source: [PR #2007](vllm-project/vllm-omni#2007) - [Bugfix]Fix bug of online server can not return mutli images - Changes: - Bug fix: [Bugfix]Fix bug of online server can not return mutli images - Additions: - Qwen-Image-Layered - Qwen-Image-Layered - Qwen-Image-Layered ### vllm-omni-api - Source: [PR #2007](vllm-project/vllm-omni#2007) - [Bugfix]Fix bug of online server can not return mutli images - Changes: - Bug fix: [Bugfix]Fix bug of online server can not return mutli images ### vllm-omni-cicd - Source: [PR #1998](vllm-project/vllm-omni#1998) - [CI] Split BAGEL tests into dummy/real weight tiers (L2/L3) ### vllm-omni-serving - Source: [PR #1985](vllm-project/vllm-omni#1985) - [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls - Changes: - Performance improvement: [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls ### vllm-omni-audio-tts - Source: [PR #1985](vllm-project/vllm-omni#1985) - [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls - Changes: - Performance improvement: [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls ### vllm-omni-perf - Source: [PR #1985](vllm-project/vllm-omni#1985) - [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls - Changes: - Performance improvement: [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls ### vllm-omni-serving - Source: [PR #1984](vllm-project/vllm-omni#1984) - [CI] [ROCm] Bugfix device environment issue - Changes: - Bug fix: [CI] [ROCm] Bugfix device environment issue ### vllm-omni-api - Source: [PR #1984](vllm-project/vllm-omni#1984) - [CI] [ROCm] Bugfix device environment issue - Changes: - Bug fix: [CI] [ROCm] Bugfix device environment issue ### vllm-omni-serving - Source: [PR #1982](vllm-project/vllm-omni#1982) - [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__ - Changes: - Bug fix: [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__ ### vllm-omni-cicd - Source: [PR #1982](vllm-project/vllm-omni#1982) - [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__ - Changes: - Bug fix: [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__ ### vllm-omni-api - Source: [PR #1979](vllm-project/vllm-omni#1979) - [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) - Changes: - Bug fix: [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) - Additions: - `/v1/chat/completions` ### vllm-omni-perf - Source: [PR #1979](vllm-project/vllm-omni#1979) - [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) - Changes: - Bug fix: [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series) ### vllm-omni-contrib - Source: [PR #1976](vllm-project/vllm-omni#1976) - [skip ci][Docs] Update WeChat QR code (fix filename case) - Changes: - Bug fix: [skip ci][Docs] Update WeChat QR code (fix filename case) ### vllm-omni-contrib - Source: [PR #1974](vllm-project/vllm-omni#1974) - [Docs] Update WeChat QR code for community support ### vllm-omni-cicd - Source: [PR #1945](vllm-project/vllm-omni#1945) - Fix Base voice clone streaming quality and stop-token crash - Changes: - Bug fix: Fix Base voice clone streaming quality and stop-token crash ### vllm-omni-cicd - Source: [PR #1938](vllm-project/vllm-omni#1938) - [Test] L4 complete diffusion feature test for Bagel models - Changes: - New feature: [Test] L4 complete diffusion feature test for Bagel models ### vllm-omni-perf - Source: [PR #1938](vllm-project/vllm-omni#1938) - [Test] L4 complete diffusion feature test for Bagel models - Changes: - New feature: [Test] L4 complete diffusion feature test for Bagel models ### vllm-omni-perf - Source: [PR #1934](vllm-project/vllm-omni#1934) - Fix OmniGen2 transformer config loading for HF models - Changes: - Bug fix: Fix OmniGen2 transformer config loading for HF models ### vllm-omni-audio-tts - Source: [PR #1930](vllm-project/vllm-omni#1930) - [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request ### vllm-omni-perf - Source: [PR #1930](vllm-project/vllm-omni#1930) - [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request ### vllm-omni-audio-tts - Source: [PR #1926](vllm-project/vllm-omni#1926) - [Misc] removed qwen3_tts.py as it is out-dated ### vllm-omni-contrib - Source: [PR #1920](vllm-project/vllm-omni#1920) - [Docs] Add Wan2.1-T2V as supported video generation models - Changes: - New feature: [Docs] Add Wan2.1-T2V as supported video generation models ### vllm-omni-video-gen - Source: [PR #1915](vllm-project/vllm-omni#1915) - [Bugfix] fix helios video generate use cpu device - Changes: - Bug fix: [Bugfix] fix helios video generate use cpu device ### vllm-omni-perf - Source: [PR #1915](vllm-project/vllm-omni#1915) - [Bugfix] fix helios video generate use cpu device - Changes: - Bug fix: [Bugfix] fix helios video generate use cpu device ### vllm-omni-audio-tts - Source: [PR #1913](vllm-project/vllm-omni#1913) - [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False ### vllm-omni-perf - Source: [PR #1913](vllm-project/vllm-omni#1913) - [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False ### vllm-omni-api - Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring ### vllm-omni-perf - Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring ### vllm-omni-contrib - Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring ### vllm-omni-serving - Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring ### vllm-omni-cicd - Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring ### vllm-omni-image-gen - Source: [PR #1900](vllm-project/vllm-omni#1900) - [Feat] support HSDP for Flux family - Changes: - New feature: [Feat] support HSDP for Flux family ### vllm-omni-contrib - Source: [PR #1900](vllm-project/vllm-omni#1900) - [Feat] support HSDP for Flux family - Changes: - New feature: [Feat] support HSDP for Flux family ### vllm-omni-distributed - Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml - Changes: - New feature: [Feature]: Remove some useless `hf_overrides` in yaml ### vllm-omni-quantization - Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml - Changes: - New feature: [Feature]: Remove some useless `hf_overrides` in yaml ### vllm-omni-cicd - Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml - Changes: - New feature: [Feature]: Remove some useless `hf_overrides` in yaml ### vllm-omni-perf - Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml - Changes: - New feature: [Feature]: Remove some useless `hf_overrides` in yaml ### vllm-omni-contrib - Source: [PR #1890](vllm-project/vllm-omni#1890) - [NPU] Upgrade to v0.17.0 ### vllm-omni-contrib - Source: [PR #1889](vllm-project/vllm-omni#1889) - Add `Governance` section - Changes: - New feature: Add `Governance` section ### vllm-omni-distributed - Source: [PR #1881](vllm-project/vllm-omni#1881) - [Feat] Support T5 Tensor Parallelism - Changes: - New feature: [Feat] Support T5 Tensor Parallelism ### vllm-omni-cicd - Source: [PR #1881](vllm-project/vllm-omni#1881) - [Feat] Support T5 Tensor Parallelism - Changes: - New feature: [Feat] Support T5 Tensor Parallelism
5 tasks
5 tasks
1 task
5 tasks
5 tasks
1 task
4 tasks
5 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation.
Based on the design outlined in #967, this code refactors vLLM-Omni’s existing multi-stage pipeline architecture to better align with vLLM’s core design patterns. The refactoring centers around three key design principles:
AsyncOmniproperly implements vLLM'sEngineClientprotocol@yinpeiqi @hsliuustc0106 @Gaohan123 @wuhang2014 @chickeyton
AsyncOmni Architecture (Qwen3-Omni Example)
1. System Architecture
2. Execution Flow (Arrow Steps, one generate request)
3. Runtime Sequence (one generate request)
sequenceDiagram participant APP as App participant AO as AsyncOmni participant ENG as AsyncOmniEngine participant ORCH as Orchestrator participant S0 as Stage-0 Client participant SN as Next Stage Client APP->>AO: generate AO->>AO: start output_handler once AO->>ENG: add_request(stage_id=0, ...) ENG->>ENG: input_processor.process_inputs() ENG->>ORCH: request_queue.put(add_request) ORCH->>ORCH: _handle_add_request ORCH->>S0: add_request_async loop poll route forward ORCH->>S0: get_output_async / get_diffusion_output_async ORCH->>ORCH: _route_output alt need forward to next stage ORCH->>SN: add_request_async end ORCH-->>ENG: output_queue.put end AO->>ENG: try_get_output_blocking ENG-->>AO: message AO-->>APP: yield OmniRequestOutput4. Comparison
V0:
Current:
Test scripts:
TODO