
[Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring#1908

Merged
hsliuustc0106 merged 141 commits into vllm-project:main from fake0fan:refactor
Mar 18, 2026
Conversation

@fake0fan (Contributor) commented Mar 16, 2026

Motivation

Based on the design outlined in #967, this PR refactors vLLM-Omni's existing multi-stage pipeline architecture to better align it with vLLM's core design patterns. The refactoring centers on three key design principles:

  1. EngineClient Protocol Alignment - Ensuring AsyncOmni properly implements vLLM's EngineClient protocol
  2. PipelineOrchestrator - Extracting pipeline orchestration logic into a dedicated component
  3. Communication Optimization - Removing mp.Queue channels in favor of in-process asyncio/janus queues
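To make principle 1 concrete: the contract is a client that exposes async `generate()` / `abort()` entry points. The sketch below is a toy stand-in, not vLLM's actual `EngineClient` definition; `ToyAsyncOmni` and its token-echo behavior are invented purely for illustration:

```python
import asyncio
from typing import AsyncIterator, Protocol

class EngineClient(Protocol):
    """Simplified stand-in for vLLM's EngineClient protocol (illustrative only)."""
    def generate(self, prompt: str, request_id: str) -> AsyncIterator[str]: ...
    async def abort(self, request_id: str) -> None: ...

class ToyAsyncOmni:
    """Toy client satisfying the protocol shape; the real AsyncOmni drives a
    multi-stage pipeline instead of echoing whitespace-split tokens."""
    def __init__(self) -> None:
        self._active: set = set()

    async def generate(self, prompt: str, request_id: str) -> AsyncIterator[str]:
        self._active.add(request_id)
        for token in prompt.split():
            if request_id not in self._active:
                return  # request was aborted mid-stream
            await asyncio.sleep(0)  # yield control, as a real engine loop would
            yield token
        self._active.discard(request_id)

    async def abort(self, request_id: str) -> None:
        self._active.discard(request_id)

async def _run() -> list:
    client: EngineClient = ToyAsyncOmni()  # structural typing: no inheritance needed
    return [tok async for tok in client.generate("hello omni world", "req-0")]

tokens = asyncio.run(_run())  # → ['hello', 'omni', 'world']
```

Because `typing.Protocol` is structural, AsyncOmni only needs matching method shapes, which is what lets it drop into places that expect vLLM's engine client.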

@yinpeiqi @hsliuustc0106 @Gaohan123 @wuhang2014 @chickeyton

AsyncOmni Architecture (Qwen3-Omni Example)

1. System Architecture

  ┌─────────────────────────────────────────────────────────────────────────────────┐
  │                                    API Layer                                    │
  │  ┌─────────────────────────────────────┐  ┌──────────────────────────────────┐  │
  │  │ AsyncOmni (EngineClient)            │  │ Omni                             │  │
  │  │ • generate() / abort() / shutdown() │  │ • generate()                     │  │
  │  │ • _final_output_handler()           │  │                                  │  │
  │  └─────────────────────────────────────┘  └──────────────────────────────────┘  │
  ├─────────────────────────────────────────────────────────────────────────────────┤
  │                              Engine Layer (Proxy)                               │
  │  ┌───────────────────────────────────────────────────────────────────────────┐  │
  │  │ AsyncOmniEngine                                                           │  │
  │  │ • _bootstrap_orchestrator() & _initialize_stages()                        │  │
  │  │ • add_request() / add_request_async() -> input_processor.process_inputs() │  │
  │  │ • try_get_output() / try_get_output_async()                               │  │
  │  └───────────────────┬─────────────────────────────────▲─────────────────────┘  │
  │         request_queue (janus.Queue)        output_queue (janus.Queue)           │
  ├──────────────────────┼─────────────────────────────────┼────────────────────────┤
  │                      ▼        Orchestration Layer      │                        │
  │  ┌───────────────────────────────────────────────────────────────────────────┐  │
  │  │ Orchestrator [background thread]                                          │  │
  │  │ • _request_handler()                                                      │  │
  │  │     -  stage_client.add_request_async() & _prewarm_async_chunk_stages()   │  │
  │  │ • _orchestration_output_handler()                                         │  │
  │  │     -  _process_stage_outputs() -> output_processors[i].process_outputs() │  │
  │  │     -  _route_output() & _forward_to_next_stage()                         │  │
  │  └──────────┬─────────────────────────┬────────────────────────┬─────────────┘  │
  ├─────────────┼─────────────────────────┼────────────────────────┼────────────────┤
  │             │                 Communication Layer              │                │
  │  ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐  │
  │  │ StageEngineCoreClient │ │ StageEngineCoreClient │ │ StageDiffusionClient  │  │
  │  │ • ZMQ ROUTER / PULL   │ │ • ZMQ ROUTER / PULL   │ │ • ZMQ ROUTER / PULL   │  │
  │  │ • Msgpack codec       │ │ • Msgpack codec       │ │ • Msgpack codec       │  │
  │  └──────────┬────────────┘ └──────────┬────────────┘ └──────────┬────────────┘  │
  │             ▼ ZMQ IPC                 ▼ ZMQ IPC                 ▼ ZMQ IPC       │
  ├─────────────────────────────────────────────────────────────────────────────────┤
  │                                 Execution Layer                                 │
  │  ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐  │
  │  │ StageCoreProc         │ │ StageCoreProc         │ │ DiffusionEngine       │  │
  │  │ [background process]  │ │ [background process]  │ │ [background process]  │  │
  │  └───────────────────────┘ └───────────────────────┘ └───────────────────────┘  │
  └─────────────────────────────────────────────────────────────────────────────────┘
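The janus queues between the engine proxy and the orchestrator thread bridge the sync and async worlds. Below is a stdlib-only sketch of the same hand-off pattern, using `queue.Queue` plus `asyncio.to_thread` in place of `janus.Queue`; the message shapes and the uppercase "stage" are made up for the demo:

```python
import asyncio
import queue
import threading

# Thread-safe queues between the engine side and the orchestrator thread.
# janus.Queue in the real code additionally exposes an async endpoint so
# neither event loop has to block on the other.
request_queue: "queue.Queue" = queue.Queue()
output_queue: "queue.Queue" = queue.Queue()

def orchestrator() -> None:
    """Background thread running its own event loop, like the Orchestrator."""
    async def loop() -> None:
        while True:
            # Blocking get is pushed to a worker thread so the loop stays live.
            msg = await asyncio.to_thread(request_queue.get)
            if msg is None:  # shutdown sentinel
                return
            # Stand-in for stage_client.add_request_async + output polling.
            output_queue.put({"request_id": msg["request_id"],
                              "text": msg["prompt"].upper(),
                              "finished": True})
    asyncio.run(loop())

t = threading.Thread(target=orchestrator, daemon=True)
t.start()
request_queue.put({"request_id": "r0", "prompt": "ping"})
result = output_queue.get(timeout=5)
request_queue.put(None)  # ask the orchestrator loop to exit
t.join(timeout=5)
```

Here `result` comes back as `{"request_id": "r0", "text": "PING", "finished": True}`; the point is that a thread boundary (not a process boundary) is crossed, so payloads are never pickled.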

2. Execution Flow (Arrow Steps, one generate request)

[1] App
    -> AsyncOmni.generate(prompt, request_id)

[2] AsyncOmni
    -> _final_output_handler()   (started on first request)
    -> AsyncOmniEngine.add_request(stage_id=0, ...)

[3] AsyncOmniEngine.add_request
    -> (if stage-0 is llm and input is not EngineCoreRequest)
       InputProcessor.process_inputs()
       OutputProcessor[0].add_request()
    -> request_queue.put(add_request_msg)

[4] Orchestrator._request_handler
    -> _handle_add_request(msg)
    -> stage_clients[0].add_request_async(...)

[5] Orchestrator._orchestration_loop (loop)
    -> poll stage output
       - llm stage: await get_output_async()
       - diffusion stage: get_diffusion_output_async()
    -> (llm stage) output_processors[i].process_outputs(...)
    -> _route_output(...)
    -> if finished and not final_stage and non-async-chunk:
         _forward_to_next_stage(...)
         -> next_stage.add_request_async(...)
    -> output_queue.put(output)

[6] AsyncOmni._final_output_loop (background coroutine)
    -> AsyncOmniEngine.try_get_output_blocking()
    -> route by request_id to ClientRequestState.queue

[7] AsyncOmni._process_orchestrator_results
    -> read from ClientRequestState.queue
    -> _process_single_result(...)
    -> yield OmniRequestOutput

[8] Exit condition
    -> receive result["finished"] == True
    -> generate() ends
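The routing rule in step [5] can be condensed as follows. This is a hypothetical simplification: every stage output is surfaced on the output queue, and a finished output from a non-final, non-async-chunk stage is additionally forwarded. The real `_route_output` handles more cases (prewarmed async-chunk stages, diffusion outputs):

```python
def route_output(output: dict, stage_id: int, num_stages: int,
                 async_chunk: bool = False) -> list:
    """Decide what to do with one stage output (simplified sketch)."""
    actions = ["emit_to_output_queue"]  # step [5] always surfaces the output
    if output.get("finished") and stage_id < num_stages - 1 and not async_chunk:
        actions.append("forward_to_next_stage")  # _forward_to_next_stage(...)
    return actions

# A 3-stage pipeline (thinker -> talker -> vocoder):
# a finished stage-0 output is surfaced and forwarded...
assert route_output({"finished": True}, 0, 3) == [
    "emit_to_output_queue", "forward_to_next_stage"]
# ...while the final stage only surfaces to the client.
assert route_output({"finished": True}, 2, 3) == ["emit_to_output_queue"]
```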

3. Runtime Sequence (one generate request)

sequenceDiagram
    participant APP as App
    participant AO as AsyncOmni
    participant ENG as AsyncOmniEngine
    participant ORCH as Orchestrator
    participant S0 as Stage-0 Client
    participant SN as Next Stage Client

    APP->>AO: generate
    AO->>AO: start output_handler once
    AO->>ENG: add_request(stage_id=0, ...)
    ENG->>ENG: input_processor.process_inputs()
    ENG->>ORCH: request_queue.put(add_request)

    ORCH->>ORCH: _handle_add_request
    ORCH->>S0: add_request_async

    loop poll route forward
        ORCH->>S0: get_output_async / get_diffusion_output_async
        ORCH->>ORCH: _route_output
        alt need forward to next stage
            ORCH->>SN: add_request_async
        end
        ORCH-->>ENG: output_queue.put
    end

    AO->>ENG: try_get_output_blocking
    ENG-->>AO: message
    AO-->>APP: yield OmniRequestOutput
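Steps [6] and [7] in the flow above (demultiplexing one engine output stream into per-request queues by `request_id`) can be sketched like this; `ClientRequestState` here is a minimal stand-in with only the queue field:

```python
import asyncio

class ClientRequestState:
    """Hypothetical per-request state: owns the queue its generate() reads from."""
    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()

async def final_output_loop(engine_outputs: asyncio.Queue, states: dict) -> None:
    """Step [6]: drain raw engine outputs and route each by request_id."""
    while True:
        out = await engine_outputs.get()
        if out is None:  # shutdown sentinel
            return
        await states[out["request_id"]].queue.put(out)

async def _demo() -> list:
    engine_outputs: asyncio.Queue = asyncio.Queue()
    states = {"a": ClientRequestState(), "b": ClientRequestState()}
    router = asyncio.create_task(final_output_loop(engine_outputs, states))
    # Interleaved outputs for two concurrent requests:
    for rid, text in [("a", "A1"), ("b", "B1"), ("a", "A2")]:
        await engine_outputs.put({"request_id": rid, "text": text})
    await engine_outputs.put(None)
    await router
    # Step [7]: each request reads only its own queue, in order.
    out = []
    while not states["a"].queue.empty():
        out.append((await states["a"].queue.get())["text"])
    return out

texts = asyncio.run(_demo())  # → ['A1', 'A2']
```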

4. Comparison

V0:

┌────────────────────────────────────────────────────────────────────────────┐
│ Main Process                                                               │
│  ┌──────────────────────┐   ┌────────────────────────────────────────────┐ │
│  │ generate()           │   │ final_output_handler()                     │ │
│  └──────────────────────┘   └────────────────────────────────────────────┘ │
└──────────┬─────────────────────────┬─────────────────────────┬─────────────┘
  mp.Queue (in_q/out_q)    mp.Queue (in_q/out_q)    mp.Queue (in_q/out_q)
           ▼▲                        ▼▲                        ▼▲
┌───────────────────────┐  ┌───────────────────────┐  ┌──────────────────────┐
│ Worker Proc-0         │  │ Worker Proc-1         │  │ Worker Proc-2        │
│ (Thinker LLM)         │  │ (Talker LLM)          │  │ (Vocoder)            │
│  ┌────────────────┐   │  │  ┌────────────────┐   │  │  ┌────────────────┐  │
│  │_stage_worker   │   │  │  │_stage_worker   │   │  │  │_stage_worker   │  │
│  │_async()        │   │  │  │_async()        │   │  │  │_async()        │  │
│  └────────────────┘   │  │  └────────────────┘   │  │  └────────────────┘  │
│  ┌────────────────┐   │  │  ┌────────────────┐   │  │  ┌────────────────┐  │
│  │output_handler()│   │  │  │output_handler()│   │  │  │output_handler()│  │
│  └────────────────┘   │  │  └────────────────┘   │  │  └────────────────┘  │
└──────────┬────────────┘  └──────────┬────────────┘  └──────────┬───────────┘
       ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ
┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐
│ EngineCore Proc-0    │   │ EngineCore Proc-1    │   │ EngineCore Proc-2    │
│ (Thinker)            │   │ (Talker)             │   │ (Vocoder)            │
└──────────────────────┘   └──────────────────────┘   └──────────────────────┘

Current:

┌────────────────────────────────────────────────────────────────────────────┐
│ Main Process                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ Main Thread                                                          │  │
│  │  ┌──────────────────────┐   ┌─────────────────────────────────────┐  │  │
│  │  │ generate()           │   │ final_output_handler()              │  │  │
│  │  └──────────────────────┘   └─────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│         asyncio.Queue (request_queue) ▼  ▲ asyncio.Queue (output_queue)    │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ Orchestrator Thread                                                  │  │
│  │  ┌──────────────────────┐  ┌──────────────────────────────────────┐  │  │
│  │  │ _request_handler()   │  │ _orchestration_output_handler()      │  │  │
│  │  └──────────────────────┘  └──────────────────────────────────────┘  │  │
│  │  ┌────────────────────────────────────────────────────────────────┐  │  │
│  │  │ engine_core_output_handler()*3  (stage-0 / stage-1 / stage-2)  │  │  │
│  │  └────────────────────────────────────────────────────────────────┘  │  │
│  └───────┬─────────────────────────┬─────────────────────────┬──────────┘  │
└──────────┬─────────────────────────┬─────────────────────────┬─────────────┘
       ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ  
  ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
  │ EngineCore Proc-0    │  │ EngineCore Proc-1    │  │ EngineCore Proc-2    │
  │ (Thinker)            │  │ (Talker)             │  │ (Vocoder)            │
  └──────────────────────┘  └──────────────────────┘  └──────────────────────┘

Test scripts:

# enter the offline inference examples folder
cd qwen2_5_omni

# legacy impl:
VLLM_OMNI_USE_V1=0 VLLM_LOGGING_LEVEL=INFO python end2end_v1.py --output-wav output_audio \
                  --query-type use_mixed_modalities
# current impl:
VLLM_OMNI_USE_V1=1 VLLM_LOGGING_LEVEL=INFO python end2end_v1.py --output-wav output_audio \
                  --query-type use_mixed_modalities

cd qwen3_omni
# legacy impl:
VLLM_OMNI_USE_V1=0 python end2end_v1.py --output-wav output_audio --query-type text --async-chunk --enable-stats
# current impl:
VLLM_OMNI_USE_V1=1 python end2end_v1.py --output-wav output_audio --query-type text --async-chunk --enable-stats

cd bagel
# legacy impl:
VLLM_OMNI_USE_V1=0 python end2end_v1.py --prompts "A cute cat"
# current impl:
VLLM_OMNI_USE_V1=1 python end2end_v1.py --prompts "A cute cat"

cd text_to_image
# legacy impl:
VLLM_OMNI_USE_V1=0 python text_to_image_async.py --prompt "a cup of coffee on the table" --output output.png
# current impl:
VLLM_OMNI_USE_V1=1 python text_to_image_async.py --prompt "a cup of coffee on the table" --output output.png

TODO

  • 1. Run the diffusion engine in a separate process (currently inline) @chickeyton
  • 2. Headless serving @wuhang2014
  • 3. Documentation for the new architecture
  • 4. Write a blog post
  • 5. Unit tests for AsyncOmni, Omni, AsyncOmniEngine, and Orchestrator
  • 6. Metrics
  • 7. Refactor the Bagel CFG logic
  • 8. To discuss later: queue for collect RPC
  • 9. Verify GLM-Image
  • 10. To discuss later: does Qwen3-TTS need an independent input processor?

@jasonlee-1024
I noticed that --headless has been deprecated, and Ray is now the recommended backend. However, diffusion models do not currently support Ray. In that case, what is the recommended way to deploy omni models across multiple nodes? Also, what are the future plans for multi-node support?

@fake0fan (Contributor, Author)

> I noticed that --headless has been deprecated, and Ray is now the recommended backend. However, diffusion models do not currently support Ray. In that case, what is the recommended way to deploy omni models across multiple nodes? Also, what are the future plans for multi-node support?

The functionality has already been implemented (fake0fan#24) and is expected to be merged in the near future.

Labels

high priority, ready
