
[Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring#1908

Merged
hsliuustc0106 merged 141 commits into vllm-project:main from fake0fan:refactor
Mar 18, 2026
Conversation

@fake0fan (Contributor) commented Mar 16, 2026

Motivation

Based on the design outlined in #967, this PR refactors vLLM-Omni's existing multi-stage pipeline architecture to better align it with vLLM's core design patterns. The refactoring centers on three key design principles:

  1. EngineClient Protocol Alignment - Ensuring AsyncOmni properly implements vLLM's EngineClient protocol
  2. PipelineOrchestrator - Extracting pipeline orchestration logic into a dedicated component
  3. Communication Optimization - Removing mp.Queue channels in favor of in-process asyncio/janus queues
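To make principle 1 concrete: the contract is a client that exposes async `generate()` / `abort()` entry points. The sketch below is a toy stand-in, not vLLM's actual `EngineClient` definition; `ToyAsyncOmni` and its token-echo behavior are invented purely for illustration:

```python
import asyncio
from typing import AsyncIterator, Protocol

class EngineClient(Protocol):
    """Simplified stand-in for vLLM's EngineClient protocol (illustrative only)."""
    def generate(self, prompt: str, request_id: str) -> AsyncIterator[str]: ...
    async def abort(self, request_id: str) -> None: ...

class ToyAsyncOmni:
    """Toy client satisfying the protocol shape; the real AsyncOmni drives a
    multi-stage pipeline instead of echoing whitespace-split tokens."""
    def __init__(self) -> None:
        self._active: set = set()

    async def generate(self, prompt: str, request_id: str) -> AsyncIterator[str]:
        self._active.add(request_id)
        for token in prompt.split():
            if request_id not in self._active:
                return  # request was aborted mid-stream
            await asyncio.sleep(0)  # yield control, as a real engine loop would
            yield token
        self._active.discard(request_id)

    async def abort(self, request_id: str) -> None:
        self._active.discard(request_id)

async def _run() -> list:
    client: EngineClient = ToyAsyncOmni()  # structural typing: no inheritance needed
    return [tok async for tok in client.generate("hello omni world", "req-0")]

tokens = asyncio.run(_run())  # → ['hello', 'omni', 'world']
```

Because `typing.Protocol` is structural, AsyncOmni only needs matching method shapes, which is what lets it drop into places that expect vLLM's engine client.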

@yinpeiqi @hsliuustc0106 @Gaohan123 @wuhang2014 @chickeyton

AsyncOmni Architecture (Qwen3-Omni Example)

1. System Architecture

  ┌─────────────────────────────────────────────────────────────────────────────────┐
  │                                    API Layer                                    │
  │  ┌─────────────────────────────────────┐  ┌──────────────────────────────────┐  │
  │  │ AsyncOmni (EngineClient)            │  │ Omni                             │  │
  │  │ • generate() / abort() / shutdown() │  │ • generate()                     │  │
  │  │ • _final_output_handler()           │  │                                  │  │
  │  └─────────────────────────────────────┘  └──────────────────────────────────┘  │
  ├─────────────────────────────────────────────────────────────────────────────────┤
  │                              Engine Layer (Proxy)                               │
  │  ┌───────────────────────────────────────────────────────────────────────────┐  │
  │  │ AsyncOmniEngine                                                           │  │
  │  │ • _bootstrap_orchestrator() & _initialize_stages()                        │  │
  │  │ • add_request() / add_request_async() -> input_processor.process_inputs() │  │
  │  │ • try_get_output() / try_get_output_async()                               │  │
  │  └───────────────────┬─────────────────────────────────▲─────────────────────┘  │
  │         request_queue (janus.Queue)        output_queue (janus.Queue)           │
  ├──────────────────────┼─────────────────────────────────┼────────────────────────┤
  │                      ▼        Orchestration Layer      │                        │
  │  ┌───────────────────────────────────────────────────────────────────────────┐  │
  │  │ Orchestrator [background thread]                                          │  │
  │  │ • _request_handler()                                                      │  │
  │  │     -  stage_client.add_request_async() & _prewarm_async_chunk_stages()   │  │
  │  │ • _orchestration_output_handler()                                         │  │
  │  │     -  _process_stage_outputs() -> output_processors[i].process_outputs() │  │
  │  │     -  _route_output() & _forward_to_next_stage()                         │  │
  │  └──────────┬─────────────────────────┬────────────────────────┬─────────────┘  │
  ├─────────────┼─────────────────────────┼────────────────────────┼────────────────┤
  │             │                 Communication Layer              │                │
  │  ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐  │
  │  │ StageEngineCoreClient │ │ StageEngineCoreClient │ │ StageDiffusionClient  │  │
  │  │ • ZMQ ROUTER / PULL   │ │ • ZMQ ROUTER / PULL   │ │ • ZMQ ROUTER / PULL   │  │
  │  │ • Msgpack codec       │ │ • Msgpack codec       │ │ • Msgpack codec       │  │
  │  └──────────┬────────────┘ └──────────┬────────────┘ └──────────┬────────────┘  │
  │             ▼ ZMQ IPC                 ▼ ZMQ IPC                 ▼ ZMQ IPC       │
  ├─────────────────────────────────────────────────────────────────────────────────┤
  │                                 Execution Layer                                 │
  │  ┌───────────────────────┐ ┌───────────────────────┐ ┌───────────────────────┐  │
  │  │ StageCoreProc         │ │ StageCoreProc         │ │ DiffusionEngine       │  │
  │  │ [background process]  │ │ [background process]  │ │ [background process]  │  │
  │  └───────────────────────┘ └───────────────────────┘ └───────────────────────┘  │
  └─────────────────────────────────────────────────────────────────────────────────┘
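The janus queues between the engine proxy and the orchestrator thread bridge the sync and async worlds. Below is a stdlib-only sketch of the same hand-off pattern, using `queue.Queue` plus `asyncio.to_thread` in place of `janus.Queue`; the message shapes and the uppercase "stage" are made up for the demo:

```python
import asyncio
import queue
import threading

# Thread-safe queues between the engine side and the orchestrator thread.
# janus.Queue in the real code additionally exposes an async endpoint so
# neither event loop has to block on the other.
request_queue: "queue.Queue" = queue.Queue()
output_queue: "queue.Queue" = queue.Queue()

def orchestrator() -> None:
    """Background thread running its own event loop, like the Orchestrator."""
    async def loop() -> None:
        while True:
            # Blocking get is pushed to a worker thread so the loop stays live.
            msg = await asyncio.to_thread(request_queue.get)
            if msg is None:  # shutdown sentinel
                return
            # Stand-in for stage_client.add_request_async + output polling.
            output_queue.put({"request_id": msg["request_id"],
                              "text": msg["prompt"].upper(),
                              "finished": True})
    asyncio.run(loop())

t = threading.Thread(target=orchestrator, daemon=True)
t.start()
request_queue.put({"request_id": "r0", "prompt": "ping"})
result = output_queue.get(timeout=5)
request_queue.put(None)  # ask the orchestrator loop to exit
t.join(timeout=5)
```

Here `result` comes back as `{"request_id": "r0", "text": "PING", "finished": True}`; the point is that a thread boundary (not a process boundary) is crossed, so payloads are never pickled.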

2. Execution Flow (Arrow Steps, one generate request)

[1] App
    -> AsyncOmni.generate(prompt, request_id)

[2] AsyncOmni
    -> _final_output_handler()   (started on first request)
    -> AsyncOmniEngine.add_request(stage_id=0, ...)

[3] AsyncOmniEngine.add_request
    -> (if stage-0 is llm and input is not EngineCoreRequest)
       InputProcessor.process_inputs()
       OutputProcessor[0].add_request()
    -> request_queue.put(add_request_msg)

[4] Orchestrator._request_handler
    -> _handle_add_request(msg)
    -> stage_clients[0].add_request_async(...)

[5] Orchestrator._orchestration_loop (loop)
    -> poll stage output
       - llm stage: await get_output_async()
       - diffusion stage: get_diffusion_output_async()
    -> (llm stage) output_processors[i].process_outputs(...)
    -> _route_output(...)
    -> if finished and not final_stage and non-async-chunk:
         _forward_to_next_stage(...)
         -> next_stage.add_request_async(...)
    -> output_queue.put(output)

[6] AsyncOmni._final_output_loop (background coroutine)
    -> AsyncOmniEngine.try_get_output_blocking()
    -> route by request_id to ClientRequestState.queue

[7] AsyncOmni._process_orchestrator_results
    -> read from ClientRequestState.queue
    -> _process_single_result(...)
    -> yield OmniRequestOutput

[8] Exit condition
    -> receive result["finished"] == True
    -> generate() ends
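The routing rule in step [5] can be condensed as follows. This is a hypothetical simplification: every stage output is surfaced on the output queue, and a finished output from a non-final, non-async-chunk stage is additionally forwarded. The real `_route_output` handles more cases (prewarmed async-chunk stages, diffusion outputs):

```python
def route_output(output: dict, stage_id: int, num_stages: int,
                 async_chunk: bool = False) -> list:
    """Decide what to do with one stage output (simplified sketch)."""
    actions = ["emit_to_output_queue"]  # step [5] always surfaces the output
    if output.get("finished") and stage_id < num_stages - 1 and not async_chunk:
        actions.append("forward_to_next_stage")  # _forward_to_next_stage(...)
    return actions

# A 3-stage pipeline (thinker -> talker -> vocoder):
# a finished stage-0 output is surfaced and forwarded...
assert route_output({"finished": True}, 0, 3) == [
    "emit_to_output_queue", "forward_to_next_stage"]
# ...while the final stage only surfaces to the client.
assert route_output({"finished": True}, 2, 3) == ["emit_to_output_queue"]
```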

3. Runtime Sequence (one generate request)

sequenceDiagram
    participant APP as App
    participant AO as AsyncOmni
    participant ENG as AsyncOmniEngine
    participant ORCH as Orchestrator
    participant S0 as Stage-0 Client
    participant SN as Next Stage Client

    APP->>AO: generate
    AO->>AO: start output_handler once
    AO->>ENG: add_request(stage_id=0, ...)
    ENG->>ENG: input_processor.process_inputs()
    ENG->>ORCH: request_queue.put(add_request)

    ORCH->>ORCH: _handle_add_request
    ORCH->>S0: add_request_async

    loop poll route forward
        ORCH->>S0: get_output_async / get_diffusion_output_async
        ORCH->>ORCH: _route_output
        alt need forward to next stage
            ORCH->>SN: add_request_async
        end
        ORCH-->>ENG: output_queue.put
    end

    AO->>ENG: try_get_output_blocking
    ENG-->>AO: message
    AO-->>APP: yield OmniRequestOutput
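Steps [6] and [7] in the flow above (demultiplexing one engine output stream into per-request queues by `request_id`) can be sketched like this; `ClientRequestState` here is a minimal stand-in with only the queue field:

```python
import asyncio

class ClientRequestState:
    """Hypothetical per-request state: owns the queue its generate() reads from."""
    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()

async def final_output_loop(engine_outputs: asyncio.Queue, states: dict) -> None:
    """Step [6]: drain raw engine outputs and route each by request_id."""
    while True:
        out = await engine_outputs.get()
        if out is None:  # shutdown sentinel
            return
        await states[out["request_id"]].queue.put(out)

async def _demo() -> list:
    engine_outputs: asyncio.Queue = asyncio.Queue()
    states = {"a": ClientRequestState(), "b": ClientRequestState()}
    router = asyncio.create_task(final_output_loop(engine_outputs, states))
    # Interleaved outputs for two concurrent requests:
    for rid, text in [("a", "A1"), ("b", "B1"), ("a", "A2")]:
        await engine_outputs.put({"request_id": rid, "text": text})
    await engine_outputs.put(None)
    await router
    # Step [7]: each request reads only its own queue, in order.
    out = []
    while not states["a"].queue.empty():
        out.append((await states["a"].queue.get())["text"])
    return out

texts = asyncio.run(_demo())  # → ['A1', 'A2']
```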

4. Comparison

V0:

┌────────────────────────────────────────────────────────────────────────────┐
│ Main Process                                                               │
│  ┌──────────────────────┐   ┌────────────────────────────────────────────┐ │
│  │ generate()           │   │ final_output_handler()                     │ │
│  └──────────────────────┘   └────────────────────────────────────────────┘ │
└──────────┬─────────────────────────┬─────────────────────────┬─────────────┘
  mp.Queue (in_q/out_q)    mp.Queue (in_q/out_q)    mp.Queue (in_q/out_q)
           ▼▲                        ▼▲                        ▼▲
┌───────────────────────┐  ┌───────────────────────┐  ┌──────────────────────┐
│ Worker Proc-0         │  │ Worker Proc-1         │  │ Worker Proc-2        │
│ (Thinker LLM)         │  │ (Talker LLM)          │  │ (Vocoder)            │
│  ┌────────────────┐   │  │  ┌────────────────┐   │  │  ┌────────────────┐  │
│  │_stage_worker   │   │  │  │_stage_worker   │   │  │  │_stage_worker   │  │
│  │_async()        │   │  │  │_async()        │   │  │  │_async()        │  │
│  └────────────────┘   │  │  └────────────────┘   │  │  └────────────────┘  │
│  ┌────────────────┐   │  │  ┌────────────────┐   │  │  ┌────────────────┐  │
│  │output_handler()│   │  │  │output_handler()│   │  │  │output_handler()│  │
│  └────────────────┘   │  │  └────────────────┘   │  │  └────────────────┘  │
└──────────┬────────────┘  └──────────┬────────────┘  └──────────┬───────────┘
       ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ
┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐
│ EngineCore Proc-0    │   │ EngineCore Proc-1    │   │ EngineCore Proc-2    │
│ (Thinker)            │   │ (Talker)             │   │ (Vocoder)            │
└──────────────────────┘   └──────────────────────┘   └──────────────────────┘

Current:

┌────────────────────────────────────────────────────────────────────────────┐
│ Main Process                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ Main Thread                                                          │  │
│  │  ┌──────────────────────┐   ┌─────────────────────────────────────┐  │  │
│  │  │ generate()           │   │ final_output_handler()              │  │  │
│  │  └──────────────────────┘   └─────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│         asyncio.Queue (request_queue) ▼  ▲ asyncio.Queue (output_queue)    │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │ Orchestrator Thread                                                  │  │
│  │  ┌──────────────────────┐  ┌──────────────────────────────────────┐  │  │
│  │  │ _request_handler()   │  │ _orchestration_output_handler()      │  │  │
│  │  └──────────────────────┘  └──────────────────────────────────────┘  │  │
│  │  ┌────────────────────────────────────────────────────────────────┐  │  │
│  │  │ engine_core_output_handler()*3  (stage-0 / stage-1 / stage-2)  │  │  │
│  │  └────────────────────────────────────────────────────────────────┘  │  │
│  └───────┬─────────────────────────┬─────────────────────────┬──────────┘  │
└──────────┬─────────────────────────┬─────────────────────────┬─────────────┘
       ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ               ZMQ ▼ ▲ ZMQ  
  ┌──────────────────────┐  ┌──────────────────────┐  ┌──────────────────────┐
  │ EngineCore Proc-0    │  │ EngineCore Proc-1    │  │ EngineCore Proc-2    │
  │ (Thinker)            │  │ (Talker)             │  │ (Vocoder)            │
  └──────────────────────┘  └──────────────────────┘  └──────────────────────┘

Test scripts:

# enter the offline inference examples folder
cd qwen2_5_omni

# legacy impl:
VLLM_OMNI_USE_V1=0 VLLM_LOGGING_LEVEL=INFO python end2end_v1.py --output-wav output_audio \
                  --query-type use_mixed_modalities
# current impl:
VLLM_OMNI_USE_V1=1 VLLM_LOGGING_LEVEL=INFO python end2end_v1.py --output-wav output_audio \
                  --query-type use_mixed_modalities

cd qwen3_omni
# legacy impl:
VLLM_OMNI_USE_V1=0 python end2end_v1.py --output-wav output_audio --query-type text --async-chunk --enable-stats
# current impl:
VLLM_OMNI_USE_V1=1 python end2end_v1.py --output-wav output_audio --query-type text --async-chunk --enable-stats

cd bagel
# legacy impl:
VLLM_OMNI_USE_V1=0 python end2end_v1.py --prompts "A cute cat"
# current impl:
VLLM_OMNI_USE_V1=1 python end2end_v1.py --prompts "A cute cat"

cd text_to_image
# legacy impl:
VLLM_OMNI_USE_V1=0 python text_to_image_async.py --prompt "a cup of coffee on the table" --output output.png
# current impl:
VLLM_OMNI_USE_V1=1 python text_to_image_async.py --prompt "a cup of coffee on the table" --output output.png

TODO

  • 1. Run the diffusion engine in a separate process (currently inline) @chickeyton
  • 2. Headless serving @wuhang2014
  • 3. Documentation for the new architecture
  • 4. Write a blog post
  • 5. Unit tests for AsyncOmni, Omni, AsyncOmniEngine, and Orchestrator
  • 6. Metrics
  • 7. Refactor the Bagel CFG logic
  • 8. To discuss later: queue for collect RPC
  • 9. Verify GLM-Image
  • 10. To discuss later: does Qwen3-TTS need an independent input processor?

@jasonlee-1024
I noticed that --headless has been deprecated, and Ray is now the recommended backend. However, diffusion models do not currently support Ray. In that case, what is the recommended way to deploy omni models across multiple nodes? Also, what are the future plans for multi-node support?

@fake0fan (Contributor, Author)

> I noticed that --headless has been deprecated, and Ray is now the recommended backend. However, diffusion models do not currently support Ray. In that case, what is the recommended way to deploy omni models across multiple nodes? Also, what are the future plans for multi-node support?

The functionality has already been implemented (fake0fan#24) and is expected to be merged in the near future.

Labels

high priority, ready
