Merged
22 commits
cf7c02a
reduce TTFA by lower initial codec frames required at start of decoding
Mar 1, 2026
88ed021
update docs
Mar 1, 2026
c2f4550
update examples
Mar 1, 2026
19f5f80
add time to e2e script to compute TTFC
Mar 2, 2026
29e3416
add a simple test for streaming decoding with variable initial chunk …
Mar 2, 2026
f75b3ef
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 2, 2026
ff6d6c4
last warmup chunk must overlap with the normal path
Mar 2, 2026
3a0a5d4
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 2, 2026
d54d2bf
fix
Mar 2, 2026
912e536
from warmup to initial phase
Mar 2, 2026
360e1d3
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 2, 2026
b78a146
remove warmup in docs
Mar 3, 2026
217b825
merge main
Mar 3, 2026
490a8e3
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 3, 2026
5636536
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 4, 2026
f9977b7
update docs and examples
Mar 4, 2026
439abba
add per-request configurable initial_codec_chunk_frames
Mar 4, 2026
e6798d5
add test for configurable initial_codec_chunk_frames
Mar 4, 2026
ea46784
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 4, 2026
4b0bb24
update comment
Mar 4, 2026
008aab7
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 4, 2026
5a70eba
Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…
Mar 4, 2026
2 changes: 1 addition & 1 deletion docs/design/feature/async_chunk_design.md
@@ -19,7 +19,7 @@ The `async_chunk` feature enables asynchronous, chunked processing of data acros

For qwen3-omni:
- **Thinker → Talker**: Per decode step (typically chunk_size=1)
- **Talker → Code2Wav**: Accumulated to code2wav chunk_size(default=25, currently only support default, will support chunk_size soon) before sending
- **Talker → Code2Wav**: Accumulated to `codec_chunk_frames` (default=25) before sending. Set `initial_codec_chunk_frames` to emit smaller chunks during the initial phase for reduced TTFA
- **Code2Wav**: Streaming decode with code2wav chunk_size
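
The emission schedule for the Talker → Code2Wav edge can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function name `emission_boundaries` is hypothetical, and the rule that the initial phase covers `(chunk // initial) * initial` frames is taken from a comment in this PR's tests.

```python
from __future__ import annotations


def emission_boundaries(total_frames: int, chunk: int = 25, initial: int = 0) -> list[int]:
    """Frame counts at which a chunk would be sent (hypothetical helper).

    Assumption: the initial phase emits every `initial` frames until
    (chunk // initial) * initial frames are covered, then every `chunk` frames.
    """
    # initial >= chunk is clamped to chunk, so it behaves like the normal path.
    step = min(initial, chunk) if initial > 0 else 0
    coverage = (chunk // step) * step if step else 0
    points = list(range(step, coverage + 1, step)) if step else []
    points += list(range(coverage + chunk, total_frames + 1, chunk))
    return [p for p in points if p <= total_frames]
```

With the defaults, the first chunk leaves after 25 frames; with `initial=10` it leaves after 10 frames, which is where the TTFA reduction comes from.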

With `async_chunk`:
1 change: 1 addition & 0 deletions docs/serving/speech_api.md
@@ -96,6 +96,7 @@ Content-Type: application/json
| `language` | string | "Auto" | Language (see supported languages below) |
| `instructions` | string | "" | Voice style/emotion instructions |
| `max_new_tokens` | integer | 2048 | Maximum tokens to generate |
| `initial_codec_chunk_frames` | integer | null | Initial chunk size for reduced TTFA (overrides stage config) |
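
A per-request override might be sent like this. The sketch uses only the standard library; the helper name `build_speech_request` is hypothetical, the `input` field name is assumed from the OpenAI-style speech API, and the URL is the one used in the curl examples elsewhere in these docs.

```python
from __future__ import annotations

import json
import urllib.request

# Endpoint as used in the curl examples; adjust host and port for your deployment.
SPEECH_URL = "http://localhost:8091/v1/audio/speech"


def build_speech_request(
    text: str,
    initial_codec_chunk_frames: int | None = None,
    url: str = SPEECH_URL,
) -> urllib.request.Request:
    """Build (but do not send) a streaming speech request (hypothetical helper)."""
    if initial_codec_chunk_frames is not None and initial_codec_chunk_frames < 0:
        # Mirrors the ge=0 constraint on the request model.
        raise ValueError("initial_codec_chunk_frames must be >= 0")
    # "input" is an assumed field name; stream/response_format follow the streaming docs.
    body = {"input": text, "stream": True, "response_format": "pcm"}
    if initial_codec_chunk_frames is not None:
        body["initial_codec_chunk_frames"] = initial_codec_chunk_frames
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` against a running server would send it. Leaving the parameter unset keeps the stage-config value.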

**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

3 changes: 2 additions & 1 deletion docs/user_guide/examples/offline_inference/qwen3_tts.md
@@ -98,7 +98,8 @@ Add `--streaming` to stream audio chunks progressively via `AsyncOmni` (requires
python end2end.py --query-type CustomVoice --streaming --output-dir /tmp/out_stream
```

Each 25-frame Code2Wav chunk is logged as it arrives. The final WAV file is written once generation
Each Code2Wav chunk is logged as it arrives (default 25 frames; configurable via `codec_chunk_frames`
and `initial_codec_chunk_frames` in the stage config). The final WAV file is written once generation
completes. This demonstrates that audio data is available progressively rather than only at the end.

> **Note:** Streaming uses `AsyncOmni` internally. The non-streaming path (`Omni`) is unchanged.
2 changes: 1 addition & 1 deletion docs/user_guide/examples/online_serving/qwen3_tts.md
@@ -268,7 +268,7 @@ Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/w
## Streaming

Set `stream=true` with `response_format="pcm"` to receive raw PCM audio chunks as they are decoded
(one chunk per 25-frame Code2Wav window):
(one chunk per Code2Wav window, default 25 frames; configurable in the stage config):

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
3 changes: 2 additions & 1 deletion examples/offline_inference/qwen3_tts/README.md
@@ -95,7 +95,8 @@ Add `--streaming` to stream audio chunks progressively via `AsyncOmni` (requires
python end2end.py --query-type CustomVoice --streaming --output-dir /tmp/out_stream
```

Each 25-frame Code2Wav chunk is logged as it arrives. The final WAV file is written once generation
Each Code2Wav chunk is logged as it arrives (default 25 frames; configurable via `codec_chunk_frames`
and `initial_codec_chunk_frames` in the stage config). The final WAV file is written once generation
completes. This demonstrates that audio data is available progressively rather than only at the end.

> **Note:** Streaming uses `AsyncOmni` internally. The non-streaming path (`Omni`) is unchanged.
17 changes: 16 additions & 1 deletion examples/offline_inference/qwen3_tts/end2end.py
@@ -7,6 +7,7 @@
import asyncio
import logging
import os
import time
from typing import Any, NamedTuple

import soundfile as sf
@@ -337,13 +338,27 @@ async def main_streaming(args):

    for i, prompt in enumerate(inputs):
        request_id = str(i)
        t_start = time.perf_counter()
        t_prev = t_start
        chunk_idx = 0
        async for stage_output in omni.generate(prompt, request_id=request_id):
            mm = stage_output.request_output.outputs[0].multimodal_output
            if not stage_output.finished:
                t_now = time.perf_counter()
                audio = mm.get("audio")
                n = len(audio) if isinstance(audio, list) else (0 if audio is None else 1)
                logger.info(f"Request {request_id}: received chunk {n}")
                dt_ms = (t_now - t_prev) * 1000
                ttfa_ms = (t_now - t_start) * 1000
                if chunk_idx == 0:
                    logger.info(f"Request {request_id}: chunk {chunk_idx} samples={n} TTFA={ttfa_ms:.1f}ms")
                else:
                    logger.info(f"Request {request_id}: chunk {chunk_idx} samples={n} inter_chunk={dt_ms:.1f}ms")
                t_prev = t_now
                chunk_idx += 1
            else:
                t_end = time.perf_counter()
                total_ms = (t_end - t_start) * 1000
                logger.info(f"Request {request_id}: done total={total_ms:.1f}ms chunks={chunk_idx}")
                _save_wav(output_dir, request_id, mm)
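
The same timing pattern can be tried without a model. The following sketch is standalone and uses a fake chunk source: `fake_chunks` and `measure_stream` are hypothetical names, not part of `end2end.py`, and the sleep stands in for generation latency.

```python
import asyncio
import time
from typing import Any, AsyncIterator


async def fake_chunks(n: int, delay_s: float) -> AsyncIterator[list]:
    """Stand-in for omni.generate(): yields n dummy audio chunks."""
    for i in range(n):
        await asyncio.sleep(delay_s)
        yield [0.0] * (i + 1)


async def measure_stream(chunks: AsyncIterator[Any]) -> dict:
    """Record time-to-first-audio and the gaps between later chunks, in ms."""
    t_start = time.perf_counter()
    t_prev = t_start
    ttfa_ms = None
    gaps_ms = []
    count = 0
    async for _ in chunks:
        t_now = time.perf_counter()
        if ttfa_ms is None:
            ttfa_ms = (t_now - t_start) * 1000  # first chunk: TTFA
        else:
            gaps_ms.append((t_now - t_prev) * 1000)  # later chunks: inter-chunk gap
        t_prev = t_now
        count += 1
    total_ms = (time.perf_counter() - t_start) * 1000
    return {"ttfa_ms": ttfa_ms, "inter_chunk_ms": gaps_ms, "total_ms": total_ms, "chunks": count}
```

Running `asyncio.run(measure_stream(fake_chunks(3, 0.01)))` returns one TTFA value and two inter-chunk gaps.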


3 changes: 2 additions & 1 deletion examples/online_serving/qwen3_tts/README.md
@@ -250,6 +250,7 @@ Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/w
| `language` | string | "Auto" | Language (see supported languages below) |
| `instructions` | string | "" | Voice style/emotion instructions |
| `max_new_tokens` | int | 2048 | Maximum tokens to generate |
| `initial_codec_chunk_frames` | int | null | Initial chunk size for reduced TTFA (overrides stage config) |
| `stream` | bool | false | Stream raw PCM chunks as they are decoded (requires `response_format="pcm"`) |

**Supported languages:** Auto, Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
@@ -265,7 +266,7 @@ Returns binary audio data with appropriate `Content-Type` header (e.g., `audio/w
## Streaming

Set `stream=true` with `response_format="pcm"` to receive raw PCM audio chunks as they are decoded
(one chunk per 25-frame Code2Wav window):
(one chunk per Code2Wav window, default 25 frames; configurable in the stage config):

```bash
curl -X POST http://localhost:8091/v1/audio/speech \
@@ -4,64 +4,90 @@
from collections import defaultdict
from types import SimpleNamespace

import pytest
import torch

from vllm_omni.model_executor.stage_input_processors.qwen3_tts import talker2code2wav_async_chunk

_FRAME = [1, 2, 3, 4] # 4-codebook frame
_Q = len(_FRAME) # num quantizers

def _req(external_req_id: str, *, finished: bool):

def _req(rid: str, *, finished: bool, initial_codec_chunk_frames: int | None = None):
    ai = None
    if initial_codec_chunk_frames is not None:
        entry = SimpleNamespace(list_data=[initial_codec_chunk_frames])
        ai = SimpleNamespace(entries={"initial_codec_chunk_frames": entry})
    return SimpleNamespace(
        external_req_id=external_req_id,
        external_req_id=rid,
        is_finished=lambda: finished,
        additional_information=ai,
    )


def test_talker2code2wav_async_chunk_does_not_emit_empty_chunk_when_not_finished():
    transfer_manager = SimpleNamespace(
def _tm(*, chunk_frames=25, left_context=25, initial_chunk=0):
    return SimpleNamespace(
        code_prompt_token_ids=defaultdict(list),
        connector=SimpleNamespace(config={"extra": {"codec_chunk_frames": 25, "codec_left_context_frames": 25}}),
        put_req_chunk=defaultdict(int),
        connector=SimpleNamespace(
            config={
                "extra": {
                    "codec_chunk_frames": chunk_frames,
                    "codec_left_context_frames": left_context,
                    "initial_codec_chunk_frames": initial_chunk,
                }
            }
        ),
    )


def _call(tm, rid, *, n_frames, put_req=0, finished=False, req_ic=None):
    tm.code_prompt_token_ids[rid] = [_FRAME[:] for _ in range(n_frames)]
    tm.put_req_chunk[rid] = put_req
    return talker2code2wav_async_chunk(
        transfer_manager=tm,
        pooling_output={"audio_codes": torch.zeros((0,))},
        request=_req(rid, finished=finished, initial_codec_chunk_frames=req_ic),
        is_finished=finished,
    )


def test_does_not_emit_empty_chunk_when_not_finished():
    tm = _tm()
    request = _req("rid-empty", finished=False)

    payload = talker2code2wav_async_chunk(
        transfer_manager=transfer_manager,
        transfer_manager=tm,
        pooling_output={"audio_codes": torch.zeros((0,))},
        request=request,
    )

    assert payload is None


def test_talker2code2wav_async_chunk_flushes_tail_when_finished_without_pooler_output():
    transfer_manager = SimpleNamespace(
        code_prompt_token_ids=defaultdict(list),
        connector=SimpleNamespace(config={"extra": {"codec_chunk_frames": 25, "codec_left_context_frames": 25}}),
    )
    request_id = "rid-tail"
    transfer_manager.code_prompt_token_ids[request_id] = [[1, 2, 3, 4] for _ in range(24)]
    request = _req(request_id, finished=True)
def test_flushes_tail_when_finished_without_pooler_output():
    tm = _tm()
    rid = "rid-tail"
    tm.code_prompt_token_ids[rid] = [_FRAME[:] for _ in range(24)]
    request = _req(rid, finished=True)

    payload = talker2code2wav_async_chunk(
        transfer_manager=transfer_manager,
        pooling_output=None,  # e.g. EOS step with no audio_codes
        transfer_manager=tm,
        pooling_output=None,
        request=request,
    )

    assert payload is not None
    assert payload["finished"].item() is True
    # ctx_frames header + flat codes
    assert len(payload["code_predictor_codes"]) == 1 + 4 * 24
    assert len(payload["code_predictor_codes"]) == _Q * 24


def test_talker2code2wav_async_chunk_emits_eof_marker_when_finished_with_no_frames():
    transfer_manager = SimpleNamespace(
        code_prompt_token_ids=defaultdict(list),
        connector=SimpleNamespace(config={"extra": {"codec_chunk_frames": 25, "codec_left_context_frames": 25}}),
    )
def test_emits_eof_marker_when_finished_with_no_frames():
    tm = _tm()
    request = _req("rid-eof", finished=True)

    payload = talker2code2wav_async_chunk(
        transfer_manager=transfer_manager,
        transfer_manager=tm,
        pooling_output=None,
        request=request,
    )
@@ -70,3 +96,59 @@ def test_talker2code2wav_async_chunk_emits_eof_marker_when_finished_with_no_fram
        "code_predictor_codes": [],
        "finished": torch.tensor(True, dtype=torch.bool),
    }


_CASES = [
    # Normal path (initial=0): emit at chunk_size boundaries
    ((25, 25, 0), (24, 0, False), None),
    ((25, 25, 0), (25, 0, False), (0, 25)),
    # Initial-chunk phase: hold, first emit, second emit
    ((25, 25, 10), (9, 0, False), None),
    ((25, 25, 10), (10, 0, False), (0, 10)),
    ((25, 25, 10), (20, 1, False), (10, 20)),
    # Non-divisible: holds at chunk boundary
    ((25, 25, 12), (25, 2, False), None),
    # Normal phase: offset by initial_coverage (chunk//initial * initial)
    ((25, 25, 10), (45, 2, False), (20, 45)),
    # Second normal emit (offset must stay stable)
    ((25, 25, 10), (70, 3, False), (25, 50)),
    # initial >= chunk clamps to chunk_size (behaves as normal)
    ((25, 25, 30), (25, 0, False), (0, 25)),
    # finished=True flushes IC tail
    ((25, 25, 10), (5, 0, True), (0, 5)),
    # finished=True flushes non-divisible IC residual
    ((25, 25, 12), (25, 2, True), (24, 25)),
    # finished=True flushes normal phase tail
    ((25, 25, 10), (30, 2, True), (20, 30)),
]


@pytest.mark.parametrize("config, state, expected", _CASES)
def test_streaming_decoding_with_variable_initial(config, state, expected):
    chunk_frames, left_context, initial_chunk = config
    n_frames, put_req, finished = state

    tm = _tm(chunk_frames=chunk_frames, left_context=left_context, initial_chunk=initial_chunk)
    payload = _call(tm, "r", n_frames=n_frames, put_req=put_req, finished=finished)

    if expected is None:
        assert payload is None
    else:
        exp_ctx, exp_window = expected
        assert payload is not None
        assert payload["left_context_size"] == exp_ctx
        assert len(payload["code_predictor_codes"]) == _Q * exp_window


def test_per_request_override_activates_initial_phase():
    tm = _tm(initial_chunk=0)
    payload = _call(tm, "r-override", n_frames=10, req_ic=10)
    assert payload is not None
    assert payload["left_context_size"] == 0
    assert len(payload["code_predictor_codes"]) == _Q * 10


def test_per_request_override_wins_over_stage_config():
    tm = _tm(initial_chunk=5)
    payload = _call(tm, "r-override2", n_frames=10, put_req=0, req_ic=15)
    assert payload is None
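
The behaviour the `_CASES` table pins down can be restated as a small pure function. This is a reconstruction inferred from the expected values in the table, not the body of `talker2code2wav_async_chunk`: `plan_emit` and `_last_boundary` are hypothetical names, the sketch ignores `put_req_chunk`, and it assumes the function is called once per accumulated frame count.

```python
from __future__ import annotations


def _last_boundary(n: int, chunk: int, initial: int) -> int:
    """Largest emission boundary <= n (0 if none). Hypothetical helper."""
    step = min(initial, chunk) if initial > 0 else 0  # initial >= chunk clamps to chunk
    coverage = (chunk // step) * step if step else 0  # frames covered by the initial phase
    if step and n <= coverage:
        return (n // step) * step
    return coverage + ((n - coverage) // chunk) * chunk


def plan_emit(
    n_frames: int,
    *,
    chunk: int = 25,
    left_context: int = 25,
    initial: int = 0,
    finished: bool = False,
) -> tuple[int, int] | None:
    """Return (left_context_size, window_frames), or None to keep accumulating."""
    if n_frames <= 0:
        return None  # the EOF-marker case with no frames is out of scope here
    at = _last_boundary(n_frames, chunk, initial)
    if at == n_frames:
        start = _last_boundary(n_frames - 1, chunk, initial)  # previous boundary
    elif finished:
        start = at  # flush the tail accumulated since the last boundary
    else:
        return None
    ctx = min(left_context, start)  # cannot reach back before frame 0
    return ctx, ctx + (n_frames - start)
```

For example, `plan_emit(45, initial=10)` gives `(20, 45)`: the 25 new frames after the 20-frame initial phase, plus all 20 earlier frames as left context.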
5 changes: 5 additions & 0 deletions vllm_omni/entrypoints/openai/protocol/audio.py
@@ -55,6 +55,11 @@ class OpenAICreateSpeechRequest(BaseModel):
        default=None,
        description="Maximum tokens to generate",
    )
    initial_codec_chunk_frames: int | None = Field(
        default=None,
        ge=0,
        description="Initial chunk size for reduced TTFA. Overrides stage config for this request.",
    )

    @field_validator("stream_format")
    @classmethod
3 changes: 3 additions & 0 deletions vllm_omni/entrypoints/openai/serving_speech.py
@@ -468,6 +468,9 @@ def _build_tts_params(self, request: OpenAICreateSpeechRequest) -> dict[str, Any
        else:
            params["max_new_tokens"] = [2048]

        if request.initial_codec_chunk_frames is not None:
            params["initial_codec_chunk_frames"] = [request.initial_codec_chunk_frames]

        # VoiceDesign requires non_streaming_mode (match offline script behaviour).
        # CustomVoice and Base rely on the model default (True and False respectively).
        if params["task_type"][0] == "VoiceDesign":
4 changes: 4 additions & 0 deletions vllm_omni/model_executor/stage_configs/qwen3_tts.yaml
@@ -96,6 +96,10 @@ runtime:
# Align with Omni: small chunks with sufficient context overlap.
codec_chunk_frames: 25
codec_left_context_frames: 25
# First chunk size for reduced TTFA (0 = disabled).
# When > 0, emits small chunks every N frames during the initial phase,
# then switches to codec_chunk_frames cadence.
initial_codec_chunk_frames: 0

edges:
- from: 0
4 changes: 4 additions & 0 deletions vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml
@@ -98,6 +98,10 @@ runtime:
# Align with Omni: small chunks with sufficient context overlap.
codec_chunk_frames: 25
codec_left_context_frames: 25
# First chunk size for reduced TTFA (0 = disabled).
# When > 0, emits small chunks every N frames during the initial phase,
# then switches to codec_chunk_frames cadence.
initial_codec_chunk_frames: 0

edges:
- from: 0
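
How the stage-config default and the per-request value combine can be sketched as below. This is an illustration of the precedence the tests describe (the request value wins when present), not code from the PR: `resolve_initial_chunk_frames` is a hypothetical name, and treating an explicit request value of 0 as "disabled for this request" is an assumption.

```python
from __future__ import annotations


def resolve_initial_chunk_frames(stage_extra: dict, request_value: int | None = None) -> int:
    """Pick the effective initial_codec_chunk_frames (hypothetical helper).

    stage_extra stands for the connector's "extra" config mapping;
    request_value is the per-request field, or None when the client omitted it.
    """
    if request_value is not None:
        value = request_value  # per-request override wins
    else:
        value = stage_extra.get("initial_codec_chunk_frames", 0)  # 0 = disabled
    return max(0, int(value))
```

With the shipped default of `initial_codec_chunk_frames: 0`, a request that omits the field keeps the regular `codec_chunk_frames` cadence.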