[feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load by JuanPZuluaga · Pull Request #1714 · vllm-project/vllm-omni

JuanPZuluaga · 2026-03-06T10:32:44Z

Purpose

When a Qwen3-TTS request starts, the Code2Wav stage must accumulate enough codec tokens before it can emit the first audio chunk. The size of that first batch (initial chunk, IC) directly controls TTFA: smaller IC = faster first audio, but more decode overhead per request.

We introduce dynamic IC sizing: instead of a fixed initial_codec_chunk_frames config value, the IC is selected at runtime from powers of 2 (2, 4, 8, 16, 32) based on current server load. Under light load the IC is small (low TTFA); under heavy load it grows toward the configured max (amortizing decode cost across concurrent requests).

The available IC steps are derived from chunk_size: powers of 2 up to (but not including) chunk_size. For example, chunk_size=25 gives steps [2, 4, 8, 16]; chunk_size=50 gives [2, 4, 8, 16, 32]. We also drop the manual config entry from the YAML.
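
The step derivation described above can be sketched as follows. This is a minimal illustration rather than the PR's code: ic_steps is a hypothetical helper name, and the only behavior assumed is what the description states (powers of 2, starting at 2, strictly below chunk_size).

```python
def ic_steps(chunk_size: int) -> list[int]:
    """Hypothetical helper: powers of 2 from 2 up to (not including) chunk_size."""
    steps = []
    step = 2
    while step < chunk_size:
        steps.append(step)
        step *= 2
    return steps


print(ic_steps(25))  # [2, 4, 8, 16]
print(ic_steps(50))  # [2, 4, 8, 16, 32]
```

A chunk_size of 2 or less yields an empty list under this sketch, in which case there is no smaller initial chunk to choose from.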

The load factor is computed as active_requests / max_batch_size and mapped linearly to the available steps. Per-request overrides via additional_information (the field is still initial_codec_chunk_frames) take full priority, in case users really want to set this value. The chunking logic was also simplified: both phases share the same window calculation, and the normal phase is offset by initial_coverage to avoid replaying frames already emitted during the IC phase.
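
A minimal sketch of the load-to-step mapping, under stated assumptions: pick_initial_chunk and its parameters are hypothetical names, and the exact rounding used to turn the load factor into a step index is an assumption (the description only says the mapping is linear). The per-request override short-circuits the load-based choice.

```python
from typing import Optional


def pick_initial_chunk(
    active: int,
    capacity: int,
    steps: list[int],
    override: Optional[int] = None,
) -> int:
    """Hypothetical sketch: choose an initial chunk size from `steps` by load."""
    # A per-request override takes full priority over the load-based choice.
    if override is not None and override > 0:
        return override
    if not steps:
        return 0
    # Load factor in [0, 1]; capacity is clamped to avoid division by zero.
    load = min(max(active / max(capacity, 1), 0.0), 1.0)
    # Assumed rounding: scale the load onto the step list and clamp the index.
    index = min(int(load * len(steps)), len(steps) - 1)
    return steps[index]


steps = [2, 4, 8, 16]
print(pick_initial_chunk(0, 4, steps))              # 2
print(pick_initial_chunk(2, 4, steps))              # 8
print(pick_initial_chunk(4, 4, steps))              # 16
print(pick_initial_chunk(4, 4, steps, override=4))  # 4
```

With this rounding, an idle server gets the smallest step and a full batch gets the largest; any other monotonic mapping would serve the same purpose.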

Test Plan

python -m pytest tests/model_executor/stage_input_processors/test_qwen3_tts_async_chunk.py

We can also verify the behavior manually:

Start the server and send requests at low and high concurrency.
Under low load, TTFA should decrease noticeably compared to a fixed-IC baseline.
Under high load, IC grows toward max_ic_for_chunk_size(chunk_size), matching the fixed-IC behavior.

Test Result



JuanPZuluaga · 2026-03-07T07:51:59Z

PR from vLLM that we might take as inspiration: vllm-project/vllm#34936


JuanPZuluaga · 2026-03-08T19:45:34Z

Some results for this approach:

I tested two configurations:

codec_chunk_frames=17 (c17)
codec_chunk_frames=33 (c33)

Key observations:

At conc=1, both are identical (~100ms): the talker stage is the bottleneck, so IC doesn't matter yet.
At conc=4 (reaching max_batch_size), c17 has ~45% lower TTFP (205ms vs 371ms). This is the maximum load we can have, and c33 has steps=[2,4,8,16,32], so its IC goes to 32 instead of 16 for c17.
At conc=6-8 the server is overloaded and both degrade.

Also, it is good to note that throughput is similar (~2.5 req/s).

I can add more results or experiments if needed.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce0c149ca8


chatgpt-codex-connector · 2026-03-08T19:51:06Z

+    in_initial_phase = initial_chunk_size > 0 and initial_chunk_size < chunk_size and length < chunk_size
+    effective_chunk = initial_chunk_size if in_initial_phase else chunk_size


Keep phase transition from replaying first chunk

When length hits chunk_size, this condition switches out of initial mode and the next emit uses full-chunk cadence with left_context_size=0. If smaller initial chunks were already emitted (for example IC=10 emits at 10 and 20), the call at 25 re-sends frames 1..25, so downstream audio includes duplicated content. With default chunk_size=25, this affects all dynamic IC steps (2/4/8/16) and produces audible replay at the initial→normal boundary.


chatgpt-codex-connector · 2026-03-08T19:51:06Z

+    if not per_request_override:
+        max_ic = max_ic_for_chunk_size(chunk_size)
+        active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if 0 < len(v) < chunk_size)
+        capacity = getattr(transfer_manager, "scheduler_max_num_seqs", 1)
+        initial_chunk_size = compute_dynamic_initial_chunk_size(active, capacity, max_ic)


Choose dynamic IC once per request

This recomputes initial_chunk_size on every call from global load, so a request can change IC mid-stream as concurrency fluctuates. Because chunk emission has no per-request sent-offset state, changing IC resets cadence and causes previously emitted frames to be emitted again (e.g., 2-frame emit first, then load rises and 8-frame emit replays frames 1..8). That makes streamed audio unstable under varying load.
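
One way to read this suggestion is to compute the IC on a request's first call and reuse it until the request finishes. The sketch below is only an illustration of that idea: PinnedIC, get, and release are hypothetical names and are not taken from the PR.

```python
from typing import Callable


class PinnedIC:
    """Hypothetical per-request cache: the IC is computed once and then reused."""

    def __init__(self) -> None:
        self._ic_by_request: dict[str, int] = {}

    def get(self, request_id: str, compute: Callable[[], int]) -> int:
        # Only the first call for a request consults the current load.
        if request_id not in self._ic_by_request:
            self._ic_by_request[request_id] = compute()
        return self._ic_by_request[request_id]

    def release(self, request_id: str) -> None:
        # Drop the pinned value when the request finishes.
        self._ic_by_request.pop(request_id, None)


pinned = PinnedIC()
print(pinned.get("r", lambda: 2))  # 2 (computed under low load)
print(pinned.get("r", lambda: 8))  # 2 (load rose, but the pinned value is reused)
pinned.release("r")
```

The trade-off is one extra piece of per-request state that has to be released on completion, in exchange for a stable emit cadence within a request.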


linyueqian · 2026-03-09T01:23:13Z

Hey @JuanPZuluaga, nice work on this, the dynamic IC idea makes a lot of sense. I ran the code locally and tests all pass. Found two things worth looking at:

1. Load estimation misses requests past the initial phase

In qwen3_tts.py:

active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if 0 < len(v) < chunk_size)

The len(v) < chunk_size filter means any request that's already accumulated >= 25 frames won't be counted as "active". In practice, if you have 4 long-running requests (say len=50 each) and a new one comes in, active reports 1 instead of 5, so the new request gets IC=2 (minimum) even though the server is heavily loaded.

I think this should just be:

active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if len(v) > 0)
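
To make the difference concrete, here is a small reproduction of the scenario above using a plain dict in place of transfer_manager.code_prompt_token_ids (the request IDs and frame contents are made up for illustration):

```python
chunk_size = 25

# Four long-running requests (50 frames each) plus one that has just started.
code_prompt_token_ids = {f"long-{i}": [0] * 50 for i in range(4)}
code_prompt_token_ids["new"] = [0]

# Current filter: only requests still below chunk_size are counted.
active_filtered = sum(
    1 for v in code_prompt_token_ids.values() if 0 < len(v) < chunk_size
)
# Proposed filter: every request with at least one frame is counted.
active_all = sum(1 for v in code_prompt_token_ids.values() if len(v) > 0)

print(active_filtered, active_all)  # 1 5
```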

2. Phase transition re-decodes frames that were already sent

The old code tracked an initial_coverage offset so the normal phase picks up where the initial phase left off. The new stateless logic drops this, which causes overlap at the transition boundary.

Concrete example with IC=10, chunk_size=25. I simulated frame by frame accumulation locally:

Length	Old behavior	New behavior
10	emit frames 0~9, lc=0	emit frames 0~9, lc=0
20	emit frames 0~19, lc=10	emit frames 0~19, lc=10
25	hold	emit frames 0~24, lc=0
45	emit frames 0~44, lc=20	hold
50	hold	emit frames 0~49, lc=25

At length=25 the new code switches to normal phase and sends frames 0~24 with left_context_size=0. But the decoder already processed chunks covering frames 0~19 during the initial phase. That's redundant decode work and might affect audio quality depending on how the decoder handles it.
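
The emit points in the table can be reproduced with a small simulation. This is a simplified model with hypothetical names: it tracks only the lengths at which a chunk is emitted (not left_context_size), and it takes the initial-phase coverage as an explicit offset parameter (20 in this example) rather than deriving it.

```python
def emit_lengths(ic: int, chunk_size: int, max_len: int, offset: int) -> list[int]:
    """Lengths at which a chunk is emitted in this simplified model.

    offset=0 models the stateless logic; a non-zero offset models a normal
    phase that resumes from the frames already covered by the initial phase.
    """
    emitted = []
    for length in range(1, max_len + 1):
        if length < chunk_size:
            # Initial phase: emit every `ic` frames.
            if length % ic == 0:
                emitted.append(length)
        elif (length - offset) % chunk_size == 0:
            # Normal phase: emit every `chunk_size` frames past the offset.
            emitted.append(length)
    return emitted


print(emit_lengths(ic=10, chunk_size=25, max_len=50, offset=20))  # [10, 20, 45]
print(emit_lengths(ic=10, chunk_size=25, max_len=50, offset=0))   # [10, 20, 25, 50]
```

The extra emit at length 25 in the second line is the overlap described above: frames 0~19 were already sent during the initial phase.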


JuanPZuluaga · 2026-03-09T07:45:03Z

A small issue I noticed here:

code_prompt_token_ids was always growing monotonically, so the "load" of code2wav always reached its peak within the first few requests.

The cause is a race condition between two threads. The scheduler calls save_async() (in omni_ar_scheduler, which queues the last task) and then immediately calls cleanup() (which pops code_prompt_token_ids[request_id]).
The save_loop thread (in chunk_transfer_adapter) later processes that queued task, which accesses transfer_manager.code_prompt_token_ids[request_id]; since it's a defaultdict(list), this recreates the entry.

The re-created entry is never cleaned up (the scheduler already ran cleanup), so stale entries accumulate indefinitely and the load estimate doesn't work at all.

The fix is to pop the code_prompt_token_ids entry inside the gate function (talker2code2wav_async_chunk) when finished=True.

EDIT: I moved the cleanup to chunk_transfer_adapter.py:_update_request_payload()
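
The defaultdict behavior behind this can be reproduced in a few lines. This is a single-threaded illustration of the ordering described above (cleanup first, late access second); the request ID is made up, and the real code runs these steps on two different threads.

```python
from collections import defaultdict

code_prompt_token_ids = defaultdict(list)
code_prompt_token_ids["req-1"].extend([1, 2, 3])

# Scheduler side: cleanup() pops the entry once the request is done.
code_prompt_token_ids.pop("req-1", None)
removed_after_cleanup = "req-1" not in code_prompt_token_ids

# save_loop side, later: a plain subscript read silently recreates the entry.
_ = code_prompt_token_ids["req-1"]
recreated = "req-1" in code_prompt_token_ids

# Popping again at the last point that touches the entry removes the stale key.
code_prompt_token_ids.pop("req-1", None)
stale_entries = len(code_prompt_token_ids)

print(removed_after_cleanup, recreated, stale_entries)  # True True 0
```

Any subscript read on a defaultdict inserts the missing key, which is why the cleanup has to happen after the last access rather than before it.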


JuanPZuluaga · 2026-03-09T08:27:26Z

Some new results:

TTFP/TTFA/TTFC are very similar under low load (conc=1-2).
They start to grow more significantly for c33 than for c17 as concurrency increases; this is due to the load on the server.

Some audio samples for:

SAMPLE_TEXT="Once upon a time in a small coastal town, there lived a lighthouse keeper \
named Thomas who had watched over the waters for more than thirty years. Every evening at \
sunset, he would climb the winding staircase to light the great lamp, and every morning he \
would extinguish it as the sun rose over the horizon. The townspeople relied on his steady \
presence, for countless ships had found safe passage thanks to his dedication."

default/baseline: stream_false_ic_0.wav
ic not set, load balancing: stream_true_ic_0.wav
ic set in the request: stream_true_ic_4.wav


linyueqian · 2026-03-09T16:04:28Z

Hey @JuanPZuluaga, latest updates look good, stale entry fix and load counting are solid. Audio samples sound clean.

One thing worth looking at: IC is recomputed statelessly each call, but initial_coverage assumes the same IC was used throughout. If load changes mid-request, the IC at transition time can differ from what was actually used during the initial phase. For example, request starts at low load with IC=2, emits frames 0-1, then load spikes and IC becomes 8, so initial_coverage computes as 16 instead of 2. The decoder seems to handle this fine based on the samples, but it's worth being aware of.

The current tests only cover single-request progression. Would be good to add a test where one request goes through the full IC-to-normal lifecycle while other requests come and go between calls:

def test_ic_stable_across_load_change():
    tm = _tm(max_num_seqs=8)
    p1 = _call(tm, "r", n_frames=2)  # low load, IC=2
    assert p1 is not None

    for i in range(6):
        tm.code_prompt_token_ids[f"other-{i}"] = [[0]] * 10

    # "r" hits normal phase under high load, IC now recomputed as 16
    p2 = _call(tm, "r", n_frames=25)
    assert p2 is not None

That would document the current behavior and catch regressions if the transition logic changes.


JuanPZuluaga · 2026-03-09T22:28:35Z

@linyueqian that's correct, with the current setting the IC can change mid-request as the load changes. I added a small comment about the IC changing mid-request and a test, test_ic_load_change_mid_request, where the IC changes mid-request.

Let me know if we need another change.


linyueqian

LGTM


…ject#1714) Signed-off-by: pablo <pablo@agigo.ai> Co-authored-by: pablo <pablo@agigo.ai> Signed-off-by: lishunyang <lishunyang12@163.com>

Adapt to PR vllm-project#1714 which computes initial_codec_chunk_frames dynamically based on code2wav load. The static config entry is no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>

…ject#1714) Signed-off-by: pablo <pablo@agigo.ai> Co-authored-by: pablo <pablo@agigo.ai>

pablo added 2 commits March 6, 2026 10:29

simple implementation of dynamic first chunk to reduce TTFA, based on…

d01e830

… load in the code2wav block Signed-off-by: pablo <pablo@agigo.ai>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

fde83c2

… feat/qwen3tts-dynamic-ttfc

pablo added 7 commits March 7, 2026 21:19

updated logic, simpler overall

edcbd4b

Signed-off-by: pablo <pablo@agigo.ai>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

3b8f7c3

… feat/qwen3tts-dynamic-ttfc

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

c80e5f7

… feat/qwen3tts-dynamic-ttfc

update tests

1be9168

Signed-off-by: pablo <pablo@agigo.ai>

update compute actives req

f499d23

Signed-off-by: pablo <pablo@agigo.ai>

remove initial_codec_chunk_frames from yaml, as we compute ic based o…

d7e1308

…n code2wav load Signed-off-by: pablo <pablo@agigo.ai>

update tests

ce0c149

Signed-off-by: pablo <pablo@agigo.ai>

JuanPZuluaga marked this pull request as ready for review March 8, 2026 19:45

JuanPZuluaga requested a review from hsliuustc0106 as a code owner March 8, 2026 19:45

chatgpt-codex-connector Bot reviewed Mar 8, 2026


pablo added 2 commits March 9, 2026 05:42

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

a0e8c7a

… feat/qwen3tts-dynamic-ttfc

update qwen3_tts to include in load balancing already requests that m…

31d43a0

…oved from IC to main path Signed-off-by: pablo <pablo@agigo.ai>

JuanPZuluaga mentioned this pull request Mar 9, 2026

[Test] Add Qwen3-TTS nightly performance benchmark #1700

Merged

fix: pop code_prompt_token_ids so load balancing works!

74b4a54

Signed-off-by: pablo <pablo@agigo.ai>

pablo added 4 commits March 9, 2026 07:50

update docs with dynamic IC and examples

7e7cbeb

Signed-off-by: pablo <pablo@agigo.ai>

update tests

0d82b99

Signed-off-by: pablo <pablo@agigo.ai>

move code_prompt_token_ids cleanup to transfer_adapter

1f48227

Signed-off-by: pablo <pablo@agigo.ai>

cleanup yaml

6cf16b4

Signed-off-by: pablo <pablo@agigo.ai>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

63ee447

… feat/qwen3tts-dynamic-ttfc

pablo added 3 commits March 9, 2026 22:20

add change mid-request test

93f8713

Signed-off-by: pablo <pablo@agigo.ai>

comment in initial_coverage path

8e37ab7

Signed-off-by: pablo <pablo@agigo.ai>

merge main

0a0c8b6

Signed-off-by: pablo <pablo@agigo.ai>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

d3f076d

… feat/qwen3tts-dynamic-ttfc

linyueqian added the ready label to trigger buildkite CI label Mar 10, 2026

linyueqian approved these changes Mar 10, 2026


linyueqian enabled auto-merge (squash) March 10, 2026 17:31

linyueqian mentioned this pull request Mar 10, 2026

[RFC]: TTS Development Roadmap - March 2026 #1795

Open

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

e813002

… feat/qwen3tts-dynamic-ttfc

JuanPZuluaga mentioned this pull request Mar 10, 2026

[Feat][Qwen3TTS][Code2wav] triton SnakeBeta and Cuda Graph #1797

Merged

5 tasks

linyueqian merged commit 1d0a97f into vllm-project:main Mar 10, 2026
6 of 7 checks passed

JuanPZuluaga deleted the feat/qwen3tts-dynamic-ttfc branch March 16, 2026 12:34

linyueqian mentioned this pull request Mar 16, 2026

[Bug]: Qwen 3 TTS Generating Incorrect Audio #1861

Closed

1 task

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load (vllm-pro…

9f4c9f9

…ject#1714) Signed-off-by: pablo <pablo@agigo.ai> Co-authored-by: pablo <pablo@agigo.ai>

linyueqian merged 22 commits into vllm-project:main from JuanPZuluaga:feat/qwen3tts-dynamic-ttfc
