
[feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load#1714

Merged
linyueqian merged 22 commits into
vllm-project:mainfrom
JuanPZuluaga:feat/qwen3tts-dynamic-ttfc
Mar 10, 2026

Conversation

@JuanPZuluaga
Contributor

@JuanPZuluaga JuanPZuluaga commented Mar 6, 2026


Purpose

When a Qwen3-TTS request starts, the Code2Wav stage must accumulate enough codec tokens before it can emit the first audio chunk. The size of that first batch (initial chunk, IC) directly controls time to first audio (TTFA): a smaller IC means faster first audio, but more decode overhead per request.

We introduce dynamic IC sizing: instead of a fixed initial_codec_chunk_frames config value, the IC is selected at runtime from powers of 2 (2, 4, 8, 16, 32) based on current server load. Under light load the IC is small (low TTFA); under heavy load it grows toward the configured max (amortizing decode cost across concurrent requests).

The available IC steps are derived from chunk_size: powers of 2 up to (but not including) chunk_size. For example, chunk_size=25 gives steps [2, 4, 8, 16]; chunk_size=50 gives [2, 4, 8, 16, 32]. We also drop the manual config in the YAML.
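The step derivation described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name initial_chunk_steps is hypothetical, and only the "powers of 2 strictly below chunk_size" rule comes from the description.

```python
def initial_chunk_steps(chunk_size: int) -> list[int]:
    """Return the powers of 2 that are strictly smaller than chunk_size.

    Hypothetical helper: the name and signature are assumptions made for
    illustration; only the rule itself is taken from the PR description.
    """
    steps = []
    step = 2
    while step < chunk_size:
        steps.append(step)
        step *= 2
    return steps


steps_25 = initial_chunk_steps(25)  # [2, 4, 8, 16]
steps_50 = initial_chunk_steps(50)  # [2, 4, 8, 16, 32]
```

Under this rule, a chunk_size of 2 or less yields no steps at all, so a caller would need a fallback for that case.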

The load factor is computed as active_requests / max_batch_size and mapped linearly to the available steps. Per-request overrides via additional_information (the field is still initial_codec_chunk_frames) take full priority, in case users really want to set this value. The chunking logic was also simplified: both phases share the same window calculation, and the normal phase is offset by initial_coverage to avoid replaying frames already emitted during the IC phase.
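One way to read "mapped linearly to the available steps" is sketched below. The function name select_initial_chunk, the use of None for "no override", and the exact rounding (floor of load times the number of steps, clamped to the last step) are all assumptions; the PR may round differently.

```python
from typing import Optional


def select_initial_chunk(
    active_requests: int,
    max_batch_size: int,
    steps: list[int],
    override: Optional[int] = None,
) -> int:
    """Pick an initial chunk size from `steps` based on load.

    Hypothetical sketch: a per-request override wins outright; otherwise
    the load factor is clamped to [0, 1] and mapped onto a step index.
    """
    if override is not None:
        return override
    if not steps:
        return 0  # assumption: 0 means "no initial phase"
    load = active_requests / max(max_batch_size, 1)
    load = min(max(load, 0.0), 1.0)
    index = min(int(load * len(steps)), len(steps) - 1)
    return steps[index]


steps = [2, 4, 8, 16]
light = select_initial_chunk(0, 4, steps)               # smallest step
heavy = select_initial_chunk(4, 4, steps)               # largest step
pinned = select_initial_chunk(4, 4, steps, override=4)  # override wins
```

With this rounding, a server at full capacity always lands on the largest step, which is consistent with the statement that the IC grows toward the configured max under heavy load.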

Test Plan

python -m pytest tests/model_executor/stage_input_processors/test_qwen3_tts_async_chunk.py

We can also verify things by:

  • start the server and send requests at low and high concurrency.
  • under low load, TTFA should decrease noticeably compared to a fixed IC baseline.
  • under high load, IC grows toward max_ic_for_chunk_size(chunk_size), matching the fixed IC behavior.

Test Result



pablo added 2 commits March 6, 2026 10:29
@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Mar 7, 2026

PR from vLLM that we might take as inspiration: vllm-project/vllm#34936

pablo added 7 commits March 7, 2026 21:19
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
…n code2wav load

Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
@JuanPZuluaga
Contributor Author

Some results regarding this approach:

I tested 2 configurations:

  • codec_chunk_frames=17, c17
  • codec_chunk_frames=33, c33

Key observations:

  • At conc=1, both are identical (~100ms): the talker stage is the bottleneck, so IC doesn't matter yet.
  • At conc=4 (reaching max_batch_size), c17 has ~45% lower TTFP (205ms vs 371ms). This is the maximum load we could have, and c33 has steps=[2,4,8,16,32], so its IC goes to 32 instead of 16 for c17.
  • At conc=6-8 the server is overloaded and both degrade.

Also, it is good to note that throughput is similar (~2.5 req/s).

[Figure: comparison]

I can add more results or experiments if needed.

@JuanPZuluaga JuanPZuluaga marked this pull request as ready for review March 8, 2026 19:45

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce0c149ca8


Comment on lines +141 to +142
in_initial_phase = initial_chunk_size > 0 and initial_chunk_size < chunk_size and length < chunk_size
effective_chunk = initial_chunk_size if in_initial_phase else chunk_size

P1: Keep phase transition from replaying first chunk

When length hits chunk_size, this condition switches out of initial mode and the next emit uses full-chunk cadence with left_context_size=0. If smaller initial chunks were already emitted (for example IC=10 emits at 10 and 20), the call at 25 re-sends frames 1..25, so downstream audio includes duplicated content. With default chunk_size=25, this affects all dynamic IC steps (2/4/8/16) and produces audible replay at the initial→normal boundary.


Comment on lines +100 to +104
if not per_request_override:
    max_ic = max_ic_for_chunk_size(chunk_size)
    active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if 0 < len(v) < chunk_size)
    capacity = getattr(transfer_manager, "scheduler_max_num_seqs", 1)
    initial_chunk_size = compute_dynamic_initial_chunk_size(active, capacity, max_ic)

P1: Choose dynamic IC once per request

This recomputes initial_chunk_size on every call from global load, so a request can change IC mid-stream as concurrency fluctuates. Because chunk emission has no per-request sent-offset state, changing IC resets cadence and causes previously emitted frames to be emitted again (e.g., 2-frame emit first, then load rises and 8-frame emit replays frames 1..8). That makes streamed audio unstable under varying load.
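One possible way to address this concern is to pin the IC the first time a request is seen and reuse it afterwards. The sketch below is purely illustrative: the class InitialChunkCache and its methods are hypothetical and are not part of the PR, which (per the later discussion) kept the stateless recomputation and documented it instead.

```python
from typing import Callable


class InitialChunkCache:
    """Remember the initial chunk size chosen for each request.

    Hypothetical helper, not taken from the PR: it only illustrates the
    "choose once per request" idea from the review comment.
    """

    def __init__(self) -> None:
        self._chosen: dict[str, int] = {}

    def get_or_choose(self, request_id: str, choose: Callable[[], int]) -> int:
        # Only call `choose` (the load-based selection) on first sight.
        if request_id not in self._chosen:
            self._chosen[request_id] = choose()
        return self._chosen[request_id]

    def release(self, request_id: str) -> None:
        # Must be called when the request finishes, or entries leak.
        self._chosen.pop(request_id, None)


cache = InitialChunkCache()
first = cache.get_or_choose("r", lambda: 2)   # low load at request start
later = cache.get_or_choose("r", lambda: 8)   # load rose; IC stays pinned
```

The trade-off is extra per-request state that has to be released on completion.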


@linyueqian
Collaborator

linyueqian commented Mar 9, 2026

Hey @JuanPZuluaga, nice work on this, the dynamic IC idea makes a lot of sense. I ran the code locally and tests all pass. Found two things worth looking at:

1. Load estimation misses requests past the initial phase

In qwen3_tts.py:

active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if 0 < len(v) < chunk_size)

The len(v) < chunk_size filter means any request that's already accumulated >= 25 frames won't be counted as "active". In practice, if you have 4 long-running requests (say len=50 each) and a new one comes in, active reports 1 instead of 5, so the new request gets IC=2 (minimum) even though the server is heavily loaded.

I think this should just be:

active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if len(v) > 0)
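The difference between the two filters can be checked with a small stand-alone reproduction of the scenario above (4 long-running requests plus 1 new one). The dictionary here is a plain stand-in for transfer_manager.code_prompt_token_ids, and the two function names are hypothetical.

```python
CHUNK_SIZE = 25


def count_initial_phase_only(buffers: dict[str, list[int]], chunk_size: int) -> int:
    # Original filter: only requests still below chunk_size are counted.
    return sum(1 for v in buffers.values() if 0 < len(v) < chunk_size)


def count_all_active(buffers: dict[str, list[int]]) -> int:
    # Suggested filter: every request with at least one frame is counted.
    return sum(1 for v in buffers.values() if len(v) > 0)


# Stand-in for transfer_manager.code_prompt_token_ids.
buffers = {f"long-{i}": [0] * 50 for i in range(4)}
buffers["new"] = [0]

undercounted = count_initial_phase_only(buffers, CHUNK_SIZE)  # 1
full_count = count_all_active(buffers)                        # 5
```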

2. Phase transition re-decodes frames that were already sent

The old code tracked an initial_coverage offset so the normal phase picks up where the initial phase left off. The new stateless logic drops this, which causes overlap at the transition boundary.

Concrete example with IC=10, chunk_size=25. I simulated frame by frame accumulation locally:

| Length | Old behavior | New behavior |
|--------|--------------|--------------|
| 10 | emit frames 0~9, lc=0 | emit frames 0~9, lc=0 |
| 20 | emit frames 0~19, lc=10 | emit frames 0~19, lc=10 |
| 25 | hold | emit frames 0~24, lc=0 |
| 45 | emit frames 0~44, lc=20 | hold |
| 50 | hold | emit frames 0~49, lc=25 |

At length=25 the new code switches to normal phase and sends frames 0~24 with left_context_size=0. But the decoder already processed chunks covering frames 0~19 during the initial phase. That's redundant decode work and might affect audio quality depending on how the decoder handles it.
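The table can be reproduced with a small frame-by-frame simulation. This is a sketch under assumptions, not the repository's code: the function names are hypothetical, each emit is modeled as the span of frames decoded as new (excluding left context), and initial_coverage is assumed to be the largest multiple of IC below chunk_size, which matches the IC=10, chunk_size=25 example but may not match the real formula in other cases.

```python
def stateless_emits(ic: int, chunk_size: int, total: int) -> list[tuple[int, int]]:
    """(first_new_frame, length) per emit for the stateless rule. Assumes ic > 0."""
    emits = []
    for length in range(1, total + 1):
        in_initial = 0 < ic < chunk_size and length < chunk_size
        effective = ic if in_initial else chunk_size
        if length % effective == 0:
            emits.append((length - effective, length))
    return emits


def offset_emits(ic: int, chunk_size: int, total: int) -> list[tuple[int, int]]:
    """Same, but the normal phase is offset by the initial coverage."""
    # Assumption: coverage is the largest multiple of ic below chunk_size.
    coverage = ((chunk_size - 1) // ic) * ic
    emits = []
    for length in range(1, total + 1):
        if length <= coverage:
            if length % ic == 0:
                emits.append((length - ic, length))
        elif (length - coverage) % chunk_size == 0:
            emits.append((length - chunk_size, length))
    return emits


def replayed_frames(emits: list[tuple[int, int]]) -> int:
    """Count frames decoded as new more than once."""
    sent_up_to = 0
    replayed = 0
    for start, end in emits:
        replayed += max(0, sent_up_to - start)
        sent_up_to = max(sent_up_to, end)
    return replayed


new_behavior = stateless_emits(10, 25, 50)  # emits at lengths 10, 20, 25, 50
old_behavior = offset_emits(10, 25, 50)     # emits at lengths 10, 20, 45
```

In this model the stateless variant re-decodes 20 frames at the length=25 boundary, while the offset variant re-decodes none.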

pablo added 2 commits March 9, 2026 05:42
Signed-off-by: pablo <pablo@agigo.ai>
@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Mar 9, 2026

A small issue noticed here:

code_prompt_token_ids was always growing monotonically, so the measured "load" of Code2Wav always reached its peak within the first few requests.

The cause is a race condition between two threads. The scheduler calls save_async() (in omni_ar_scheduler, which queues the last task) and then immediately calls cleanup() (which pops code_prompt_token_ids[request_id]).
The save_loop thread (in chunk_transfer_adapter) later processes that queued task, which accesses transfer_manager.code_prompt_token_ids[request_id], and since it's a defaultdict(list), this recreates the entry.

The re-created entry is never cleaned up (the scheduler has already run cleanup), so stale entries accumulate indefinitely and the load estimate wouldn't work at all.

The fix is to pop the code_prompt_token_ids entry inside the gate function (talker2code2wav_async_chunk) when finished=True.

EDIT: I moved the cleanup to chunk_transfer_adapter.py:_update_request_payload()
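The defaultdict behavior behind this bug can be reproduced in a single thread. The snippet below only demonstrates the standard-library semantics; the variable names mirror the discussion, but it is not the project's code and it does not show the actual fix, which moved the cleanup.

```python
from collections import defaultdict

# Stand-in for transfer_manager.code_prompt_token_ids.
code_prompt_token_ids = defaultdict(list)
code_prompt_token_ids["req-1"].extend([11, 12, 13])

# The scheduler's cleanup() removes the entry.
code_prompt_token_ids.pop("req-1", None)
removed = "req-1" not in code_prompt_token_ids

# A late read by another code path uses [] indexing, which silently
# re-creates an empty entry on a defaultdict.
late_read = code_prompt_token_ids["req-1"]
stale = "req-1" in code_prompt_token_ids

# By contrast, .get() never inserts a missing key.
code_prompt_token_ids.pop("req-1", None)
safe_read = code_prompt_token_ids.get("req-1", [])
recreated_by_get = "req-1" in code_prompt_token_ids
```

Here removed is True, stale is True, and recreated_by_get is False, which is why any access after cleanup has to avoid plain indexing or the cleanup has to run after the last access.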

pablo added 4 commits March 9, 2026 07:50
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
@JuanPZuluaga
Contributor Author

Some new results:

[Figure: comparison]
  • TTFP/TTFA/TTFC are very similar under low load (conc=1-2).
  • Beyond that, latency starts to grow more significantly for c33 than for c17, due to the load on the server.

Some audio samples for:

SAMPLE_TEXT="Once upon a time in a small coastal town, there lived a lighthouse keeper \
named Thomas who had watched over the waters for more than thirty years. Every evening at \
sunset, he would climb the winding staircase to light the great lamp, and every morning he \
would extinguish it as the sun rose over the horizon. The townspeople relied on his steady \
presence, for countless ships had found safe passage thanks to his dedication."

default/baseline: stream_false_ic_0.wav
ic not set, load balancing: stream_true_ic_0.wav
ic set in the request: stream_true_ic_4.wav

@linyueqian
Collaborator

Hey @JuanPZuluaga, latest updates look good, stale entry fix and load counting are solid. Audio samples sound clean.

One thing worth looking at: IC is recomputed statelessly each call, but initial_coverage assumes the same IC was used throughout. If load changes mid-request, the IC at transition time can differ from what was actually used during the initial phase. For example, request starts at low load with IC=2, emits frames 0-1, then load spikes and IC becomes 8, so initial_coverage computes as 16 instead of 2. The decoder seems to handle this fine based on the samples, but it's worth being aware of.

The current tests only cover single-request progression. Would be good to add a test where one request goes through the full IC-to-normal lifecycle while other requests come and go between calls:

def test_ic_stable_across_load_change():
    tm = _tm(max_num_seqs=8)
    p1 = _call(tm, "r", n_frames=2)  # low load, IC=2
    assert p1 is not None

    for i in range(6):
        tm.code_prompt_token_ids[f"other-{i}"] = [[0]] * 10

    # "r" hits normal phase under high load, IC now recomputed as 16
    p2 = _call(tm, "r", n_frames=25)
    assert p2 is not None

That would document the current behavior and catch regressions if the transition logic changes.

pablo added 3 commits March 9, 2026 22:20
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
@JuanPZuluaga
Contributor Author

@linyueqian that's correct, with the current setting the IC can change mid-request as the load changes. I added a short comment about the IC changing mid-request and a test, test_ic_load_change_mid_request, where the IC changes mid-request.

Let me know if we need another change.

@linyueqian linyueqian added the ready label to trigger buildkite CI label Mar 10, 2026
Collaborator

@linyueqian linyueqian left a comment


LGTM

@linyueqian linyueqian merged commit 1d0a97f into vllm-project:main Mar 10, 2026
6 of 7 checks passed
lishunyang12 pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 11, 2026
…ject#1714)

Signed-off-by: pablo <pablo@agigo.ai>
Co-authored-by: pablo <pablo@agigo.ai>
Signed-off-by: lishunyang <lishunyang12@163.com>
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Mar 11, 2026
Adapt to PR vllm-project#1714 which computes initial_codec_chunk_frames
dynamically based on code2wav load. The static config entry
is no longer needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
@JuanPZuluaga JuanPZuluaga deleted the feat/qwen3tts-dynamic-ttfc branch March 16, 2026 12:34
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…ject#1714)

Signed-off-by: pablo <pablo@agigo.ai>
Co-authored-by: pablo <pablo@agigo.ai>
