[feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load #1714
… load in the code2wav block Signed-off-by: pablo <pablo@agigo.ai>
… feat/qwen3tts-dynamic-ttfc
|
PR from vLLM that we might take as inspiration: vllm-project/vllm#34936 |
Signed-off-by: pablo <pablo@agigo.ai>
…n code2wav load Signed-off-by: pablo <pablo@agigo.ai>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce0c149ca8
in_initial_phase = initial_chunk_size > 0 and initial_chunk_size < chunk_size and length < chunk_size
effective_chunk = initial_chunk_size if in_initial_phase else chunk_size
Keep phase transition from replaying first chunk
When length hits chunk_size, this condition switches out of initial mode and the next emit uses full-chunk cadence with left_context_size=0. If smaller initial chunks were already emitted (for example IC=10 emits at 10 and 20), the call at 25 re-sends frames 1..25, so downstream audio includes duplicated content. With default chunk_size=25, this affects all dynamic IC steps (2/4/8/16) and produces audible replay at the initial→normal boundary.
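The replay described here can be reproduced with a toy model. The sketch below is not the PR's code: `emitted_windows` and its `skip_initial_coverage` flag are hypothetical, and the emit cadence (every IC frames in the initial phase, every `chunk_size` frames afterwards) is an assumption made for illustration only.

```python
def emitted_windows(total_frames, ic, chunk_size, skip_initial_coverage):
    """Toy model: list the (start, end) frame windows emitted as frames accumulate.

    Windows are half-open, so (0, 10) covers frames 0..9.
    """
    windows = []
    initial_coverage = 0  # frames already emitted during the initial phase
    for length in range(1, total_frames + 1):
        in_initial_phase = 0 < ic < chunk_size and length < chunk_size
        if in_initial_phase:
            if length % ic == 0:
                windows.append((length - ic, length))
                initial_coverage = length
        elif length % chunk_size == 0:
            start = length - chunk_size
            if skip_initial_coverage:
                # Do not resend frames that the initial phase already emitted.
                start = max(start, initial_coverage)
            windows.append((start, length))
    return windows


naive = emitted_windows(50, ic=10, chunk_size=25, skip_initial_coverage=False)
offset = emitted_windows(50, ic=10, chunk_size=25, skip_initial_coverage=True)
print(naive)   # the window emitted at length 25 starts again at frame 0
print(offset)  # the window emitted at length 25 starts at frame 20
```

In this model the naive variant emits frames 0..19 twice, while the offset variant emits every frame exactly once.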
if not per_request_override:
    max_ic = max_ic_for_chunk_size(chunk_size)
    active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if 0 < len(v) < chunk_size)
    capacity = getattr(transfer_manager, "scheduler_max_num_seqs", 1)
    initial_chunk_size = compute_dynamic_initial_chunk_size(active, capacity, max_ic)
Choose dynamic IC once per request
This recomputes initial_chunk_size on every call from global load, so a request can change IC mid-stream as concurrency fluctuates. Because chunk emission has no per-request sent-offset state, changing IC resets cadence and causes previously emitted frames to be emitted again (e.g., 2-frame emit first, then load rises and 8-frame emit replays frames 1..8). That makes streamed audio unstable under varying load.
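One possible reading of this suggestion is to remember the IC picked on a request's first call and reuse it afterwards. The sketch below is an illustration under that assumption, not code from the PR: `StickyInitialChunk`, `toy_policy`, and the `release` hook are hypothetical names.

```python
from typing import Callable, Dict


class StickyInitialChunk:
    """Hypothetical helper: remembers the IC chosen on a request's first call."""

    def __init__(self, choose: Callable[[int, int], int]) -> None:
        self._choose = choose  # maps (active, capacity) -> IC
        self._chosen: Dict[str, int] = {}

    def get(self, request_id: str, active: int, capacity: int) -> int:
        # Only consult the current load the first time this request is seen.
        if request_id not in self._chosen:
            self._chosen[request_id] = self._choose(active, capacity)
        return self._chosen[request_id]

    def release(self, request_id: str) -> None:
        # Call when the request finishes so entries do not accumulate.
        self._chosen.pop(request_id, None)


def toy_policy(active: int, capacity: int) -> int:
    # Stand-in policy for the demo only: small IC at low load, larger otherwise.
    return 2 if active * 2 < capacity else 8


sticky = StickyInitialChunk(toy_policy)
first = sticky.get("r", active=1, capacity=8)  # chosen under low load
later = sticky.get("r", active=7, capacity=8)  # load rose, IC is unchanged
```

The trade-off is extra per-request state that has to be released when the request ends.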
|
Hey @JuanPZuluaga, nice work on this, the dynamic IC idea makes a lot of sense. I ran the code locally and tests all pass. Found two things worth looking at:

1. Load estimation misses requests past the initial phase

In

active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if 0 < len(v) < chunk_size)

the … I think this should just be:

active = sum(1 for v in transfer_manager.code_prompt_token_ids.values() if len(v) > 0)

2. Phase transition re-decodes frames that were already sent

The old code tracked an … Concrete example with IC=10, chunk_size=25. I simulated frame-by-frame accumulation locally:
At length=25 the new code switches to normal phase and sends frames 0~24 with |
…oved from IC to main path Signed-off-by: pablo <pablo@agigo.ai>
|
A small issue noticed here:
The cause is a race condition between two threads: … The re-created entry is never cleaned up (the scheduler has already run cleanup), so stale entries accumulate indefinitely and the load estimation would not work at all. The fix is to pop the code_prompt_token_ids entry inside the gate function (talker2code2wav_async_chunk) when …
EDIT: I moved the cleanup to the |
|
Some new results:
Some audio samples for: default/baseline: stream_false_ic_0.wav |
|
Hey @JuanPZuluaga, latest updates look good, stale entry fix and load counting are solid. Audio samples sound clean. One thing worth looking at: IC is recomputed statelessly each call, but … The current tests only cover single-request progression. Would be good to add a test where one request goes through the full IC-to-normal lifecycle while other requests come and go between calls:

def test_ic_stable_across_load_change():
    tm = _tm(max_num_seqs=8)
    p1 = _call(tm, "r", n_frames=2)  # low load, IC=2
    assert p1 is not None
    for i in range(6):
        tm.code_prompt_token_ids[f"other-{i}"] = [[0]] * 10
    # "r" hits normal phase under high load, IC now recomputed as 16
    p2 = _call(tm, "r", n_frames=25)
    assert p2 is not None

That would document the current behavior and catch regressions if the transition logic changes. |
|
@linyueqian that's correct: with the current setup, the IC can change mid-request as the load changes. I added a small comment about the IC changing mid-request and added a test. Let me know if we need another change. |
…ject#1714) Signed-off-by: pablo <pablo@agigo.ai> Co-authored-by: pablo <pablo@agigo.ai> Signed-off-by: lishunyang <lishunyang12@163.com>
Adapt to PR vllm-project#1714 which computes initial_codec_chunk_frames dynamically based on code2wav load. The static config entry is no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
…ject#1714) Signed-off-by: pablo <pablo@agigo.ai> Co-authored-by: pablo <pablo@agigo.ai>


Purpose
When a Qwen3-TTS request starts, the Code2Wav stage must accumulate enough codec tokens before it can emit the first audio chunk. The size of that first batch (initial chunk, IC) directly controls TTFA: smaller IC = faster first audio, but more decode overhead per request.
We introduce dynamic IC sizing: instead of a fixed initial_codec_chunk_frames config value, the IC is selected at runtime from powers of 2 (2, 4, 8, 16, 32) based on current server load. Under light load the IC is small (low TTFA); under heavy load it grows toward the configured max (amortizing decode cost across concurrent requests).

The available IC steps are derived from chunk_size: powers of 2 up to (but not including) chunk_size. For example, chunk_size=25 gives steps [2, 4, 8, 16]; chunk_size=50 gives [2, 4, 8, 16, 32]. We also drop the manual config in yaml.

The load factor is computed as active_requests / max_batch_size and mapped linearly to the available steps. Per-request overrides via additional_information (the field is still initial_codec_chunk_frames) still take full priority, in case users really want to set this value. The chunking logic was also simplified: both phases share the same window calculation, and the normal phase is offset by initial_coverage to avoid replaying frames already emitted during the IC phase.

Test Plan
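One way to sanity-check the step selection described under Purpose is a small standalone sketch. The names `ic_steps` and `pick_initial_chunk` are hypothetical, and the exact rounding used to map the load factor onto a step is an assumption; the PR's own helpers may differ.

```python
def ic_steps(chunk_size):
    """Powers of 2, starting at 2, strictly below chunk_size."""
    steps = []
    step = 2
    while step < chunk_size:
        steps.append(step)
        step *= 2
    return steps


def pick_initial_chunk(active, capacity, chunk_size):
    """Map the load factor active / capacity linearly onto the available steps."""
    steps = ic_steps(chunk_size)
    if not steps:
        return 0  # in this sketch, 0 stands for "no initial phase"
    # Clamp the load factor to [0, 1]; guard against a zero capacity.
    load = min(max(active / max(capacity, 1), 0.0), 1.0)
    # Assumed rounding: floor into len(steps) equal buckets, clamped to the last.
    index = min(int(load * len(steps)), len(steps) - 1)
    return steps[index]


for active in (0, 2, 4, 6, 8):
    print(active, pick_initial_chunk(active, capacity=8, chunk_size=25))
```

With `capacity=8` and `chunk_size=25`, the chosen IC in this sketch rises monotonically from 2 at idle to 16 at full load.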
We can also verify things by:
- TTFA should decrease noticeably compared to a fixed IC baseline.
- IC grows toward max_ic_for_chunk_size(chunk_size), matching the fixed IC behavior.

Test Result