[Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL #1731

Merged
linyueqian merged 3 commits into vllm-project:main from Sy0307:fix/qwen3-tts-base-icl-refcode-context on Mar 9, 2026

Conversation

Sy0307 (Contributor) commented Mar 8, 2026

Purpose

Fix noisy first-chunk audio for Qwen3-TTS Base ICL in the multi-stage pipeline.

The official offline Qwen3-TTS Base voice-cloning path decodes ref_code + generated_codes and trims the reference prefix from the final waveform. That gives Code2Wav the same acoustic prefix context used by ICL prompt construction.
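
As a rough illustration of the offline behavior described above, the sketch below decodes `ref_code + generated_codes` in one pass and then trims the reference portion from the waveform. Everything in it is an assumption for illustration: `toy_code2wav` is a hypothetical stand-in for the neural decoder, and `UPSAMPLE` is a made-up samples-per-frame factor. The only point is that a decoder whose output depends on earlier frames produces a different first chunk when the prefix is missing.

```python
from typing import List

UPSAMPLE = 4  # hypothetical samples-per-frame factor, not the real model value


def toy_code2wav(codes: List[int]) -> List[float]:
    """Toy stand-in for Code2Wav: each frame's samples depend on the current
    code and the previous one, so the very first frame has no left context."""
    wav: List[float] = []
    prev = 0  # nothing precedes the first frame
    for code in codes:
        wav.extend([(code + prev) / 2.0] * UPSAMPLE)
        prev = code
    return wav


def decode_with_ref_prefix(ref_code: List[int], generated: List[int]) -> List[float]:
    """Decode ref_code + generated together, then cut the reference samples."""
    wav = toy_code2wav(ref_code + generated)
    cut = len(ref_code) * UPSAMPLE
    return wav[cut:]


ref_code = [10, 12]
generated = [14, 16]

without_ctx = toy_code2wav(generated)                 # first frame decoded "cold"
with_ctx = decode_with_ref_prefix(ref_code, generated)  # first frame sees ref_code
```

Both outputs have the same length, but only the first frame differs: with the prefix, the first generated frame is decoded against the last reference frame instead of against silence.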

The multi-stage pipeline was not preserving that behavior:

  • Talker used ref_code when building the Base ICL prompt
  • but Stage-1 only received generated audio_codes
  • so Code2Wav decoded the first chunk without the reference codec prefix context

This showed up as noisy / unstable audio at the beginning of Base ICL outputs. x_vector_only_mode=True was unaffected because that mode only conditions on speaker embedding and does not rely on ref_code as decoder-side prefix context.

This PR restores the missing decoder context by:

  • preserving ref_code in the talker runtime/intermediate output for Base ICL
  • caching ref_code at request scope until the first Code2Wav chunk is emitted
  • prepending ref_code to the first Code2Wav input window
  • setting trim context so the prepended reference portion is removed from the final audio
  • applying the same fix to both async-chunk and non-async paths

Implementation note:

  • the async path now follows the same request-scoped state pattern used by qwen3_omni
  • instead of relying on request-side CPU side channels, the processor stores ref_code in transfer_manager.request_payload[request_id] until the first chunk is actually emitted
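
The request-scoped buffering described above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the PR's code: `ToyTransferManager` and `process_step` are hypothetical names, codes are flat integer lists rather than codec tensors, and only the `request_payload[request_id]` dictionary mirrors what the description mentions.

```python
from typing import Dict, List, Optional, Tuple


class ToyTransferManager:
    """Hypothetical stand-in; only the request_payload dict follows the PR text."""

    def __init__(self) -> None:
        self.request_payload: Dict[str, List[int]] = {}


def process_step(
    tm: ToyTransferManager,
    request_id: str,
    pending_codes: List[int],
    ref_code: Optional[List[int]],
    flush: bool,
) -> Optional[Tuple[List[int], int]]:
    """Return (codes_to_decode, left_context_size) when a chunk is flushed."""
    if ref_code:
        # ref_code may arrive on an earlier step than the first flush: cache it.
        tm.request_payload[request_id] = list(ref_code)
    if not flush:
        return None
    # pop() ensures the prefix is prepended to the first emitted chunk only.
    prefix = tm.request_payload.pop(request_id, [])
    return prefix + pending_codes, len(prefix)


tm = ToyTransferManager()
step1 = process_step(tm, "req-0", [], ref_code=[1, 2], flush=False)
step2 = process_step(tm, "req-0", [5, 6, 7], ref_code=None, flush=True)
step3 = process_step(tm, "req-0", [8, 9], ref_code=None, flush=True)
```

Here `step1` buffers without emitting, `step2` emits the first chunk with the reference prefix and a matching trim length, and `step3` emits a plain chunk with no prefix.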

Root Cause Analysis

This PR follows up on the root-cause discussion in the comments on PR #1719. Thanks, @iancarrasco-b10 :)

The issue was not caused by WebSocket transport or async chunk scheduling itself. The underlying problem was that Base ICL lost the decoder-side reference codec prefix when going through the multi-stage pipeline:

  • the talker still used ref_code to build the Base ICL prompt
  • but the downstream Code2Wav stage only received generated audio_codes
  • therefore the first decoded chunk no longer had the same acoustic prefix context as the official offline path

That mismatch explains why:

  • Base ICL showed noisy / unstable audio at the beginning
  • x_vector_only_mode=True did not show the same issue, because it only uses speaker embedding and does not require ref_code to be prepended before Code2Wav decoding
  • async chunking still needs request-scoped buffering, because the step where ref_code first appears is not guaranteed to be the same step where the first chunk is flushed

Test Plan

Run targeted stage input processor tests:

python -m pytest tests/model_executor/stage_input_processors/test_qwen3_tts_async_chunk.py -q

Test Result

Passed locally on this branch:

21 passed

PTAL @linyueqian

Sy0307 requested a review from hsliuustc0106 as a code owner on March 8, 2026, 17:56

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e4caa8692a


Comment on lines +494 to +495
if any(ref_code.numel() > 0 for ref_code in ref_code_prompt_list):
    mm["ref_code"] = ref_code_prompt_list

P1: Keep ref_code per request instead of list broadcasting

make_omni_output now writes ref_code as a Python list, but the AR runner path that builds per-request payloads treats list-valued multimodal outputs as v[0] for every request (gpu_ar_model_runner.py), so concurrent Base ICL requests will all receive the first request’s ref_code and decode with the wrong prompt context. This silently corrupts speaker/context conditioning for batched requests and was introduced by adding mm["ref_code"] as a list here.
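
To make the reported failure mode concrete, here is a small sketch contrasting list broadcasting with per-request indexing. Both helpers are hypothetical models written for this illustration; they are not the code in gpu_ar_model_runner.py, and plain lists stand in for tensors.

```python
from typing import Any, Dict, List


def split_broadcast(mm: Dict[str, Any], num_requests: int) -> List[Dict[str, Any]]:
    """Model of the reported behavior: every request receives v[0]."""
    return [
        {k: (v[0] if isinstance(v, list) else v) for k, v in mm.items()}
        for _ in range(num_requests)
    ]


def split_per_request(mm: Dict[str, Any], num_requests: int) -> List[Dict[str, Any]]:
    """Per-request indexing: request i receives element i of each list value."""
    return [
        {k: (v[i] if isinstance(v, list) else v) for k, v in mm.items()}
        for i in range(num_requests)
    ]


# Two concurrent requests, each with its own reference codes.
mm = {"ref_code": [[1, 2, 3], [7, 8, 9]]}
broadcast = split_broadcast(mm, 2)
per_request = split_per_request(mm, 2)
```

With broadcasting, the second request ends up with the first request's `ref_code`; with per-request indexing, each request keeps its own.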


codec_codes = audio_codes.transpose(0, 1).cpu().reshape(-1).tolist()
additional_information = None
if ref_code_len > 0:
    additional_information = {"left_context_size": [ref_code_len]}

P1: Pass scalar left_context_size for Code2Wav trimming

The non-async processor stores trim context as {"left_context_size": [ref_code_len]} (a list), but qwen3_tts_code2wav.py consumes left_context_size as an integer and later compares/multiplies it (if ctx_frames > 0, cut = ctx_frames * upsample). With this list payload, Base ICL non-async requests can hit a runtime type error during decode instead of producing audio.
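
The type error is easy to reproduce in isolation. The sketch below mimics only the compare-then-multiply logic quoted in this comment; `trim_samples` is a hypothetical helper, not the function in qwen3_tts_code2wav.py, and the upsample value is arbitrary.

```python
from typing import Optional


def trim_samples(left_context_size, upsample: int) -> int:
    """Compare, then multiply, as in the consumer logic quoted above."""
    ctx_frames = left_context_size
    if ctx_frames > 0:
        return ctx_frames * upsample
    return 0


# A plain int works as intended.
scalar_cut = trim_samples(3, 4)

# A single-element list fails at the comparison in Python 3.
list_error: Optional[TypeError] = None
try:
    trim_samples([3], 4)
except TypeError as exc:
    list_error = exc
```

Python 3 does not define ordering between `list` and `int`, so the list form raises before any audio is trimmed.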


hsliuustc0106 (Collaborator) left a comment

Review Summary

Well-structured bugfix with good test coverage (+115 lines of tests for +58 lines of production code).

What's good:

  • Both async and non-async paths properly handled
  • CPU detach pattern correctly used for state storage
  • Tests cover edge cases (buffering before first emit, only-first-chunk behavior)
  • Return type change is safe (single caller updated in same PR)

One observation (not blocking):

The request-scoped dictionary pattern dynamically adds an attribute to the transfer manager. This works given the request-scoped lifecycle, but consider documenting this contract or adding it as an explicit attribute on the transfer manager class for type safety.

🤖 Reviewed with vllm-omni-review skill

linyueqian (Collaborator) commented Mar 9, 2026

Local testing results (Base ICL, async_chunk mode)

Tested both the PR branch and upstream/main with Qwen/Qwen3-TTS-12Hz-1.7B-Base, using the default qwen3_tts.yaml stage config and the official reference audio (clone_2.wav).

Setup:

  • Server: vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-Base with default async_chunk config
  • Reference audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav
  • Synthesis text: "Good one. Okay, fine, I'm just gonna leave this sock monkey here. Goodbye."

Unit tests: All 21 tests in test_qwen3_tts_async_chunk.py pass.

Audio quality: Both the PR branch and the baseline (main) still produce noisy audio at the beginning of Base ICL output. The first-chunk noise issue does not appear to be resolved by this change.

Note: The first request on both servers generated ~318s of audio (hit max_tokens=4096 without EOS - likely a warmup/compilation issue). The second request produced normal-length (~5s) audio, which was used for comparison. Both had audible noise.
baseline_base_icl_2.wav
fix_base_icl_2.wav

linyueqian (Collaborator) commented

Reproduction steps:

# Start server (GPU 1, async_chunk mode)
VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=1 \
  vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
  --host 0.0.0.0 --port 8092 --trust-remote-code --omni

# Send a warmup request first (first request hits max_tokens without EOS)
# Then send the actual test request:
curl -s http://localhost:8092/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "input": "Good one. Okay, fine, I'\''m just gonna leave this sock monkey here. Goodbye.",
    "voice": "alloy",
    "response_format": "wav",
    "task_type": "Base",
    "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
    "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
  }' -o test_base_icl.wav
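
For scripted testing, the same request can be built in Python, which sidesteps shell quoting of the apostrophe in the input text. This is a minimal sketch using only the standard library; the endpoint, port, and payload fields are copied from the curl command above, while `build_speech_request` and `synthesize` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8092"  # port taken from the serve command above


def build_speech_request(text: str, ref_audio: str, ref_text: str) -> urllib.request.Request:
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {
        "model": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
        "input": text,
        "voice": "alloy",
        "response_format": "wav",
        "task_type": "Base",
        "ref_audio": ref_audio,
        "ref_text": ref_text,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
        method="POST",
    )


def synthesize(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as f:
        f.write(resp.read())


request = build_speech_request(
    text="Good one. Okay, fine, I'm just gonna leave this sock monkey here. Goodbye.",
    ref_audio="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
    ref_text=(
        "Okay. Yeah. I resent you. I love you. I respect you. "
        "But you know what? You blew it! And thanks to you."
    ),
)
# synthesize(request, "test_base_icl.wav")  # requires the server to be running
```

As with the curl flow, a warmup request should be sent before the one used for comparison.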

Sy0307 (Contributor, Author) commented Mar 9, 2026

Confirmed, there is still an issue here. I did some refactoring after my testing was completed, and it seems the current ref_code transmission is broken as a result. I will fix it. Thanks, @linyueqian.

Signed-off-by: Sy03 <1370724210@qq.com>
Sy0307 (Contributor, Author) commented Mar 9, 2026

Please re-check. I verified that the latest version works well on my desktop. Thanks, @linyueqian.

linyueqian (Collaborator) commented Mar 9, 2026

Retested on the latest commit (c83c975). Unit tests all pass (21/21). Served Base ICL with async_chunk using the same setup as before, and the noisy first-chunk issue is gone. There's a tiny glitch on the first real request after warmup, but that's likely just compilation; subsequent requests sound clean. LGTM.

fix_base_icl_pr1731.wav
fix_base_icl_pr1731_3rd.wav

linyueqian merged commit 761eff9 into vllm-project:main on Mar 9, 2026
6 of 7 checks passed
lishunyang12 pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 11, 2026
NickCao added a commit to NickCao/vllm-omni that referenced this pull request Mar 20, 2026
…sync Base path

talker2code2wav() wraps ref_code_len in a list when setting
additional_information["left_context_size"], but the consumer in
Qwen3TTSCode2Wav.forward() expects a plain int (line 287:
"if ctx_frames > 0"). This causes a TypeError when the non-async
Base path is used with max_model_len large enough to accept the
prompt.

The bug was introduced in PR vllm-project#1731 (761eff9, "Fix Base voice clone
streaming quality and stop-token crash") which added ref_code
support to the non-async path. The async chunk path in the same PR
correctly passes left_context_size as a plain int. The bug was
masked by the token overflow crash (max_model_len=32768 < prompt
tokens) which prevented the code from reaching the comparison.

Fixes: vllm-project#2030

Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
NickCao added a commit to NickCao/vllm-omni that referenced this pull request Mar 20, 2026
…av path

Qwen3TTSCode2Wav.forward() compares ctx_frames against 0 (line 287:
"if ctx_frames > 0"), but the non-async Base path passes
left_context_size as a single-element list [ref_code_len] to survive
serialize_additional_information(), which only supports tensor and
list values (plain ints are dropped). The async chunk path bypasses
serialization and passes a plain int directly.

The list wrapper in talker2code2wav() is intentional — without it
the serializer drops the key and ctx_frames silently falls back to
0, causing ref_code context to never be trimmed from the output
audio.

Fix the consumer (Qwen3TTSCode2Wav.forward) to unwrap the list when
present, handling both the serialized list form (non-async) and the
plain int form (async chunk path).

The bug was introduced in PR vllm-project#1731 (761eff9) which added ref_code
support to the non-async path but did not account for the type
mismatch between serialized list and the int comparison downstream.
It was masked by the token overflow crash (max_model_len=32768 <
prompt tokens) which prevented the code from reaching the
comparison.

Fixes: vllm-project#2030

Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
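
The consumer-side fix described in this commit message can be sketched as a small normalizer. The code below is an illustration under assumptions: `toy_serialize` only models the stated behavior of serialize_additional_information() (list values survive, plain ints are dropped; tensors are left out to keep the sketch dependency-free), and `unwrap_left_context` is a hypothetical helper, not the code in Qwen3TTSCode2Wav.forward().

```python
from typing import Any, Dict, List, Union


def toy_serialize(info: Dict[str, Any]) -> Dict[str, Any]:
    """Model of the described serializer: keep list values, drop plain ints."""
    return {k: v for k, v in info.items() if isinstance(v, list)}


def unwrap_left_context(value: Union[int, List[int], None]) -> int:
    """Accept both the serialized list form and the plain int form."""
    if value is None:
        return 0  # key missing: no reference context to trim
    if isinstance(value, (list, tuple)):
        return int(value[0]) if value else 0
    return int(value)


# Non-async path: a plain int would be dropped, so the producer wraps it.
dropped = toy_serialize({"left_context_size": 5})
kept = toy_serialize({"left_context_size": [5]})

non_async_ctx = unwrap_left_context(kept.get("left_context_size"))
async_ctx = unwrap_left_context(5)  # async chunk path passes an int directly
lost_ctx = unwrap_left_context(dropped.get("left_context_size"))
```

Normalizing at the consumer keeps the producer's list wrapper intact, so the key still survives serialization while the comparison downstream always sees an int.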
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Labels

ready label to trigger buildkite CI

3 participants