[Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL #1731
Signed-off-by: Sy03 <1370724210@qq.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4caa8692a
if any(ref_code.numel() > 0 for ref_code in ref_code_prompt_list):
    mm["ref_code"] = ref_code_prompt_list
Keep ref_code per request instead of list broadcasting
make_omni_output now writes ref_code as a Python list, but the AR runner path that builds per-request payloads treats list-valued multimodal outputs as v[0] for every request (gpu_ar_model_runner.py), so concurrent Base ICL requests will all receive the first request’s ref_code and decode with the wrong prompt context. This silently corrupts speaker/context conditioning for batched requests and was introduced by adding mm["ref_code"] as a list here.
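The failure mode described in this comment can be illustrated with a minimal, self-contained sketch. The helper names (split_broadcast, split_per_request) are hypothetical and plain lists stand in for tensors; this is not the code in gpu_ar_model_runner.py, only an illustration of why reading v[0] for every request mixes up prompt contexts.

```python
from typing import Any


def split_broadcast(mm: dict[str, Any], num_requests: int) -> list[dict[str, Any]]:
    """Mimic the problematic behaviour: list values are read as v[0] for every request."""
    payloads = []
    for _ in range(num_requests):
        payloads.append({k: (v[0] if isinstance(v, list) else v) for k, v in mm.items()})
    return payloads


def split_per_request(mm: dict[str, Any], num_requests: int) -> list[dict[str, Any]]:
    """Index list values by request position so each request keeps its own entry."""
    payloads = []
    for i in range(num_requests):
        payload = {}
        for key, value in mm.items():
            if isinstance(value, list):
                if len(value) != num_requests:
                    raise ValueError(f"{key}: expected {num_requests} entries, got {len(value)}")
                payload[key] = value[i]
            else:
                # Non-list values are shared by all requests.
                payload[key] = value
        payloads.append(payload)
    return payloads
```

With two concurrent requests and `mm = {"ref_code": [[1, 2], [3, 4, 5]]}`, the broadcasting variant hands `[1, 2]` to both requests, while the per-request variant gives the second request its own `[3, 4, 5]`.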
codec_codes = audio_codes.transpose(0, 1).cpu().reshape(-1).tolist()
additional_information = None
if ref_code_len > 0:
    additional_information = {"left_context_size": [ref_code_len]}
Pass scalar left_context_size for Code2Wav trimming
The non-async processor stores trim context as {"left_context_size": [ref_code_len]} (a list), but qwen3_tts_code2wav.py consumes left_context_size as an integer and later compares/multiplies it (if ctx_frames > 0, cut = ctx_frames * upsample). With this list payload, Base ICL non-async requests can hit a runtime type error during decode instead of producing audio.
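A minimal sketch of the type mismatch, assuming the consumer logic is roughly the `if ctx_frames > 0` / `cut = ctx_frames * upsample` pattern quoted above. The function name trim_cut and the upsample value used in the example are illustrative and not taken from qwen3_tts_code2wav.py.

```python
def trim_cut(left_context_size, upsample: int) -> int:
    """Return how many waveform samples to drop, written as if the input is an int."""
    ctx_frames = left_context_size
    if ctx_frames > 0:  # raises TypeError when ctx_frames is a list
        return ctx_frames * upsample
    return 0


def describe(left_context_size, upsample: int) -> str:
    """Report either the computed cut or the error the comparison raises."""
    try:
        return f"cut={trim_cut(left_context_size, upsample)}"
    except TypeError as exc:
        return f"TypeError: {exc}"
```

Calling `describe(12, 4)` yields a cut of 48 samples, whereas `describe([12], 4)` reports a TypeError because a list cannot be compared with an int.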
hsliuustc0106 left a comment
Review Summary
Well-structured bugfix with good test coverage (+115 lines tests for +58 lines production code).
What's good:
- Both async and non-async paths properly handled
- CPU detach pattern correctly used for state storage
- Tests cover edge cases (buffering before first emit, only-first-chunk behavior)
- Return type change is safe (single caller updated in same PR)
One observation (not blocking):
The dictionary pattern at dynamically adds an attribute to . This works given the request-scoped lifecycle, but consider documenting this contract or adding as an explicit attribute on the transfer manager class for type safety.
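One way to make that contract explicit is sketched below. TransferManagerSketch and its method names are hypothetical and are not the project's transfer manager class; the sketch only assumes, as the PR description states, that ref_code is held per request until the first Code2Wav chunk is emitted.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class TransferManagerSketch:
    """Hypothetical stand-in with request_payload declared as a typed attribute."""

    request_payload: dict[str, dict[str, Any]] = field(default_factory=dict)

    def stash_ref_code(self, request_id: str, ref_code: list[int]) -> None:
        """Store ref_code for a request; a later call does not overwrite the first value."""
        self.request_payload.setdefault(request_id, {}).setdefault("ref_code", ref_code)

    def pop_ref_code(self, request_id: str) -> Optional[list[int]]:
        """Return the stored ref_code once (when the first chunk is emitted), then forget it."""
        payload = self.request_payload.get(request_id)
        if not payload:
            return None
        ref_code = payload.pop("ref_code", None)
        if not payload:
            # Nothing else is stored for this request, so drop the entry.
            del self.request_payload[request_id]
        return ref_code

    def finish(self, request_id: str) -> None:
        """Drop any leftover state when the request ends."""
        self.request_payload.pop(request_id, None)
```

Declaring the dictionary on the class lets type checkers see it, and the pop-once accessor encodes the "only the first chunk gets ref_code" behaviour in one place.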
🤖 Reviewed with vllm-omni-review skill
Local testing results (Base ICL, async_chunk mode)
Tested both the PR branch and upstream/main with
Setup:
Unit tests: All 21 tests in
Audio quality: Both the PR branch and the baseline (main) still produce noisy audio at the beginning of Base ICL output. The first-chunk noise issue does not appear to be resolved by this change.
Note: The first request on both servers generated ~318s of audio (hit
Reproduction steps:
# Start server (GPU 1, async_chunk mode)
VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=1 \
vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
--host 0.0.0.0 --port 8092 --trust-remote-code --omni
# Send a warmup request first (first request hits max_tokens without EOS)
# Then send the actual test request:
curl -s http://localhost:8092/v1/audio/speech \
-H "Content-Type: application/json" \
-H "Authorization: Bearer EMPTY" \
-d '{
"model": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
"input": "Good one. Okay, fine, I'\''m just gonna leave this sock monkey here. Goodbye.",
"voice": "alloy",
"response_format": "wav",
"task_type": "Base",
"ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
"ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you."
}' -o test_base_icl.wav
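The same request can also be built in Python, which sidesteps shell quoting of the apostrophe in the input text. This is a sketch using only the standard library; the endpoint, headers, and payload are copied from the curl command above, while the function names build_speech_request and save_speech are illustrative.

```python
import json
import urllib.request

# Payload copied from the curl command above.
PAYLOAD = {
    "model": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "input": "Good one. Okay, fine, I'm just gonna leave this sock monkey here. Goodbye.",
    "voice": "alloy",
    "response_format": "wav",
    "task_type": "Base",
    "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
    "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you.",
}


def build_speech_request(base_url: str = "http://localhost:8092") -> urllib.request.Request:
    """Build (but do not send) the POST request for /v1/audio/speech."""
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer EMPTY"},
        method="POST",
    )


def save_speech(out_path: str = "test_base_icl.wav", base_url: str = "http://localhost:8092") -> None:
    """Send the request to a running server and write the returned audio to out_path."""
    with urllib.request.urlopen(build_speech_request(base_url)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as out:
        out.write(audio)
```

Calling `save_speech()` against the server started above writes the response to test_base_icl.wav, matching the curl invocation.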
Confirmed, there is an issue here. It seems that some refactoring I did after testing was completed left the current ref_code transmission still causing problems. I will fix this again. Thanks @linyueqian
Signed-off-by: Sy03 <1370724210@qq.com>
Please re-check. I verified that the latest version works well on my desktop. @linyueqian Thanks.
Retested on latest commit ( |
…-project#1731) Signed-off-by: lishunyang <lishunyang12@163.com>
…sync Base path
talker2code2wav() wraps ref_code_len in a list when setting additional_information["left_context_size"], but the consumer in Qwen3TTSCode2Wav.forward() expects a plain int (line 287: "if ctx_frames > 0"). This causes a TypeError when the non-async Base path is used with max_model_len large enough to accept the prompt.
The bug was introduced in PR vllm-project#1731 (761eff9, "Fix Base voice clone streaming quality and stop-token crash") which added ref_code support to the non-async path. The async chunk path in the same PR correctly passes left_context_size as a plain int. The bug was masked by the token overflow crash (max_model_len=32768 < prompt tokens) which prevented the code from reaching the comparison.
Fixes: vllm-project#2030
Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
…av path
Qwen3TTSCode2Wav.forward() compares ctx_frames against 0 (line 287: "if ctx_frames > 0"), but the non-async Base path passes left_context_size as a single-element list [ref_code_len] to survive serialize_additional_information(), which only supports tensor and list values (plain ints are dropped). The async chunk path bypasses serialization and passes a plain int directly.
The list wrapper in talker2code2wav() is intentional: without it the serializer drops the key and ctx_frames silently falls back to 0, causing ref_code context to never be trimmed from the output audio.
Fix the consumer (Qwen3TTSCode2Wav.forward) to unwrap the list when present, handling both the serialized list form (non-async) and the plain int form (async chunk path).
The bug was introduced in PR vllm-project#1731 (761eff9) which added ref_code support to the non-async path but did not account for the type mismatch between serialized list and the int comparison downstream. It was masked by the token overflow crash (max_model_len=32768 < prompt tokens) which prevented the code from reaching the comparison.
Fixes: vllm-project#2030
Signed-off-by: Nick Cao <ncao@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>
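The interaction described in this commit message can be sketched as follows. serialize_sketch is a hypothetical stand-in that only mirrors the stated behaviour of serialize_additional_information() (list values survive, plain ints are dropped; tensor handling is omitted here), and normalize_left_context is an illustrative name for the unwrapping step, not the actual code in Qwen3TTSCode2Wav.forward().

```python
from typing import Any, Union


def serialize_sketch(info: dict[str, Any]) -> dict[str, Any]:
    """Keep only list values, mimicking a serializer that drops plain ints."""
    return {key: value for key, value in info.items() if isinstance(value, list)}


def normalize_left_context(value: Union[int, list[int], None]) -> int:
    """Accept the serialized list form or the plain int form and return an int."""
    if value is None:
        # Key missing: no reference context to trim.
        return 0
    if isinstance(value, (list, tuple)):
        return int(value[0]) if value else 0
    return int(value)


def context_frames(info: dict[str, Any]) -> int:
    """Read left_context_size from a payload in either form."""
    return normalize_left_context(info.get("left_context_size"))
```

Under these assumptions a plain int does not survive serialization (so the consumer would see 0 frames), the list form survives and unwraps to the original length, and the async path's plain int passes through unchanged.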
Purpose
Fix noisy first-chunk audio for Qwen3-TTS Base ICL in the multi-stage pipeline.
The official offline Qwen3-TTS Base voice-cloning path decodes ref_code + generated_codes and trims the reference prefix from the final waveform. That gives Code2Wav the same acoustic prefix context used by ICL prompt construction.
The multi-stage pipeline was not preserving that behavior:
- the talker used ref_code when building the Base ICL prompt
- Code2Wav decoded only the generated audio_codes
This showed up as noisy / unstable audio at the beginning of Base ICL outputs.
x_vector_only_mode=True was unaffected because that mode only conditions on speaker embedding and does not rely on ref_code as decoder-side prefix context.
This PR restores the missing decoder context by:
- preserving ref_code in the talker runtime/intermediate output for Base ICL
- keeping ref_code at request scope until the first Code2Wav chunk is emitted
- prepending ref_code to the first Code2Wav input window
Implementation note:
- qwen3_omni
- ref_code is kept in transfer_manager.request_payload[request_id] until the first chunk is actually emitted
Root Cause Analysis
This PR follows up on the root-cause discussion in a comment on PR #1719. Thanks @iancarrasco-b10 :)
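The decode-then-trim behaviour described under Purpose can be sketched as below. fake_code2wav is a toy decoder that emits `upsample` samples per codec frame; the function names and the upsample factor are assumptions for illustration, and real Code2Wav decoding operates on tensors, not Python lists.

```python
def fake_code2wav(codes: list[int], upsample: int) -> list[float]:
    """Toy decoder: each codec frame becomes `upsample` identical samples."""
    wav: list[float] = []
    for code in codes:
        wav.extend([float(code)] * upsample)
    return wav


def decode_with_ref_context(ref_code: list[int], generated_codes: list[int], upsample: int) -> list[float]:
    """Decode ref_code + generated codes, then cut the reference prefix from the waveform."""
    wav = fake_code2wav(list(ref_code) + list(generated_codes), upsample)
    cut = len(ref_code) * upsample  # samples that belong to the reference prefix
    return wav[cut:]
```

In this toy decoder the frames are independent, so the prefix only changes what gets trimmed; in a real vocoder the prefix is what supplies the acoustic context for the first generated frames.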
The issue was not caused by WebSocket transport or async chunk scheduling itself. The underlying problem was that Base ICL lost the decoder-side reference codec prefix when going through the multi-stage pipeline:
- the talker used ref_code to build the Base ICL prompt
- Code2Wav decoded only the generated audio_codes
That mismatch explains why:
- x_vector_only_mode=True did not show the same issue, because it only uses speaker embedding and does not require ref_code to be prepended before Code2Wav decoding
- the step where ref_code first appears is not guaranteed to be the same step where the first chunk is flushed
Test Plan
Run targeted stage input processor tests:
Test Result
Passed locally on this branch:
PTAL @linyueqian