[Fix] [Qwen3-TTS] Qwen3-TTS streaming chunk-boundary artifacts#2480
Signed-off-by: Sy03 <1370724210@qq.com>
gcanlin
left a comment
Would it lead to performance degradation?
After some simple tests, I verified that it adds about 1% to e2e latency. More test results will follow.
After further investigation and many experiments, I do not think this fix will solve the noise issue completely. I will keep working to find the root cause, which may be related to the cuda_graph mode in the talker stage: different cuda_graph modes can result in different numerical precision.
hsliuustc0106
left a comment
Good fix with regression tests covering context trimming on exact frame boundaries. The codec_left_context_frames alignment with decoder sliding window (72 vs 25) should resolve the streaming chunk-boundary noise reported in #2439.
linyueqian
left a comment
LGTM. Verified streaming A/B test on H20 — before (codec_left_context_frames=25) and after (72) both produce clean audio for CustomVoice. The exact frame-boundary trimming and decoder sliding window alignment look correct.
Testing & Performance Report
Audio quality
Why this method works
Streaming chunk-boundary distortion manifests as localized spectral envelope anomalies: brief moments where the energy distribution across frequencies deviates from what the model would produce in a single-shot (non-streaming) decode. These anomalies are typically <100ms, making global waveform metrics (UTMOS/PESQ) insensitive to them, but they are audible as clicks, buzzing, or timbral shifts.
I built a DTW-based spectral analysis pipeline to detect and quantify these. The idea is straightforward: extract 80-bin log-mel spectrograms and 13-dim MFCCs from both the streaming output and the HF baseline, align them in MFCC space via band-limited DTW (Sakoe-Chiba) to remove timing differences, then compute per-frame log-mel L2 distance along the aligned path. A spike in this distance curve means the spectral shape at that moment differs significantly from baseline. Silence regions are gated out (mel-energy below p10) to avoid false positives, and contiguous regions exceeding the p95 distance threshold for ≥50ms are flagged as candidate distortion segments.
Verification
I first identified distortion segments by A/B listening between ctx25 and ctx72 outputs from the same pipeline, then measured each against the HF baseline at the same timestamps. If ctx25 drifts further from baseline than ctx72, the distortion is real and ctx-dependent.
Results on 7 high-confidence segments:
Confirms that
Performance
Same machine/GPU, same prompts
bs1 (single concurrency): 5 rounds × 30 prompts
bs16 (concurrent): 3 rounds × 50 prompts
Overall ctx72 introduces ~1–3% performance overhead, within acceptable range for eliminating the boundary artifacts reported in #2439.
Unit tests
Added
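For anyone who wants to reproduce the spectral comparison described under "Why this method works", here is a minimal pure-Python sketch. It is not the pipeline used for the report. It assumes the log-mel and MFCC frames were already extracted with an audio library and are passed in as lists of float vectors. The function names, the 10 ms hop, the band width, the nearest-rank percentile, and the use of mean log-mel value per baseline frame as the silence gate are all assumptions made for illustration.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def banded_dtw_path(ref, test, band):
    """DTW under a Sakoe-Chiba band; returns [(ref_idx, test_idx), ...]."""
    n, m = len(ref), len(test)
    # Widen the band if needed so the end cell (n, m) stays reachable.
    band = max(band, abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = l2(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end cell, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def flag_segments(ref_mel, test_mel, ref_mfcc, test_mfcc,
                  hop_ms=10.0, band=50, min_ms=50.0):
    """Return (first, last) test-frame ranges that look like distortion."""
    # Align in MFCC space, then measure log-mel distance along the path.
    path = banded_dtw_path(ref_mfcc, test_mfcc, band)
    dist = [l2(ref_mel[i], test_mel[j]) for i, j in path]
    # Silence gate (assumption): drop steps whose baseline energy is below p10.
    energy = [sum(ref_mel[i]) / len(ref_mel[i]) for i, _ in path]
    gate = percentile(energy, 10)
    voiced = [e >= gate for e in energy]
    # Threshold at p95 of the distances that survived the gate.
    thresh = percentile([d for d, v in zip(dist, voiced) if v], 95)
    hot = [v and d > thresh for d, v in zip(dist, voiced)]
    segments, start = [], None
    for k, flag in enumerate(hot + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = k
        elif not flag and start is not None:
            first, last = path[start][1], path[k - 1][1]
            if (last - first + 1) * hop_ms >= min_ms:
                segments.append((first, last))
            start = None
    return segments


# Toy check: identical features except a 6-frame (60 ms) spectral deviation.
ref_mfcc = [[float(t)] for t in range(200)]
ref_mel = [[1.0, 2.0, 3.0] for _ in range(200)]
test_mel = [[1.0, 2.0, 8.0] if 100 <= t <= 105 else [1.0, 2.0, 3.0]
            for t in range(200)]
segments = flag_segments(ref_mel, test_mel, ref_mfcc, ref_mfcc)
print(segments)
```

Running the same function once with the ctx25 output and once with the ctx72 output as the test signal, against the same baseline, gives two sets of flagged ranges that can be compared at the timestamps picked by listening.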
With codec_left_context_frames=25, I found that some audio still has noise. After switching to 72, the noisy audio seems to go away. (I generated over 50 test audio clips for each of 25 and 72.)
…project#2480) Signed-off-by: Sy03 <1370724210@qq.com>
Summary
Align codec_left_context_frames with the decoder sliding window
Trim qwen3_tts_code2wav outputs on exact frame boundaries instead of proportional slicing
Why
Issue #2439 reports noisy/distorted streaming outputs for Qwen3-TTS 0.6B CustomVoice. The current streaming config uses codec_left_context_frames: 25, while the 12Hz decoder uses sliding_window = 72. That mismatch can leave too little left context at chunk boundaries. The old waveform trim path also used proportional slicing, which can misalign the decoded output when context frames are present.
Verification
ruff format --check vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code2wav.py tests/model_executor/models/qwen3_tts/test_qwen3_tts_code2wav.py
pytest tests/model_executor/models/qwen3_tts/test_qwen3_tts_code2wav.py
Notes
ctx25 path but did not match the worst uploaded sample on every run
cc @linyueqian
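To illustrate why proportional slicing can misalign the output while exact frame-boundary trimming does not, here is a small sketch. It is not the code in qwen3_tts_code2wav.py. The helper names, the toy value of 4 samples per frame, and the assumption that the decoder returns two extra tail samples are all hypothetical.

```python
def left_context(chunk_start, left_context_frames):
    """Frames of already-decoded history to prepend to a chunk (hypothetical helper)."""
    return min(chunk_start, left_context_frames)


def trim_proportional(wav, ctx_frames, total_frames):
    """Old-style trim: cut at the context's share of the decoded length."""
    cut = len(wav) * ctx_frames // total_frames
    return wav[cut:]


def trim_exact(wav, ctx_frames, new_frames, samples_per_frame):
    """Frame-boundary trim: cut at whole multiples of samples_per_frame."""
    start = ctx_frames * samples_per_frame
    end = start + new_frames * samples_per_frame
    return wav[start:end]


# Toy numbers (assumptions): 4 samples per codec frame, a chunk of 2 new frames
# that starts at frame 3, so only 3 frames of left context are available.
samples_per_frame = 4
new = 2
ctx = left_context(chunk_start=3, left_context_frames=72)

# Pretend the decoder returned 22 samples instead of (3 + 2) * 4 = 20.
wav = list(range(22))

# Proportional slicing cuts at 22 * 3 // 5 = 13, one sample past the boundary.
print(trim_proportional(wav, ctx, ctx + new)[0])
# Exact trimming keeps samples 12..19, i.e. exactly the 2 new frames.
print(trim_exact(wav, ctx, new, samples_per_frame))
```

Whenever the decoded length is not an exact multiple of the frame count, the proportional cut drifts off the frame boundary, so consecutive chunks no longer join at the same sample.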