[Fix] [Qwen3-TTS] Qwen3-TTS streaming chunk-boundary artifacts#2480

Merged
linyueqian merged 2 commits into
vllm-project:mainfrom
Sy0307:sy03/issue-2439-qwen3-tts-streaming-fix
Apr 5, 2026

Conversation

@Sy0307
Contributor

@Sy0307 Sy0307 commented Apr 4, 2026

Summary

  • align Qwen3-TTS streaming codec_left_context_frames with the decoder sliding window
  • trim qwen3_tts_code2wav outputs on exact frame boundaries instead of proportional slicing
  • add unit tests for context trimming behavior

Why

Issue #2439 reports noisy/distorted streaming outputs for Qwen3-TTS 0.6B CustomVoice. The current streaming config uses codec_left_context_frames: 25, while the 12Hz decoder uses sliding_window = 72. That mismatch can leave too little left context at chunk boundaries. The old waveform trim path also used proportional slicing, which can misalign the decoded output when context frames are present.
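The difference between proportional slicing and exact frame-boundary trimming described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function names are hypothetical, and `SAMPLES_PER_FRAME = 1920` is an assumption (24 kHz audio at 12.5 codec frames per second) that should be replaced with the decoder's real upsampling factor.

```python
SAMPLES_PER_FRAME = 1920  # assumption: 24 kHz audio at 12.5 codec frames per second


def proportional_trim_start(num_samples, context_frames, total_frames):
    """Proportional cut: scale the context share of frames onto the waveform length."""
    return num_samples * context_frames // total_frames


def exact_trim_start(context_frames, samples_per_frame=SAMPLES_PER_FRAME):
    """Frame-exact cut: every codec frame maps to a fixed number of samples."""
    return context_frames * samples_per_frame


def trim_left_context(wav, context_frames, samples_per_frame=SAMPLES_PER_FRAME):
    """Drop the samples that belong to the left-context frames."""
    return wav[exact_trim_start(context_frames, samples_per_frame):]
```

The two cuts agree only when the decoded waveform is exactly `total_frames * samples_per_frame` samples long. If the decoder returns a slightly shorter or longer waveform, the proportional cut drifts off the frame boundary, while the exact cut does not depend on the waveform length at all.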

Verification

Notes

  • the issue is conditional rather than trivially deterministic; local runs reproduced instability on the ctx25 path but did not match the worst uploaded sample on every run
  • this patch is intended as a robustness fix for streaming chunk boundaries

cc @linyueqian

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner April 4, 2026 09:07
Collaborator

@gcanlin gcanlin left a comment

Would it lead to performance degradation?

Comment thread vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml
@Sy0307 Sy0307 marked this pull request as draft April 4, 2026 13:07
@Sy0307
Contributor Author

Sy0307 commented Apr 4, 2026

Would it lead to performance degradation?

After some simple tests, I measured roughly a 1% increase in E2E latency. More test results will follow.

@Sy0307
Contributor Author

Sy0307 commented Apr 4, 2026

After further investigation and many experiments, I do not think this fix will solve the noise issue completely. I will keep looking for the root cause, which may be related to the cuda_graph mode in the talker stage: different cuda_graph modes can yield different numerical precision.

@linyueqian linyueqian self-requested a review April 5, 2026 05:29
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Good fix with regression tests covering context trimming on exact frame boundaries. The codec_left_context_frames alignment with decoder sliding window (72 vs 25) should resolve the streaming chunk-boundary noise reported in #2439.

@Sy0307 Sy0307 marked this pull request as ready for review April 5, 2026 16:58
@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 5, 2026
Collaborator

@linyueqian linyueqian left a comment

LGTM. Verified streaming A/B test on H20 — before (codec_left_context_frames=25) and after (72) both produce clean audio for CustomVoice. The exact frame-boundary trimming and decoder sliding window alignment look correct.

@Sy0307
Contributor Author

Sy0307 commented Apr 5, 2026

Testing & Performance Report

Audio quality

Why this method works

Streaming chunk-boundary distortion manifests as localized spectral envelope anomalies — brief moments where the energy distribution across frequencies deviates from what the model would produce in a single-shot (non-streaming) decode. These anomalies are typically <100ms, making global waveform metrics (UTMOS/PESQ) insensitive to them, but they are audible as clicks, buzzing, or timbral shifts.

I built a DTW-based spectral analysis pipeline to detect and quantify these. The idea is straightforward: extract 80-bin log-mel spectrograms and 13-dim MFCCs from both the streaming output and the HF baseline, align them in MFCC space via band-limited DTW (Sakoe-Chiba) to remove timing differences, then compute per-frame log-mel L2 distance along the aligned path. A spike in this distance curve means the spectral shape at that moment differs significantly from baseline. Silence regions are gated out (mel-energy below p10) to avoid false positives, and contiguous regions exceeding the p95 distance threshold for ≥50ms are flagged as candidate distortion segments.
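The alignment and flagging steps described above can be sketched in plain Python. This is a hedged illustration rather than the pipeline actually used for this report: all function names are hypothetical, feature extraction (log-mel, MFCC) and silence gating are omitted, and the inputs are assumed to be lists of per-frame feature vectors.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def banded_dtw_path(xs, ys, band):
    """DTW alignment path between xs and ys under a Sakoe-Chiba band."""
    n, m = len(xs), len(ys)
    band = max(band, abs(n - m))  # band must be wide enough to reach the end cell
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = l2(xs[i - 1], ys[j - 1]) + best
    # Backtrack from the end cell to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    return path[::-1]


def aligned_mel_distances(mfcc_a, mfcc_b, mel_a, mel_b, band):
    """Align in MFCC space, then measure log-mel L2 distance along the path."""
    path = banded_dtw_path(mfcc_a, mfcc_b, band)
    return [l2(mel_a[i], mel_b[j]) for i, j in path]


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def flag_segments(dists, threshold, min_frames):
    """Index ranges (start, end) where dists stays above threshold long enough."""
    segments, start = [], None
    for k, d in enumerate(list(dists) + [float("-inf")]):  # sentinel closes a trailing run
        if d > threshold and start is None:
            start = k
        elif d <= threshold and start is not None:
            if k - start >= min_frames:
                segments.append((start, k))
            start = None
    return segments
```

A typical call would compute `dists = aligned_mel_distances(...)`, take `threshold = percentile(dists, 95)`, and pass a `min_frames` value that corresponds to 50 ms at the chosen hop size (for example 5 frames with a 10 ms hop, which is an assumption here).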

Verification

I first identified distortion segments by A/B listening between ctx25 and ctx72 outputs from the same pipeline, then measured each against the HF baseline at the same timestamps. If ctx25 drifts further from baseline than ctx72, the distortion is real and ctx-dependent.

Results on 7 high-confidence segments:

  • 5/7 windows show ctx72 is significantly closer to HF baseline
  • Mean spectral distance reduction: ~20% (median ~19%)
  • Best case: one segment dropped ~46% (35.5 → 19.0)

These results indicate that codec_left_context_frames=25 was insufficient for the decoder's sliding_window=72, causing spectral anomalies at chunk boundaries.
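To make the context shortfall concrete, here is a small sketch of how much left context each streamed chunk receives. The helper `chunk_windows` is hypothetical and only models frame indices; it is not the scheduler used in the codebase.

```python
def chunk_windows(total_frames, chunk_frames, left_context_frames):
    """Return (context_start, chunk_start, chunk_end) frame indices per chunk."""
    windows = []
    for start in range(0, total_frames, chunk_frames):
        end = min(start + chunk_frames, total_frames)
        context_start = max(0, start - left_context_frames)
        windows.append((context_start, start, end))
    return windows


# With 25-frame chunks, the fourth chunk sees 25 context frames under ctx25
# but 72 context frames under ctx72.
ctx25 = chunk_windows(100, 25, 25)
ctx72 = chunk_windows(100, 25, 72)
```

Under these assumptions, a decoder whose attention spans 72 frames never sees a full window with ctx25 once the stream is past the first chunks, whereas ctx72 supplies the full window as soon as enough frames exist.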

Performance

Same machine/GPU, same prompts, codec_chunk_frames=25 constant.

bs1 (single concurrency): 5 rounds × 30 prompts

| Metric | ctx25 | ctx72 | Δ |
|---|---|---|---|
| TTFP (ms) | 38.29 | 38.81 | +1.4% |
| E2E (ms) | 690.92 | 701.73 | +1.6% |
| RTF | 0.12 | 0.12 | +1.5% |
| Audio throughput (s/s) | 8.21 | 8.09 | -1.5% |
| Request throughput (r/s) | 1.45 | 1.42 | -1.5% |
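For reference, the Δ column is a relative change against the ctx25 baseline. A minimal sketch (the function name is hypothetical):

```python
def relative_delta_percent(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


ttfp_delta = round(relative_delta_percent(38.29, 38.81), 1)   # TTFP (ms)
e2e_delta = round(relative_delta_percent(690.92, 701.73), 1)  # E2E (ms)
```

Rows shown with only two decimals, such as RTF and request throughput, cannot be reproduced from the displayed values; the reported deltas were presumably computed from unrounded measurements.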

bs16 (concurrent): 3 rounds × 50 prompts

| Metric | concurrency=1 Δ | concurrency=10 Δ |
|---|---|---|
| E2E | +0.0% | +3.6% |
| RTF | +0.5% | +3.0% |
| Audio throughput | -0.5% | -2.6% |
| Request throughput | -0.0% | -3.3% |

Overall, ctx72 introduces ~1–4% performance overhead (worst case +3.6% E2E at concurrency=10), which is within an acceptable range for eliminating the boundary artifacts reported in #2439.

Unit tests

Added test_qwen3_tts_code2wav.py covering exact frame-boundary trimming with and without left context.

@Sy0307
Contributor Author

Sy0307 commented Apr 5, 2026

LGTM. Verified streaming A/B test on H20 — before (codec_left_context_frames=25) and after (72) both produce clean audio for CustomVoice. The exact frame-boundary trimming and decoder sliding window alignment look correct.

With codec_left_context_frames=25, I found that some audio samples still contain noise. After switching to 72, the noisy samples seem to go away. (I generated over 50 test audio samples for each of 25 and 72.)

@linyueqian linyueqian merged commit 025408f into vllm-project:main Apr 5, 2026
8 checks passed
skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

Labels

ready label to trigger buildkite CI

4 participants