[Fix] [Qwen3-TTS] Qwen3-TTS streaming chunk-boundary artifacts by Sy0307 · Pull Request #2480 · vllm-project/vllm-omni

Sy0307 · 2026-04-04T09:07:12Z

Summary

align Qwen3-TTS streaming codec_left_context_frames with the decoder sliding window
trim qwen3_tts_code2wav outputs on exact frame boundaries instead of proportional slicing
add unit tests for context trimming behavior

Why

Issue #2439 reports noisy/distorted streaming outputs for Qwen3-TTS 0.6B CustomVoice. The current streaming config uses codec_left_context_frames: 25, while the 12Hz decoder uses sliding_window = 72. That mismatch can leave too little left context at chunk boundaries. The old waveform trim path also used proportional slicing, which can misalign the decoded output when context frames are present.
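To make the trimming point concrete, here is a minimal sketch contrasting the two strategies. It is not the code in qwen3_tts_code2wav.py: the helper names, the `SAMPLES_PER_FRAME` value of 1920, and the 240 extra tail samples are assumptions chosen only for illustration.

```python
# Illustrative only: names and constants below are hypothetical, not taken from the patch.
SAMPLES_PER_FRAME = 1920  # assumed decoder upsampling factor (samples per codec frame)


def trim_proportional(wav, ctx_frames, total_frames):
    """Old-style trim: drop a share of the waveform proportional to the context frames."""
    cut = int(len(wav) * ctx_frames / total_frames)
    return wav[cut:]


def trim_exact(wav, ctx_frames, samples_per_frame=SAMPLES_PER_FRAME):
    """Frame-aligned trim: drop exactly ctx_frames * samples_per_frame samples."""
    return wav[ctx_frames * samples_per_frame:]


# 72 context frames + 25 new frames, decoded to a waveform whose length is NOT an
# exact multiple of the frame count (here: 240 hypothetical extra tail samples).
ctx_frames, new_frames = 72, 25
total_frames = ctx_frames + new_frames
wav = list(range(total_frames * SAMPLES_PER_FRAME + 240))  # sample value == sample index

exact = trim_exact(wav, ctx_frames)
proportional = trim_proportional(wav, ctx_frames, total_frames)

# The exact trim starts on the frame boundary; the proportional one starts a
# little later, so some samples of the first new frame are lost.
print(exact[0], proportional[0], proportional[0] - exact[0])
```

Whenever the decoded length deviates from `total_frames * samples_per_frame`, the proportional cut point drifts off the frame boundary, and the drift grows with the number of context frames.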

Verification

ruff format --check vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code2wav.py tests/model_executor/models/qwen3_tts/test_qwen3_tts_code2wav.py
pytest tests/model_executor/models/qwen3_tts/test_qwen3_tts_code2wav.py
local streaming repro and audio-quality analysis for issue #2439 ([Bug]: QWen3-TTS, 0.6B-Custom generate audio with some noise in the audio.)

Notes

the issue is conditional rather than trivially deterministic; local runs reproduced instability on the ctx25 path but did not match the worst uploaded sample on every run
this patch is intended as a robustness fix for streaming chunk boundaries

cc @linyueqian

Signed-off-by: Sy03 <1370724210@qq.com>

gcanlin

Would it lead to performance degradation?

Sy0307 · 2026-04-04T13:08:57Z

> Would it lead to performance degradation?

After some simple tests, I verified that it adds about 1% to e2e latency. More test results will follow.

Sy0307 · 2026-04-04T21:54:15Z

After further investigation and many experiments, I do not think this fix will solve the noise issue completely. I will keep looking for the root cause, which may be related to the cuda_graph mode in the talker stage: different cuda_graph modes can lead to different numerical precision.

hsliuustc0106

Good fix with regression tests covering context trimming on exact frame boundaries. The codec_left_context_frames alignment with decoder sliding window (72 vs 25) should resolve the streaming chunk-boundary noise reported in #2439.

linyueqian

LGTM. Verified streaming A/B test on H20 — before (codec_left_context_frames=25) and after (72) both produce clean audio for CustomVoice. The exact frame-boundary trimming and decoder sliding window alignment look correct.

Sy0307 · 2026-04-05T18:22:21Z

Testing & Performance Report

Audio quality

Why this method works

Streaming chunk-boundary distortion manifests as localized spectral envelope anomalies — brief moments where the energy distribution across frequencies deviates from what the model would produce in a single-shot (non-streaming) decode. These anomalies are typically <100ms, making global waveform metrics (UTMOS/PESQ) insensitive to them, but they are audible as clicks, buzzing, or timbral shifts.

I built a DTW-based spectral analysis pipeline to detect and quantify these. The idea is straightforward: extract 80-bin log-mel spectrograms and 13-dim MFCCs from both the streaming output and the HF baseline, align them in MFCC space via band-limited DTW (Sakoe-Chiba) to remove timing differences, then compute per-frame log-mel L2 distance along the aligned path. A spike in this distance curve means the spectral shape at that moment differs significantly from baseline. Silence regions are gated out (mel-energy below p10) to avoid false positives, and contiguous regions exceeding the p95 distance threshold for ≥50ms are flagged as candidate distortion segments.
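The alignment and flagging steps described above can be sketched roughly as follows. This is not the author's pipeline: feature extraction (log-mel, MFCC) and the silence mask are assumed to be computed elsewhere, and the function names, the 10 ms hop, and the pure-Python implementation are assumptions made for the sake of a self-contained example.

```python
import math


def banded_dtw_path(a, b, band):
    """Align feature sequences a and b (lists of equal-dim vectors) with DTW
    restricted to a Sakoe-Chiba band of half-width `band` frames."""
    n, m = len(a), len(b)
    band = max(band, abs(n - m))  # keep the end cell reachable
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end cell; tuple ordering prefers the diagonal on ties.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    return path[::-1]


def path_distances(mel_a, mel_b, path):
    """Per-step L2 distance between log-mel frames along the aligned path."""
    return [math.dist(mel_a[i], mel_b[j]) for i, j in path]


def percentile(values, pct):
    """Percentile with linear interpolation between neighbouring ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def flag_segments(dist, active, hop_ms=10.0, pct=95.0, min_ms=50.0):
    """Return (start, end) index ranges where non-silent steps stay above the
    pct-th percentile of distance for at least min_ms (hop_ms per step is assumed)."""
    thresh = percentile([d for d, on in zip(dist, active) if on], pct)
    hot = [on and d > thresh for d, on in zip(dist, active)] + [False]
    segments, start = [], None
    for k, is_hot in enumerate(hot):
        if is_hot and start is None:
            start = k
        elif not is_hot and start is not None:
            if (k - start) * hop_ms >= min_ms:
                segments.append((start, k))
            start = None
    return segments
```

In use, `banded_dtw_path` would be called on the two MFCC sequences, `path_distances` on the two log-mel sequences with the resulting path, and `flag_segments` on those distances together with a boolean mask that is False for gated silence. Segment indices refer to steps along the aligned path, so converting them to timestamps with a fixed hop is only approximate.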

Verification

I first identified distortion segments by A/B listening between ctx25 and ctx72 outputs from the same pipeline, then measured each against the HF baseline at the same timestamps. If ctx25 drifts further from baseline than ctx72, the distortion is real and ctx-dependent.

Results on 7 high-confidence segments:

5/7 windows show ctx72 is significantly closer to HF baseline
Mean spectral distance reduction: ~20% (median ~19%)
Best case: one segment dropped ~46% (35.5 → 19.0)

This confirms that codec_left_context_frames=25 was insufficient for the decoder's sliding_window=72, causing spectral anomalies at chunk boundaries.
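As a rough illustration of why 25 context frames fall short of a 72-frame sliding window, the sketch below lists how much left context each streamed chunk receives. It is a simplified model, not the vllm-omni scheduler: the function name `stream_windows` and the 100-frame utterance are hypothetical.

```python
def stream_windows(num_frames, chunk_frames=25, left_context_frames=72):
    """Yield (start, end, ctx) per chunk: the decoder is fed frames[start - ctx:end]
    and only the audio for frames[start:end] is kept after trimming."""
    for start in range(0, num_frames, chunk_frames):
        end = min(start + chunk_frames, num_frames)
        ctx = min(left_context_frames, start)  # cannot reach back before frame 0
        yield start, end, ctx


SLIDING_WINDOW = 72  # decoder sliding window quoted in the PR description

for ctx_setting in (25, 72):
    contexts = [ctx for _, _, ctx in stream_windows(100, 25, ctx_setting)]
    # Frames of history that a single-shot decode would see but streaming lacks.
    missing = [max(0, min(SLIDING_WINDOW, start) - ctx)
               for start, _, ctx in stream_windows(100, 25, ctx_setting)]
    print(ctx_setting, contexts, missing)
```

With the 25-frame setting, the third and fourth chunks are decoded with 25 and 47 fewer frames of history than a single-shot decode would give them; with 72 the shortfall is zero for every chunk in this example.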

Performance

Same machine/GPU, same prompts, codec_chunk_frames=25 constant.

bs1 (single concurrency): 5 rounds × 30 prompts

| Metric | ctx25 | ctx72 | Δ |
| --- | --- | --- | --- |
| TTFP (ms) | 38.29 | 38.81 | +1.4% |
| E2E (ms) | 690.92 | 701.73 | +1.6% |
| RTF | 0.12 | 0.12 | +1.5% |
| Audio throughput (s/s) | 8.21 | 8.09 | -1.5% |
| Request throughput (r/s) | 1.45 | 1.42 | -1.5% |
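For reference, the Δ column is a relative change against the ctx25 value. A small helper (the name `rel_change_pct` is hypothetical) reproduces the TTFP and E2E rows from the rounded figures; rows whose inputs are rounded to two digits, such as RTF, cannot be recomputed exactly from the table alone.

```python
def rel_change_pct(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from the bs1 table (ctx25 -> ctx72).
ttfp_delta = rel_change_pct(38.29, 38.81)
e2e_delta = rel_change_pct(690.92, 701.73)
print(f"TTFP {ttfp_delta:+.1f}%  E2E {e2e_delta:+.1f}%")
```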

bs16 (concurrent): 3 rounds × 50 prompts

| Metric | concurrency=1 Δ | concurrency=10 Δ |
| --- | --- | --- |
| E2E | +0.0% | +3.6% |
| RTF | +0.5% | +3.0% |
| Audio throughput | -0.5% | -2.6% |
| Request throughput | -0.0% | -3.3% |

Overall, ctx72 introduces roughly 1 to 3% performance overhead on most metrics (at most +3.6% E2E at concurrency=10), which is within an acceptable range for reducing the boundary artifacts reported in #2439.

Unit tests

Added test_qwen3_tts_code2wav.py covering exact frame-boundary trimming with and without left context.

Sy0307 · 2026-04-05T18:24:41Z

> LGTM. Verified streaming A/B test on H20: before (codec_left_context_frames=25) and after (72) both produce clean audio for CustomVoice. The exact frame-boundary trimming and decoder sliding window alignment look correct.

With codec_left_context_frames=25, I found that some audio still had noise. After switching to 72, the noisy outputs seem to go away. (I generated over 50 test clips for each of 25 and 72.)

Fix Qwen3-TTS streaming chunk-boundary artifacts

03b2f7f

Signed-off-by: Sy03 <1370724210@qq.com>

Sy0307 requested a review from hsliuustc0106 as a code owner April 4, 2026 09:07

gcanlin reviewed Apr 4, 2026

Comment thread vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml

Sy0307 marked this pull request as draft April 4, 2026 13:07

linyueqian self-requested a review April 5, 2026 05:29

hsliuustc0106 approved these changes Apr 5, 2026

Merge branch 'main' into sy03/issue-2439-qwen3-tts-streaming-fix

a9e4595

Sy0307 marked this pull request as ready for review April 5, 2026 16:58

linyueqian added the "ready" label (label to trigger buildkite CI) Apr 5, 2026

linyueqian approved these changes Apr 5, 2026

linyueqian merged commit 025408f into vllm-project:main Apr 5, 2026
8 checks passed

linyueqian mentioned this pull request Apr 5, 2026

[Bug]: QWen3-TTS, 0.6B-Custom generate audio with some noise in the audio. #2439

Closed

1 task

skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026

[Fix] [Qwen3-TTS] Qwen3-TTS streaming chunk-boundary artifacts (vllm-…

5ba0492

…project#2480) Signed-off-by: Sy03 <1370724210@qq.com>

vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026

[Fix] [Qwen3-TTS] Qwen3-TTS streaming chunk-boundary artifacts (vllm-…

ff736a3

…project#2480) Signed-off-by: Sy03 <1370724210@qq.com>

lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026

[Fix] [Qwen3-TTS] Qwen3-TTS streaming chunk-boundary artifacts (vllm-…

2aa9afc

…project#2480) Signed-off-by: Sy03 <1370724210@qq.com>

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Fix] [Qwen3-TTS] Qwen3-TTS streaming chunk-boundary artifacts (vllm-…

3df8ebb

…project#2480) Signed-off-by: Sy03 <1370724210@qq.com>

linyueqian merged 2 commits into vllm-project:main from Sy0307:sy03/issue-2439-qwen3-tts-streaming-fix
