[Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor#2277
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2f2fd3671
```python
rope_theta = getattr(config, "rope_theta", 10000.0)
inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
```
Preserve configured RoPE scaling in code predictor
This hardcodes default RoPE frequencies from rope_theta and drops the previous get_rope(..., rope_parameters=...) behavior, so checkpoints that set non-default RoPE (rope_scaling/rope_parameters, e.g. linear/dynamic/yarn/llama3) will now get incorrect cos/sin values for every decoder layer. In those configurations, positional encoding no longer matches the model config, which can degrade or destabilize autoregressive code prediction quality.
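For illustration, a minimal sketch of how the configured scaling could be honored instead of hardcoded. Only the linear case is shown; `build_inv_freq` is a hypothetical helper, not code from this PR, and dynamic/yarn/llama3 would need their own branches (e.g. via HF's rope utilities or vLLM's `get_rope`):

```python
import torch

def build_inv_freq(config, head_dim: int) -> torch.Tensor:
    rope_theta = getattr(config, "rope_theta", 10000.0)
    inv_freq = 1.0 / (rope_theta ** (
        torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    scaling = getattr(config, "rope_scaling", None)
    if scaling and scaling.get("rope_type", scaling.get("type")) == "linear":
        # Linear scaling stretches positions by `factor`; dividing inv_freq
        # by the same factor is the equivalent frequency-side form.
        inv_freq = inv_freq / float(scaling["factor"])
    return inv_freq
```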
Force-pushed from a2f2fd3 to f7934eb
[Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor

Replace vLLM's fused kernels (`QKVParallelLinear`, `MergedColumnParallelLinear`, `RowParallelLinear`, fused `RMSNorm`, `get_rope`) with plain PyTorch equivalents that match the HuggingFace reference numerics exactly:

- `_RMSNorm`: float32 variance computation matching HF's `Qwen3TTSRMSNorm` (see the sketch after this message)
- `_RotaryEmbedding`: float32 cos/sin with `torch.autocast(enabled=False)`
- Separate `nn.Linear` for q/k/v/o projections (no fused QKV packing)
- Separate `nn.Linear` for gate/up/down MLP (no fused gate_up packing)
- `torch.compile` with `epilogue_fusion=False` to preserve float32 precision in RMSNorm/RoPE while still fusing linear layers and SDPA
- CUDA graph capture per batch-size bucket for launch overhead reduction

The re-prefill architecture, pre-allocated buffers, projection caching, and inline sampling are preserved.

UTMOS: 4.02 (up from 3.10 on main, HF reference ~4.26)
RTF: comparable to main

Fixes vllm-project#2274

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
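As a reference for the first bullet: the HF Qwen-style RMSNorm pattern computes the variance in float32 and casts back to the input dtype afterwards. A minimal sketch of that pattern (class name illustrative, not the PR's exact `_RMSNorm` code):

```python
import torch
import torch.nn as nn

class RMSNormF32(nn.Module):
    """RMSNorm with variance computed in float32, as in HF's Qwen3TTSRMSNorm."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.to(torch.float32)                 # upcast before reducing
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x.to(input_dtype)  # cast back, then scale
```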
Force-pushed from f7934eb to 8341879
Benchmark Results (updated fix: `torch.compile` + `epilogue_fusion=False` + CUDA graphs)

Quality — 18 prompts
| Metric | PR #2277 | Issue reporter (main) | HF reference (from issue) |
|---|---|---|---|
| Mean UTMOS | 4.20 | ~2.66 | ~4.26 |
| Median UTMOS | 4.42 | — | — |
| Min UTMOS | 3.06 | — | — |
| Max UTMOS | 4.54 | — | — |
Per-prompt breakdown
| # | UTMOS | RTF | Duration | Prompt |
|---|---|---|---|---|
| 0 | 3.82 | 0.61 | 6.0s | (zh) Actually, I've really noticed that I'm someone who is especially good at reading other people's emotions. |
| 1 | 4.50 | 0.62 | 4.7s | The quick brown fox jumps over the lazy dog... |
| 2 | 4.33 | 0.62 | 4.8s | Hello, this is a test of the text to speech... |
| 3 | 4.54 | 0.60 | 7.0s | I really enjoy spending time outdoors... |
| 4 | 3.64 | 0.57 | 6.7s | Technology has completely transformed... |
| 5 | 4.49 | 0.60 | 8.2s | Can you believe how fast this year has gone... |
| 6 | 4.50 | 0.61 | 6.6s | The restaurant on the corner makes the best... |
| 7 | 4.40 | 0.61 | 7.6s | Please make sure to bring your umbrella... |
| 8 | 4.48 | 0.60 | 8.2s | I have been working on this project... |
| 9 | 4.49 | 0.61 | 7.4s | She walked into the room with a big smile... |
| 10 | 3.42 | 0.62 | 5.7s | (zh) I'm in a great mood today because I finally finished a very important project. |
| 11 | 3.06 | 0.62 | 5.0s | (zh) Every time I return to my hometown, I feel an indescribable warmth and happiness. |
| 12 | 3.83 | 0.63 | 4.7s | Good morning everyone, I hope you all had... |
| 13 | 4.28 | 0.56 | 9.4s | The sunset over the ocean was absolutely... |
| 14 | 4.48 | 0.52 | 6.2s | I would like to schedule a meeting... |
| 15 | 4.42 | 0.49 | 7.8s | Learning a new language takes patience... |
| 16 | 4.51 | 0.52 | 6.1s | The children were playing in the park... |
| 17 | 4.41 | 0.52 | 9.9s | Artificial intelligence is revolutionizing... |
Performance — vllm bench serve, 10 prompts, concurrency=1
| Metric | Value |
|---|---|
| Mean RTF | 0.52 |
| Median RTF | 0.50 |
| Mean TTFP (time to first packet) | 164 ms |
| Mean E2EL (end-to-end latency) | 31.2 s |
Summary
- UTMOS 4.20 — matches HF reference (~4.26), up from ~2.66 on current main
- RTF 0.50–0.52 — no regression vs main
- TTFP 164 ms — low latency first packet
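For context on the fix named in the benchmark heading: disabling inductor's epilogue fusion keeps the float32 RMSNorm/RoPE ops out of lower-precision fused epilogues while matmuls and SDPA still compile. A minimal sketch of the compile call, where the `step` module is a stand-in rather than the PR's actual code predictor:

```python
import torch
import torch.nn as nn

step = nn.Linear(1024, 1024).cuda()  # stand-in for one code-predictor decode step

# epilogue_fusion=False is an inductor option; everything else still fuses.
compiled_step = torch.compile(step, options={"epilogue_fusion": False})
out = compiled_step(torch.randn(1, 1024, device="cuda"))
```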
LGTM

@yenuo26 need to add a test later, thx.

yes, I have aligned with @amy-why-3459
Thanks for the fix! Two follow-ups:

1. **Precision test.** I'll add a precision regression test for the code predictor in Qwen3_tts/Qwen3_omni. I flagged this same torch.compile precision issue back in #2019; an automated check would have caught #2274 earlier. (A sketch of such a test follows below.)

2. **Are the fused Linear replacements necessary?** The root cause is RMSNorm/RoPE float32 precision loss plus epilogue fusion, both fixed by the float32 `_RMSNorm`/`_RotaryEmbedding` and `epilogue_fusion=False`. But this PR also replaces the fused Linear layers (`QKVParallelLinear`, `MergedColumnParallelLinear`, `RowParallelLinear`) with separate `nn.Linear` modules. Removing them drops the packed QKV/gate_up GEMMs and the tensor-parallel sharding those layers provide.

I'd like to verify whether keeping the fused Linears (replacing only norm + RoPE) gives the same quality; if so, a smaller diff would be preferable.
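A hedged sketch of what that regression test could look like; fixture names and entry points here are hypothetical placeholders, not the repo's actual test API:

```python
import torch

def test_code_predictor_matches_hf(vllm_code_predictor, hf_code_predictor, sample_inputs):
    # Hypothetical fixtures: the same checkpoint loaded through vLLM and HF.
    with torch.no_grad():
        ours = vllm_code_predictor(sample_inputs)
        ref = hf_code_predictor(sample_inputs)
    # Divergence compounds across the ~15 autoregressive steps, so assert
    # numerical closeness of the outputs, not just final token ids.
    torch.testing.assert_close(ours.float(), ref.float(), rtol=1e-3, atol=1e-3)
```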
Good point, fused QKV shouldn't affect precision. I swapped everything during debugging to isolate the root cause. Let me know what you find!
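To make that point concrete, a small self-check (shapes illustrative) showing that a packed QKV weight yields the same projections as three separate matmuls, up to ordinary float tolerance:

```python
import torch

hidden, q_dim, kv_dim = 64, 64, 16
x = torch.randn(4, hidden)
wq = torch.randn(q_dim, hidden)
wk = torch.randn(kv_dim, hidden)
wv = torch.randn(kv_dim, hidden)

packed = torch.cat([wq, wk, wv], dim=0)  # packed QKV weight, as in QKVParallelLinear
q, k, v = (x @ packed.t()).split([q_dim, kv_dim, kv_dim], dim=-1)

assert torch.allclose(q, x @ wq.t(), atol=1e-5)
assert torch.allclose(k, x @ wk.t(), atol=1e-5)
assert torch.allclose(v, x @ wv.t(), atol=1e-5)
```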
@linyueqian thanks for the fix! I closed #2275
[Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor (vllm-project#2277) Signed-off-by: linyueqian <linyueqian@outlook.com>
Summary
- Replace vLLM fused layers (`QKVParallelLinear`, `MergedColumnParallelLinear`, `RowParallelLinear`, fused `RMSNorm`, `get_rope`) with plain PyTorch equivalents matching HF numerics
- `torch.compile(options={"epilogue_fusion": False})` + CUDA graphs to preserve speed

Root cause: Two sources of numerical divergence compound across 15 autoregressive code predictor steps:

1. vLLM's fused kernels (`QKVParallelLinear`, fused `RMSNorm`, `get_rope`) differ from HF's pure PyTorch implementations
2. torch.compile's epilogue fusion folds the float32 RMSNorm/RoPE computations into lower-precision fused kernels

Key changes:

- `_RMSNorm`: float32 variance computation matching HF's `Qwen3TTSRMSNorm`
- `_RotaryEmbedding`: float32 cos/sin matching HF's, computed with `torch.autocast(enabled=False)`
- Separate `nn.Linear` for q/k/v/o projections (no fused QKV packing)
- Separate `nn.Linear` for gate/up/down MLP (no fused gate_up packing)
- `torch.compile(epilogue_fusion=False)` + CUDA graphs per batch-size bucket (see the sketch after this description)

Results (single prompt, `temperature=0.9`, `top_k=50`, `seed=42`)

Test Plan

- UTMOS measured with `tarepan/SpeechMOS:v1.2.0` (`utmos22_strong`)

🤖 Generated with Claude Code
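A hedged sketch of the per-batch-size CUDA graph capture pattern referenced in the key changes; bucket sizes and the `step` module are illustrative, and real code typically warms up on a side stream before capture:

```python
import torch
import torch.nn as nn

step = nn.Linear(1024, 1024).cuda()          # stand-in for one decode step
graphs, static_io = {}, {}

for bs in (1, 2, 4, 8):                      # assumed batch-size buckets
    x = torch.zeros(bs, 1024, device="cuda")
    for _ in range(3):                        # warm up before capture
        step(x)
    torch.cuda.synchronize()
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = step(x)                           # capture into static buffers
    graphs[bs], static_io[bs] = g, (x, y)

def run(batch: torch.Tensor) -> torch.Tensor:
    x, y = static_io[batch.shape[0]]
    x.copy_(batch)                            # refill the captured input buffer
    graphs[batch.shape[0]].replay()
    return y.clone()
```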