
[Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor #2277

Merged
david6666666 merged 1 commit into vllm-project:main from linyueqian:fix/qwen3-tts-code-predictor-hf-numerics
Mar 28, 2026

Conversation

@linyueqian
Collaborator

@linyueqian linyueqian commented Mar 27, 2026

Summary

Root cause: Two sources of numerical divergence compound across 15 autoregressive code predictor steps:

  1. vLLM fused kernels (QKVParallelLinear, fused RMSNorm, get_rope) differ numerically from HF's pure-PyTorch implementations
  2. torch.compile's epilogue fusion merges the float32→bf16 casts after RMSNorm/RoPE into the next kernel, losing the precision HF's code relies on (see the sketch below)
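
For context, a minimal sketch of the float32-variance pattern that HF's Qwen3TTSRMSNorm follows (modeled on HF's standard RMSNorm; not the exact upstream source):

```python
import torch
import torch.nn as nn

class _RMSNorm(nn.Module):
    """Sketch of the HF-style RMSNorm: variance computed in float32."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.to(torch.float32)                 # upcast before the reduction
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x.to(input_dtype)  # explicit downcast afterward
```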

Key changes:

  • _RMSNorm: float32 variance computation matching HF's Qwen3TTSRMSNorm
  • _RotaryEmbedding: float32 cos/sin matching HF's torch.autocast(enabled=False)
  • Separate nn.Linear for q/k/v/o projections (no fused QKV packing)
  • Separate nn.Linear for gate/up/down MLP (no fused gate_up packing)
  • torch.compile with epilogue_fusion=False, plus CUDA graph capture per batch-size bucket (sketch below)
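
A minimal sketch of the float32 cos/sin RoPE and the compile setting from the bullets above (class name and shapes hypothetical; epilogue_fusion is an inductor config key passed via options):

```python
import torch
import torch.nn as nn

class _RotaryEmbedding(nn.Module):
    """Sketch: RoPE cos/sin kept in float32, shielded from autocast."""
    def __init__(self, head_dim: int, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (
            base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
        )
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, positions: torch.Tensor):
        # enabled=False shields cos/sin from an outer bf16 autocast region.
        with torch.autocast(device_type=positions.device.type, enabled=False):
            freqs = positions.float()[:, None] * self.inv_freq[None, :]
            emb = torch.cat((freqs, freqs), dim=-1)
            return emb.cos(), emb.sin()

# Disabling inductor epilogue fusion keeps the fp32->bf16 casts after
# norm/RoPE as standalone kernels instead of folding them into the next matmul.
rope = _RotaryEmbedding(head_dim=64)
rope_compiled = torch.compile(rope, options={"epilogue_fusion": False})
cos, sin = rope_compiled(torch.arange(16))
```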

Results (single prompt, temperature=0.9, top_k=50, seed=42):

| Configuration | UTMOS ↑ | RTF |
|---|---|---|
| HF reference (5 runs mean) | 4.26 | — |
| This PR | 4.02 | ~same as main |
| Current main | 3.10 | baseline |

Test Plan

  • Qwen3-TTS e2e offline inference (CustomVoice)
  • UTMOS evaluation via tarepan/SpeechMOS:v1.2.0 (utmos22_strong)
  • RTF comparison via gradio demo (serving)
  • Multi-prompt benchmark (awaiting reporter's test prompts)

🤖 Generated with Claude Code


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2f2fd3671


Comment on lines +76 to +77:

```python
rope_theta = getattr(config, "rope_theta", 10000.0)
inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
```

P2: Preserve configured RoPE scaling in code predictor

This hardcodes default RoPE frequencies from rope_theta and drops the previous get_rope(..., rope_parameters=...) behavior, so checkpoints that set non-default RoPE (rope_scaling/rope_parameters, e.g. linear/dynamic/yarn/llama3) will now get incorrect cos/sin values for every decoder layer. In those configurations, positional encoding no longer matches the model config, which can degrade or destabilize autoregressive code prediction quality.
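
One hedged way to preserve a configured scaling factor, sketched for the linear case only (attribute names follow common HF conventions; dynamic/yarn/llama3 would each need their own branch, and the config stand-in here is hypothetical):

```python
import torch
from types import SimpleNamespace

# Hypothetical config stand-in; real checkpoints carry these fields.
config = SimpleNamespace(rope_theta=10000.0,
                         rope_scaling={"rope_type": "linear", "factor": 4.0})
head_dim = 64

rope_theta = getattr(config, "rope_theta", 10000.0)
inv_freq = 1.0 / (
    rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
)
scaling = getattr(config, "rope_scaling", None)
if scaling is not None:
    rope_type = scaling.get("rope_type", scaling.get("type"))
    if rope_type == "linear":
        # Linear scaling divides positions by `factor`, which is equivalent
        # to shrinking inv_freq by the same factor.
        inv_freq = inv_freq / scaling["factor"]
    else:
        raise NotImplementedError(f"RoPE scaling type {rope_type!r} needs its own handling")
```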


@hsliuustc0106 added the ready label (triggers buildkite CI) Mar 27, 2026
@linyueqian force-pushed the fix/qwen3-tts-code-predictor-hf-numerics branch from a2f2fd3 to f7934eb on March 28, 2026 01:21
…ics in code predictor

Replace vLLM's fused kernels (QKVParallelLinear, MergedColumnParallelLinear,
RowParallelLinear, fused RMSNorm, get_rope) with plain PyTorch equivalents
that match the HuggingFace reference numerics exactly:

- _RMSNorm: float32 variance computation matching HF's Qwen3TTSRMSNorm
- _RotaryEmbedding: float32 cos/sin with torch.autocast(enabled=False)
- Separate nn.Linear for q/k/v/o projections (no fused QKV packing)
- Separate nn.Linear for gate/up/down MLP (no fused gate_up packing)
- torch.compile with epilogue_fusion=False to preserve float32 precision
  in RMSNorm/RoPE while still fusing linear layers and SDPA
- CUDA graph capture per batch-size bucket for launch overhead reduction

The re-prefill architecture, pre-allocated buffers, projection caching,
and inline sampling are preserved.

UTMOS: 4.02 (up from 3.10 on main, HF reference ~4.26)
RTF: comparable to main

Fixes vllm-project#2274

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian force-pushed the fix/qwen3-tts-code-predictor-hf-numerics branch from f7934eb to 8341879 on March 28, 2026 01:51
@linyueqian
Collaborator Author

Benchmark Results (updated fix: torch.compile + epilogue_fusion=False + CUDA graphs)

Quality — 18 prompts, temperature=0.9, top_k=50, UTMOS via tarepan/SpeechMOS:v1.2.0 (utmos22_strong)

| Metric | PR #2277 | Issue reporter (main) | HF reference (from issue) |
|---|---|---|---|
| Mean UTMOS | 4.20 | ~2.66 | ~4.26 |
| Median UTMOS | 4.42 | — | — |
| Min UTMOS | 3.06 | — | — |
| Max UTMOS | 4.54 | — | — |

Per-prompt breakdown

| # | UTMOS | RTF | Duration | Prompt |
|---|---|---|---|---|
| 0 | 3.82 | 0.61 | 6.0s | 其实我真的有发现,我是一个特别善于观察别人情绪的人。 ("Actually, I've really noticed that I'm someone who is especially good at reading other people's emotions.") |
| 1 | 4.50 | 0.62 | 4.7s | The quick brown fox jumps over the lazy dog... |
| 2 | 4.33 | 0.62 | 4.8s | Hello, this is a test of the text to speech... |
| 3 | 4.54 | 0.60 | 7.0s | I really enjoy spending time outdoors... |
| 4 | 3.64 | 0.57 | 6.7s | Technology has completely transformed... |
| 5 | 4.49 | 0.60 | 8.2s | Can you believe how fast this year has gone... |
| 6 | 4.50 | 0.61 | 6.6s | The restaurant on the corner makes the best... |
| 7 | 4.40 | 0.61 | 7.6s | Please make sure to bring your umbrella... |
| 8 | 4.48 | 0.60 | 8.2s | I have been working on this project... |
| 9 | 4.49 | 0.61 | 7.4s | She walked into the room with a big smile... |
| 10 | 3.42 | 0.62 | 5.7s | 我今天心情特别好,因为终于完成了一个很重要的项目。 ("I'm in a great mood today because I finally finished a very important project.") |
| 11 | 3.06 | 0.62 | 5.0s | 每次回到家乡,我都会感到一种说不出的温暖和幸福。 ("Every time I return to my hometown, I feel an indescribable warmth and happiness.") |
| 12 | 3.83 | 0.63 | 4.7s | Good morning everyone, I hope you all had... |
| 13 | 4.28 | 0.56 | 9.4s | The sunset over the ocean was absolutely... |
| 14 | 4.48 | 0.52 | 6.2s | I would like to schedule a meeting... |
| 15 | 4.42 | 0.49 | 7.8s | Learning a new language takes patience... |
| 16 | 4.51 | 0.52 | 6.1s | The children were playing in the park... |
| 17 | 4.41 | 0.52 | 9.9s | Artificial intelligence is revolutionizing... |

Performance — vllm bench serve, 10 prompts, concurrency=1

| Metric | Value |
|---|---|
| Mean RTF | 0.52 |
| Median RTF | 0.50 |
| Mean TTFP (time to first packet) | 164 ms |
| Mean E2EL (end-to-end latency) | 31.2 s |

Summary

  • UTMOS 4.20 — close to the HF reference (~4.26), up from ~2.66 on current main
  • RTF 0.50–0.52 — no regression vs main
  • TTFP 164 ms — low time-to-first-packet latency

@david6666666
Collaborator

LGTM

@david6666666 david6666666 merged commit f55ea28 into vllm-project:main Mar 28, 2026
7 of 8 checks passed
@david6666666
Collaborator

@yenuo26 need to add a test later, thx.

@yenuo26
Collaborator

yenuo26 commented Mar 28, 2026

> @yenuo26 need to add a test later, thx.

Yes, I have aligned with @amy-why-3459.
In the short term, we will increase test-case coverage.
In the long term, we will add precision benchmarks to cover more scenarios (#2284).

@Sy0307
Contributor

Sy0307 commented Mar 28, 2026

Thanks for the fix! Two follow-ups:

1. Precision test

I'll add a precision regression test for the code predictor in Qwen3_tts/Qwen3_omni. I flagged this same torch.compile precision issue back in #2019; an automated check would have caught #2274 earlier.

2. Are the fused Linear replacements necessary?

The root cause is RMSNorm/RoPE float32 precision loss + epilogue fusion — both fixed by _RMSNorm + _RotaryEmbedding + epilogue_fusion=False.

But this PR also replaces QKVParallelLinear and MergedColumnParallelLinear with separate nn.Linear layers. Fused QKV packing is mathematically equivalent (concat weights → single matmul → split), so it shouldn't affect precision; see the sketch below.
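
A quick numeric check of that equivalence (shapes illustrative; CPU float32):

```python
import torch

# hidden=64, per-projection out=64; each fused output column is the same
# dot product as in the separate matmuls, so results should agree.
torch.manual_seed(0)
x = torch.randn(2, 8, 64)
wq, wk, wv = (torch.randn(64, 64) for _ in range(3))

fused = x @ torch.cat([wq, wk, wv], dim=1)   # one matmul over packed weights
q, k, v = fused.split(64, dim=-1)            # split back into q/k/v

assert torch.allclose(q, x @ wq)
assert torch.allclose(k, x @ wk)
assert torch.allclose(v, x @ wv)
```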

Removing them drops:

  • TP support plumbing
  • The load_weights stacked params mapping

I'd like to verify if keeping fused Linears (only replacing norm + rope) gives the same quality — if so, a smaller diff would be preferable.

@linyueqian
Collaborator Author

Good point, fused QKV shouldn't affect precision. I swapped everything during debugging to isolate the root cause. Let me know what you find!

@JuanPZuluaga
Contributor

@linyueqian thanks for the fix! I closed #2275

skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Mar 31, 2026
…ics in code predictor (vllm-project#2277)

Signed-off-by: linyueqian <linyueqian@outlook.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…ics in code predictor (vllm-project#2277)

Signed-off-by: linyueqian <linyueqian@outlook.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…ics in code predictor (vllm-project#2277)

Signed-off-by: linyueqian <linyueqian@outlook.com>

Labels

ready (label to trigger buildkite CI)


Development

Successfully merging this pull request may close these issues.

[Bug]: [Qwen3-TTS] Code predictor re-prefill (PR #1617) causes severe audio quality regression

6 participants