[Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor#2277
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2f2fd3671
```python
rope_theta = getattr(config, "rope_theta", 10000.0)
inv_freq = 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
```
Preserve configured RoPE scaling in code predictor
This hardcodes default RoPE frequencies from rope_theta and drops the previous get_rope(..., rope_parameters=...) behavior, so checkpoints that set non-default RoPE (rope_scaling/rope_parameters, e.g. linear/dynamic/yarn/llama3) will now get incorrect cos/sin values for every decoder layer. In those configurations, positional encoding no longer matches the model config, which can degrade or destabilize autoregressive code prediction quality.
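For illustration, a minimal sketch of how the configured scaling could be honored instead of hardcoded. Only the linear case is shown; `build_inv_freq` is a hypothetical helper, not code from this PR, and dynamic/yarn/llama3 would need their own branches (e.g. via HF's rope utilities or vLLM's `get_rope`):

```python
import torch

def build_inv_freq(config, head_dim: int) -> torch.Tensor:
    rope_theta = getattr(config, "rope_theta", 10000.0)
    inv_freq = 1.0 / (rope_theta ** (
        torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    scaling = getattr(config, "rope_scaling", None)
    if scaling and scaling.get("rope_type", scaling.get("type")) == "linear":
        # Linear scaling stretches positions by `factor`; dividing inv_freq
        # by the same factor is the equivalent frequency-side form.
        inv_freq = inv_freq / float(scaling["factor"])
    return inv_freq
```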
Force-pushed from a2f2fd3 to f7934eb
[Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor

Replace vLLM's fused kernels (`QKVParallelLinear`, `MergedColumnParallelLinear`, `RowParallelLinear`, fused `RMSNorm`, `get_rope`) with plain PyTorch equivalents that match the HuggingFace reference numerics exactly:

- `_RMSNorm`: float32 variance computation matching HF's `Qwen3TTSRMSNorm` (see the sketch after this message)
- `_RotaryEmbedding`: float32 cos/sin with `torch.autocast(enabled=False)`
- Separate `nn.Linear` for q/k/v/o projections (no fused QKV packing)
- Separate `nn.Linear` for gate/up/down MLP (no fused gate_up packing)
- `torch.compile` with `epilogue_fusion=False` to preserve float32 precision in RMSNorm/RoPE while still fusing linear layers and SDPA
- CUDA graph capture per batch-size bucket for launch overhead reduction

The re-prefill architecture, pre-allocated buffers, projection caching, and inline sampling are preserved.

UTMOS: 4.02 (up from 3.10 on main, HF reference ~4.26)
RTF: comparable to main

Fixes vllm-project#2274

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
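As a reference for the first bullet: the HF Qwen-style RMSNorm pattern computes the variance in float32 and casts back to the input dtype afterwards. A minimal sketch of that pattern (class name illustrative, not the PR's exact `_RMSNorm` code):

```python
import torch
import torch.nn as nn

class RMSNormF32(nn.Module):
    """RMSNorm with variance computed in float32, as in HF's Qwen3TTSRMSNorm."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.to(torch.float32)                 # upcast before reducing
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x.to(input_dtype)  # cast back, then scale
```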
Force-pushed from f7934eb to 8341879
Benchmark Results (updated fix: `torch.compile` + `epilogue_fusion=False` + CUDA graphs)

Quality — 18 prompts
| Metric | PR #2277 | Issue reporter (main) | HF reference (from issue) |
|---|---|---|---|
| Mean UTMOS | 4.20 | ~2.66 | ~4.26 |
| Median UTMOS | 4.42 | — | — |
| Min UTMOS | 3.06 | — | — |
| Max UTMOS | 4.54 | — | — |
Per-prompt breakdown
| # | UTMOS | RTF | Duration | Prompt |
|---|---|---|---|---|
| 0 | 3.82 | 0.61 | 6.0s | (zh) Actually, I've really noticed that I'm someone who is especially good at reading other people's emotions. |
| 1 | 4.50 | 0.62 | 4.7s | The quick brown fox jumps over the lazy dog... |
| 2 | 4.33 | 0.62 | 4.8s | Hello, this is a test of the text to speech... |
| 3 | 4.54 | 0.60 | 7.0s | I really enjoy spending time outdoors... |
| 4 | 3.64 | 0.57 | 6.7s | Technology has completely transformed... |
| 5 | 4.49 | 0.60 | 8.2s | Can you believe how fast this year has gone... |
| 6 | 4.50 | 0.61 | 6.6s | The restaurant on the corner makes the best... |
| 7 | 4.40 | 0.61 | 7.6s | Please make sure to bring your umbrella... |
| 8 | 4.48 | 0.60 | 8.2s | I have been working on this project... |
| 9 | 4.49 | 0.61 | 7.4s | She walked into the room with a big smile... |
| 10 | 3.42 | 0.62 | 5.7s | (zh) I'm in a great mood today because I finally finished a very important project. |
| 11 | 3.06 | 0.62 | 5.0s | (zh) Every time I return to my hometown, I feel an indescribable warmth and happiness. |
| 12 | 3.83 | 0.63 | 4.7s | Good morning everyone, I hope you all had... |
| 13 | 4.28 | 0.56 | 9.4s | The sunset over the ocean was absolutely... |
| 14 | 4.48 | 0.52 | 6.2s | I would like to schedule a meeting... |
| 15 | 4.42 | 0.49 | 7.8s | Learning a new language takes patience... |
| 16 | 4.51 | 0.52 | 6.1s | The children were playing in the park... |
| 17 | 4.41 | 0.52 | 9.9s | Artificial intelligence is revolutionizing... |
Performance — vllm bench serve, 10 prompts, concurrency=1
| Metric | Value |
|---|---|
| Mean RTF | 0.52 |
| Median RTF | 0.50 |
| Mean TTFP (time to first packet) | 164 ms |
| Mean E2EL (end-to-end latency) | 31.2 s |
Summary
- UTMOS 4.20 — matches HF reference (~4.26), up from ~2.66 on current main
- RTF 0.50–0.52 — no regression vs main
- TTFP 164 ms — low latency first packet
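For context on the fix named in the benchmark heading: disabling inductor's epilogue fusion keeps the float32 RMSNorm/RoPE ops out of lower-precision fused epilogues while matmuls and SDPA still compile. A minimal sketch of the compile call, where the `step` module is a stand-in rather than the PR's actual code predictor:

```python
import torch
import torch.nn as nn

step = nn.Linear(1024, 1024).cuda()  # stand-in for one code-predictor decode step

# epilogue_fusion=False is an inductor option; everything else still fuses.
compiled_step = torch.compile(step, options={"epilogue_fusion": False})
out = compiled_step(torch.randn(1, 1024, device="cuda"))
```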
LGTM

@yenuo26 need to add a test later, thx.

yes, I have aligned with @amy-why-3459
Thanks for the fix! Two follow-ups:

1. **Precision test.** I'll add a precision regression test for the code predictor in Qwen3_tts/Qwen3_omni. I flagged this same torch.compile precision issue back in #2019; an automated check would have caught #2274 earlier. (A sketch of such a test follows below.)

2. **Are the fused Linear replacements necessary?** The root cause is RMSNorm/RoPE float32 precision loss plus epilogue fusion, both fixed by the float32 `_RMSNorm`/`_RotaryEmbedding` and `epilogue_fusion=False`. But this PR also replaces the fused Linear layers (`QKVParallelLinear`, `MergedColumnParallelLinear`, `RowParallelLinear`) with separate `nn.Linear` modules. Removing them drops the packed QKV/gate_up GEMMs and the tensor-parallel sharding those layers provide.

I'd like to verify whether keeping the fused Linears (replacing only norm + RoPE) gives the same quality; if so, a smaller diff would be preferable.
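A hedged sketch of what that regression test could look like; fixture names and entry points here are hypothetical placeholders, not the repo's actual test API:

```python
import torch

def test_code_predictor_matches_hf(vllm_code_predictor, hf_code_predictor, sample_inputs):
    # Hypothetical fixtures: the same checkpoint loaded through vLLM and HF.
    with torch.no_grad():
        ours = vllm_code_predictor(sample_inputs)
        ref = hf_code_predictor(sample_inputs)
    # Divergence compounds across the ~15 autoregressive steps, so assert
    # numerical closeness of the outputs, not just final token ids.
    torch.testing.assert_close(ours.float(), ref.float(), rtol=1e-3, atol=1e-3)
```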
Good point, fused QKV shouldn't affect precision. I swapped everything during debugging to isolate the root cause. Let me know what you find!
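To make that point concrete, a small self-check (shapes illustrative) showing that a packed QKV weight yields the same projections as three separate matmuls, up to ordinary float tolerance:

```python
import torch

hidden, q_dim, kv_dim = 64, 64, 16
x = torch.randn(4, hidden)
wq = torch.randn(q_dim, hidden)
wk = torch.randn(kv_dim, hidden)
wv = torch.randn(kv_dim, hidden)

packed = torch.cat([wq, wk, wv], dim=0)  # packed QKV weight, as in QKVParallelLinear
q, k, v = (x @ packed.t()).split([q_dim, kv_dim, kv_dim], dim=-1)

assert torch.allclose(q, x @ wq.t(), atol=1e-5)
assert torch.allclose(k, x @ wk.t(), atol=1e-5)
assert torch.allclose(v, x @ wv.t(), atol=1e-5)
```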
@linyueqian thanks for the fix! I closed #2275
[Qwen3TTS][Bugfix] Replace vLLM fused layers with HF-compatible numerics in code predictor (vllm-project#2277) Signed-off-by: linyueqian <linyueqian@outlook.com>
Summary
- Replace vLLM fused layers (`QKVParallelLinear`, `MergedColumnParallelLinear`, `RowParallelLinear`, fused `RMSNorm`, `get_rope`) with plain PyTorch equivalents matching HF numerics
- `torch.compile(options={"epilogue_fusion": False})` + CUDA graphs to preserve speed

Root cause: Two sources of numerical divergence compound across 15 autoregressive code predictor steps:

1. vLLM's fused kernels (`QKVParallelLinear`, fused `RMSNorm`, `get_rope`) differ from HF's pure PyTorch implementations
2. torch.compile's epilogue fusion folds the float32 RMSNorm/RoPE computations into lower-precision fused kernels

Key changes:

- `_RMSNorm`: float32 variance computation matching HF's `Qwen3TTSRMSNorm`
- `_RotaryEmbedding`: float32 cos/sin matching HF's, computed with `torch.autocast(enabled=False)`
- Separate `nn.Linear` for q/k/v/o projections (no fused QKV packing)
- Separate `nn.Linear` for gate/up/down MLP (no fused gate_up packing)
- `torch.compile(epilogue_fusion=False)` + CUDA graphs per batch-size bucket (see the sketch after this description)

Results (single prompt, `temperature=0.9`, `top_k=50`, `seed=42`)

Test Plan

- UTMOS measured with `tarepan/SpeechMOS:v1.2.0` (`utmos22_strong`)

🤖 Generated with Claude Code
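A hedged sketch of the per-batch-size CUDA graph capture pattern referenced in the key changes; bucket sizes and the `step` module are illustrative, and real code typically warms up on a side stream before capture:

```python
import torch
import torch.nn as nn

step = nn.Linear(1024, 1024).cuda()          # stand-in for one decode step
graphs, static_io = {}, {}

for bs in (1, 2, 4, 8):                      # assumed batch-size buckets
    x = torch.zeros(bs, 1024, device="cuda")
    for _ in range(3):                        # warm up before capture
        step(x)
    torch.cuda.synchronize()
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        y = step(x)                           # capture into static buffers
    graphs[bs], static_io[bs] = g, (x, y)

def run(batch: torch.Tensor) -> torch.Tensor:
    x, y = static_io[batch.shape[0]]
    x.copy_(batch)                            # refill the captured input buffer
    graphs[batch.shape[0]].replay()
    return y.clone()
```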