[Feat][Qwen3TTS][Code2wav] triton SnakeBeta and Cuda Graph by JuanPZuluaga · Pull Request #1797 · vllm-project/vllm-omni

JuanPZuluaga · 2026-03-10T20:07:37Z


Purpose

Code2Wav decodes codec frames into audio using a convolution-based pipeline with 29 SnakeBeta activation layers. Each SnakeBeta executes 4 separate elementwise GPU kernels (exp, sin, pow, add), creating ~116 small kernel launches per forward pass. This makes Code2Wav a small bottleneck under high concurrent load, so we introduce two optimizations:

Fused Triton SnakeBeta kernel: replaces 4 elementwise PyTorch ops with a single Triton kernel that reads and writes memory once. Auto-detected at runtime — uses Triton when available on CUDA, falls back to eager PyTorch otherwise. Zero configuration needed.
Smart CUDA graph capture sizes: instead of a hardcoded list [25, 50, 100, 150, 200, 250, 300], capture sizes are computed dynamically from the streaming config (codec_chunk_frames, codec_left_context_frames). This ensures exact graph hits for streaming chunk sizes (e.g., 33 and 58 for c=33/ctx=25) and includes power-of-2 small sizes [2, 4, 8, 16, 32, 64] aligned with the dynamic IC sizing in [feat][Qwen3TTS] Simple dynamic TTFA based on Code2Wav load #1714.

The capture size computation also generates bucket sizes for variable-length last chunks, ensuring high graph hit rate across all decode calls.
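The sizing idea described above can be sketched in plain Python. This is a minimal illustration, not the PR's `compute_capture_sizes`: the function names, the `bucket_step` granularity, and the "round up to the next captured size, otherwise run eagerly" lookup are all assumptions made for the example.

```python
from bisect import bisect_left
from typing import List, Optional

# Small power-of-2 sizes listed in the PR description.
SMALL_POW2 = (2, 4, 8, 16, 32, 64)


def sketch_capture_sizes(
    chunk_frames: int,
    left_context_frames: int,
    bucket_step: int = 8,  # assumed bucket granularity, not taken from the PR
) -> List[int]:
    """Hypothetical stand-in for the PR's capture-size computation."""
    full = chunk_frames + left_context_frames
    sizes = set(SMALL_POW2)
    # Exact hits for steady-state streaming: a chunk without and with left context.
    sizes.update((chunk_frames, full))
    # Buckets for a shorter final chunk (left context plus a partial chunk).
    sizes.update(range(left_context_frames + bucket_step, full, bucket_step))
    return sorted(sizes)


def pick_capture_size(num_frames: int, sizes: List[int]) -> Optional[int]:
    """Smallest captured size >= num_frames, or None to signal an eager decode."""
    i = bisect_left(sizes, num_frames)
    return sizes[i] if i < len(sizes) else None


sizes = sketch_capture_sizes(chunk_frames=33, left_context_frames=25)
print(sizes)
print(pick_capture_size(58, sizes), pick_capture_size(500, sizes))
```

With `codec_chunk_frames=33` and `codec_left_context_frames=25`, the sketch includes 33 and 58 as exact sizes, matching the example in the description; a request longer than every captured size returns `None`, which a caller could treat as "skip the graph and decode eagerly".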

WIP: batched decoding.

Test Plan

python -m pytest tests/model_executor/models/qwen3_tts/test_cuda_graph_decoder.py -v

Test Result

Benchmark Results

Metric	Concurrency	triton	main
TTFP (ms)	1	62.6	69.8
TTFP (ms)	2	91.2	95.5
TTFP (ms)	4	105.0	114.9
TTFP (ms)	6	117.8	147.0
TTFP (ms)	8	138.5	159.7
TTFP (ms)	10	499.5	562.8
E2E (ms)	1	1227.5	1248.4
E2E (ms)	2	1390.2	1447.2
E2E (ms)	4	1564.7	1665.2
E2E (ms)	6	1770.8	2002.2
E2E (ms)	8	1895.9	2236.1
E2E (ms)	10	2300.8	2591.4
RTF	1	0.217	0.221
RTF	2	0.246	0.253
RTF	4	0.274	0.295
RTF	6	0.314	0.356
RTF	8	0.338	0.386
RTF	10	0.407	0.465
Throughput (audio-s/s)	1	4.61	4.53
Throughput (audio-s/s)	2	7.99	7.79
Throughput (audio-s/s)	4	14.00	13.06
Throughput (audio-s/s)	6	17.29	15.92
Throughput (audio-s/s)	8	21.72	19.37
Throughput (audio-s/s)	10	21.61	18.99

Improvement (triton vs main)

Metric	Concurrency	Improvement
TTFP	1	+10.3%
TTFP	2	+4.6%
TTFP	4	+8.6%
TTFP	6	+19.9%
TTFP	8	+13.3%
TTFP	10	+11.2%
E2E	1	+1.7%
E2E	2	+3.9%
E2E	4	+6.0%
E2E	6	+11.6%
E2E	8	+15.2%
E2E	10	+11.2%
RTF	1	+1.6%
RTF	2	+2.8%
RTF	4	+7.1%
RTF	6	+11.8%
RTF	8	+12.6%
RTF	10	+12.5%
Plot saved to vllm_omni/results/comparison.png

The plot shows that TTFP, E2E latency, RTF, and throughput all improve at every tested concurrency level.
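The PR does not state how the improvement percentages were computed. They are consistent with a relative reduction versus main for lower-is-better metrics, which the following sketch reproduces for a few rows of the benchmark table (the function name is hypothetical):

```python
def relative_improvement(main_value: float, pr_value: float) -> float:
    """Percent reduction relative to main, for lower-is-better metrics."""
    return (main_value - pr_value) / main_value * 100.0


# (metric, concurrency, main, triton) taken from the benchmark table.
rows = [
    ("TTFP", 1, 69.8, 62.6),
    ("TTFP", 6, 147.0, 117.8),
    ("E2E", 8, 2236.1, 1895.9),
]
for metric, concurrency, main_value, triton_value in rows:
    pct = relative_improvement(main_value, triton_value)
    print(f"{metric} @ {concurrency}: +{pct:.1f}%")
```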

some audio files:

audio from main: sample_2_stream_false.wav
audio from this PR: sample_2_stream_false.wav

Config YAML

relevant params changed in YAML:

  - stage_id: 1
    stage_type: llm
    runtime:
      devices: "0"
      max_batch_size: 8
....
    max_inflight: 8
....
        codec_chunk_frames: 33
        codec_left_context_frames: 25
        initial_codec_chunk_frames: 2


Signed-off-by: pablo <pablo@agigo.ai>


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2fc84a4135


chatgpt-codex-connector · 2026-03-10T20:14:08Z

+        if hidden_states.is_cuda and self._init_triton():
+            return self._triton_forward(hidden_states)


Add eager fallback around Triton execution failures

The CUDA path now calls _triton_forward whenever _init_triton() returns true, but there is no runtime fallback if Triton kernel compilation or launch fails. In environments where Triton imports successfully but cannot execute (for example unsupported GPU/driver combinations or Triton runtime incompatibilities), this will raise and break decoding instead of preserving the prior eager behavior, so requests can fail entirely rather than degrade gracefully.
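One way to get the graceful degradation described above is to wrap the fast path, log the first failure with its traceback, and disable further attempts. The sketch below is a generic illustration with hypothetical names; it is not the code merged in this PR and stands in plain callables for the Triton and eager forwards.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


class FastPathWithFallback:
    """Try a fast implementation; on the first failure, switch to eager for good."""

    def __init__(self, fast: Callable[[Any], Any], eager: Callable[[Any], Any]) -> None:
        self._fast = fast
        self._eager = eager
        self._fast_disabled = False

    def __call__(self, x: Any) -> Any:
        if not self._fast_disabled:
            try:
                return self._fast(x)
            except Exception:
                # Log once with the traceback, then stop retrying the fast path.
                logger.warning(
                    "Fast path failed; using the eager path from now on",
                    exc_info=True,
                )
                self._fast_disabled = True
        return self._eager(x)
```

Disabling the fast path after the first failure keeps a persistent problem from raising and being logged on every call, while the warning keeps the failure visible instead of silently swallowing it.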


chatgpt-codex-connector · 2026-03-10T20:14:08Z

+        try:
+            import triton
+            import triton.language as tl
+        except ImportError:
+            return False


Memoize Triton-unavailable state after import failure

If Triton is not installed, _init_triton() returns False but leaves _triton_kernel as None, so every CUDA forward re-attempts import triton and pays repeated ImportError costs. Because SnakeBeta is called many times per decode, this repeated exception path can materially hurt throughput on non-Triton CUDA deployments; caching a negative detection result would keep fallback overhead to a one-time check.
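A minimal sketch of the negative-result caching suggested here, using `functools.lru_cache` so that a failed import is attempted only once per module name. The helper name is hypothetical and the PR may cache the result differently (for example with an instance or module flag).

```python
import functools
import importlib


@functools.lru_cache(maxsize=None)
def module_available(name: str) -> bool:
    """Return whether `name` can be imported; both outcomes are cached."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


# Later calls hit the cache instead of re-raising ImportError internally.
print(module_available("triton"))
print(module_available("triton"))
```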


Copilot

Pull request overview

This PR improves Qwen3-TTS decoder inference performance by adding a fused Triton implementation for the SnakeBeta activation and by making CUDA Graph capture sizing more adaptive to streaming/chunking configurations.

Changes:

Add a fused Triton kernel path for SnakeBeta (with eager fallback).
Extend CUDA graph enable/warmup plumbing to incorporate codec chunk/left-context sizes and compute better capture buckets.
Adjust Code2Wav CUDA graph enablement to pass chunk/left-context config directly; simplify per-request decode loop; add targeted tests.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
`vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py`	Adds Triton-accelerated `SnakeBeta` and extends decoder CUDA-graph enablement parameters.
`vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_code2wav.py`	Passes codec chunk/left-context config into decoder CUDA-graph warmup and simplifies decode iteration.
`vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py`	Adds adaptive `compute_capture_sizes`, refines warmup/capture behavior and decode fallback checks.
`tests/model_executor/models/qwen3_tts/test_cuda_graph_decoder.py`	Adds tests for `compute_capture_sizes` and Triton-vs-eager equivalence for `SnakeBeta`.


Copilot · 2026-03-10T23:40:59Z

        Applies the function to the input elementwise.
        SnakeBeta ∶= x + 1/b * sin^2 (xa)
        """
+        if hidden_states.is_cuda and self._init_triton():


The Triton fast path will run whenever hidden_states is CUDA, even when autograd is enabled / hidden_states.requires_grad=True. Since _triton_forward writes into a freshly allocated tensor and there’s no custom backward, this will silently break gradients. Please gate the Triton path behind not torch.is_grad_enabled() (or not hidden_states.requires_grad) and fall back to _eager_forward when gradients are needed (or implement a proper autograd.Function).

Suggested change

-        if hidden_states.is_cuda and self._init_triton():
+        # Use Triton fast path only when gradients are not needed to avoid
+        # silently breaking autograd. When autograd is enabled, fall back
+        # to the eager PyTorch implementation, which is fully differentiable.
+        if hidden_states.is_cuda and not torch.is_grad_enabled() and self._init_triton():

hsliuustc0106 · 2026-03-10T23:43:17Z

+        except ImportError:
+            return False
+
+        @triton.jit


@linyueqian @tzhouam where should we place triton kernels?

I think we should follow vLLM IR if we plan to introduce many triton kernels. As a light abstraction, can consider CustomOp, but vLLM is deprecating it.

I think inline is fine for now since it's the only triton kernel in the repo.


lishunyang12

the silent except: pass on the triton path could mask persistent failures


JuanPZuluaga · 2026-03-11T09:12:18Z

most things fixed.


hsliuustc0106

Gate Status

Check	Status
DCO / pre-commit / build	✅
Main CI	✅
AMD CI	❌ (may be unrelated)

Evidence ✅

Comprehensive benchmark data provided:

TTFP improvement: 4.6%–19.9%
E2E improvement: 1.7%–15.2%
Audio samples and comparison plot included

Code Quality ✅

Triton fallback is properly handled with logger.warning(..., exc_info=True) + disables future attempts
Dynamic capture size computation is clean
Test coverage for both features

Prior concern about silent exception handling is addressed. LGTM once AMD CI is investigated (likely unrelated to this PR).

linyueqian · 2026-03-18T16:38:42Z

I was thinking about whether any of these would help further:

Cache the exp() calls. Alpha and beta are frozen at inference time, no reason to recompute exp() in every kernel launch. Just precompute exp(alpha) and 1/(exp(beta)+eps) once after weight loading, store as buffers, and have the kernel load them directly. That's 2 transcendental ops saved per element across all 29 layers.
Block size cap. Any reason for capping at 1024? After upsampling T is usually in the thousands, might be worth trying 2048/4096 to reduce grid launches.
Have you compared against torch.compile? For a pure pointwise fusion like this, torch.compile(mode="reduce-overhead") on the eager forward might get you close with zero custom code. Would be interesting to see as a baseline.


JuanPZuluaga · 2026-03-19T07:20:32Z

@linyueqian thanks for the comments!

Cache the exp() calls. Alpha and beta are frozen at inference time, no reason to recompute exp() in every kernel launch. Just precompute exp(alpha) and 1/(exp(beta)+eps) once after weight loading, store as buffers, and have the kernel load them directly. That's 2 transcendental ops saved per element across all 29 layers.

Done. I added precompute_exp_cache(), which precomputes both values once and stores them as persistent buffers.
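The caching idea can be illustrated with a scalar reference in plain Python. It uses the formula from the docstring, x + 1/b * sin^2(x * a), with a = exp(alpha) and 1/b = 1/(exp(beta) + eps) as described in the review comment. The class name and the `eps` value are assumptions for this example; the real module works on per-channel tensors and registers buffers instead of Python floats.

```python
import math

EPS = 1e-9  # assumed value, for illustration only


def snake_beta(x: float, alpha: float, beta: float, eps: float = EPS) -> float:
    """Uncached reference: recomputes both exponentials on every call."""
    a = math.exp(alpha)
    inv_b = 1.0 / (math.exp(beta) + eps)
    return x + inv_b * math.sin(x * a) ** 2


class CachedSnakeBeta:
    """Same function, with the exponentials computed once at construction."""

    def __init__(self, alpha: float, beta: float, eps: float = EPS) -> None:
        self.exp_alpha = math.exp(alpha)
        self.inv_exp_beta = 1.0 / (math.exp(beta) + eps)

    def __call__(self, x: float) -> float:
        return x + self.inv_exp_beta * math.sin(x * self.exp_alpha) ** 2


act = CachedSnakeBeta(alpha=0.3, beta=-0.2)
print(act(1.5), snake_beta(1.5, 0.3, -0.2))
```

Because alpha and beta do not change at inference time, the cached form returns the same values while leaving only the sine, square, multiply, and add per element.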

Block size cap. Any reason for capping at 1024? After upsampling T is usually in the thousands, might be worth trying 2048/4096 to reduce grid launches.

Done. I raised _TRITON_MAX_BLOCK_T from 1024 to 4096 and already noticed a small improvement. Thanks.
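To see why the cap matters, the sketch below counts blocks along the time axis under one common tiling scheme: block size is the next power of two of T, clamped to the cap. Whether the PR's kernel picks its block size exactly this way is an assumption, and the helper names are hypothetical.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def blocks_along_t(t_len: int, max_block_t: int) -> int:
    """Number of blocks needed to cover t_len elements with a capped block size."""
    block_t = min(next_power_of_2(t_len), max_block_t)
    return -(-t_len // block_t)  # ceiling division


t_len = 6000  # example length "in the thousands" after upsampling
print(blocks_along_t(t_len, 1024))  # 6
print(blocks_along_t(t_len, 4096))  # 2
```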

Have you compared against torch.compile? For a pure pointwise fusion like this, torch.compile(mode="reduce-overhead") on the eager forward might get you close with zero custom code. Would be interesting to see as a baseline.

I benchmarked torch.compile(mode="reduce-overhead") against the Triton kernel internally. With the new caching and the larger block-size cap, the kernel is ~2x faster than torch.compile across all configs.

I am adding the new results after merging main:

// EDIT

I used this YAML -- relevant params changed in YAML:

  - stage_id: 0
      max_batch_size: 16
      max_num_batched_tokens: 4096
  - stage_id: 1
      max_batch_size: 16
....
    max_inflight: 16
....
        codec_chunk_frames: 25
        codec_left_context_frames: 25


linyueqian

lgtm

…ect#1797) Signed-off-by: pablo <pablo@agigo.ai> Signed-off-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: pablo <pablo@agigo.ai> Co-authored-by: JuanPZuluaga <juanz9312@gmal.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

pablo added 4 commits March 10, 2026 19:29

optimize cuda graph capture script for code2wav

60ffb77

Signed-off-by: pablo <pablo@agigo.ai>

add triton kernel for SnakeBeta Code2Wav

4b6a949

Signed-off-by: pablo <pablo@agigo.ai>

add tests for cuda graph and triton snakebeta

1b9094e

Signed-off-by: pablo <pablo@agigo.ai>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

2fc84a4

… feat/code2wav-batch-cuda-graph

JuanPZuluaga requested a review from hsliuustc0106 as a code owner March 10, 2026 20:07

JuanPZuluaga changed the title ~~[Feat][Qwen3TTS][Code2wav] cuda graph~~ [Feat][Qwen3TTS][Code2wav] triton SnakeBeta and Cuda Graph Mar 10, 2026

chatgpt-codex-connector Bot reviewed Mar 10, 2026


hsliuustc0106 requested a review from Copilot March 10, 2026 23:36

Copilot started reviewing on behalf of hsliuustc0106 March 10, 2026 23:37

Copilot AI reviewed Mar 10, 2026


hsliuustc0106 reviewed Mar 10, 2026


linyueqian mentioned this pull request Mar 11, 2026

[RFC]: TTS Development Roadmap - March 2026 #1795

Open

JuanPZuluaga added 3 commits March 11, 2026 06:46

add reviewers comments

1eefebf

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

c2329a9

… feat/code2wav-batch-cuda-graph

t_len constexpr

4ef54d1

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

lishunyang12 reviewed Mar 11, 2026


Comment thread vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py Outdated

JuanPZuluaga added 2 commits March 11, 2026 07:45

solve except:pass

adc00c4

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

simplify logic in sizes captured by cuda graph

596af6b

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

JuanPZuluaga mentioned this pull request Mar 11, 2026

[RFC]: Qwen3-TTS Production Ready - February Milestone #938

Open

JuanPZuluaga added 4 commits March 11, 2026 12:57

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

d7dd80f

… feat/code2wav-batch-cuda-graph

capture more graph

621df73

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

fc0af3a

… feat/code2wav-batch-cuda-graph

fix pre commit in async_omni

b7d51e6

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

gcanlin reviewed Mar 11, 2026


Comment thread vllm_omni/entrypoints/async_omni.py Outdated

merge main

856dd74

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

Gaohan123 added this to the v0.18.0 milestone Mar 12, 2026

JuanPZuluaga added 3 commits March 12, 2026 12:55

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

fa5899d

… feat/code2wav-batch-cuda-graph

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

7d719f1

… feat/code2wav-batch-cuda-graph

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

c45ea0e

… feat/code2wav-batch-cuda-graph

JuanPZuluaga and others added 3 commits March 13, 2026 12:51

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

6d8de1d

… feat/code2wav-batch-cuda-graph

Merge branch 'main' into feat/code2wav-batch-cuda-graph

60cc2c5

Merge branch 'main' into feat/code2wav-batch-cuda-graph

6913060

JuanPZuluaga mentioned this pull request Mar 16, 2026

[Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False #1913

Merged

5 tasks

JuanPZuluaga and others added 2 commits March 16, 2026 22:50

Merge branch 'main' into feat/code2wav-batch-cuda-graph

b4d415e

Merge branch 'main' into feat/code2wav-batch-cuda-graph

a4c06aa

hsliuustc0106 added the ready label to trigger buildkite CI label Mar 18, 2026

JuanPZuluaga added 2 commits March 18, 2026 12:10

Merge branch 'main' into feat/code2wav-batch-cuda-graph

37042bf

Merge branch 'main' into feat/code2wav-batch-cuda-graph

0ab4594

hsliuustc0106 reviewed Mar 18, 2026


linyueqian and others added 7 commits March 18, 2026 12:39

Merge branch 'main' into feat/code2wav-batch-cuda-graph

0bce467

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

d7556ea

… feat/code2wav-batch-cuda-graph

Merge branch 'feat/code2wav-batch-cuda-graph' of https://github.com/J…

6156b64

…uanPZuluaga/vllm-omni into feat/code2wav-batch-cuda-graph

cache exp call and increase block size cap

d165fde

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

remove 2 times called precompute cache

ad2af31

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

update docstring

5c56826

Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

cb95545

… feat/code2wav-batch-cuda-graph

JuanPZuluaga added 3 commits March 19, 2026 07:45

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

b586968

… feat/code2wav-batch-cuda-graph

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

cd18316

… feat/code2wav-batch-cuda-graph

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

63b665d

… feat/code2wav-batch-cuda-graph

linyueqian approved these changes Mar 19, 2026


hsliuustc0106 merged commit 81a90d2 into vllm-project:main Mar 20, 2026
7 checks passed

JuanPZuluaga deleted the feat/code2wav-batch-cuda-graph branch March 20, 2026 05:09

gcanlin mentioned this pull request Mar 30, 2026

[RFC]: Improving Qwen3-TTS Performance on NPU #2328

Open

9 tasks
