[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture #2910
Merged
linyueqian merged 1 commit on Apr 22, 2026
Collaborator: fix dco please
Stage 1 (Code2Wav) fails startup with cudagraph_mode: FULL on
Qwen3-TTS-12Hz-0.6B-Base:
RuntimeError: Cannot prepare for replay during capturing stage.
Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive
vLLM's FULL-mode warmup opens an outer CUDA stream capture and calls
the model's forward pass; inside that forward, CUDAGraphDecoderWrapper
attempts to replay its inner graph at line 151. PyTorch disallows
replaying a captured graph while another capture is active on the
stream.
Add a guard in decode() that falls back to eager when
torch.cuda.is_current_stream_capturing() is true. is_current_stream_capturing()
returns False outside warmup, so this is a zero-cost runtime change — the
graph fast path is hit for all normal inference.
Tested against:
- RTX 6000 Ada (48 GB) + RTX 6000 Pro Blackwell (96 GB, Nebius)
- Qwen3-TTS-12Hz-0.6B-Base
- vllm-omni 0.18.0 + vllm 0.18.0
- Both async_chunk=true and async_chunk=false configs
Blackwell measured +5 % throughput at N=12 with Stage 1 FULL graphs vs
eager; Ada gain is within noise (GPU already at 77 % util).
Backwards compatible: zero behaviour change when enforce_eager=true
(existing default for Stage 1) or when the outer capture is not active.
Signed-off-by: ChipMates <strasserm@chipmates.ai>
a67e304 to b5561a3
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request, Apr 23, 2026: …vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request, May 1, 2026: …vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request, May 12, 2026: …vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request, May 28, 2026: …vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>
[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture
Summary
CUDAGraphDecoderWrapper.decode() can fail to replay when called from inside an outer CUDA stream capture, specifically during vLLM's cudagraph_mode: FULL warmup on Stage 1 (Code2Wav). This PR adds a 1-line guard that falls back to eager when a stream capture is already active, unblocking enforce_eager: false on Stage 1 for Qwen3-TTS.
Repro
- Model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
- Versions: vllm-omni 0.18.0 + vllm 0.18.0
- Command: vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-Base --omni --stage-configs-path <cfg>
Crash during startup:
RuntimeError: Cannot prepare for replay during capturing stage. Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive
Stack ends at cuda_graph_decoder_wrapper.py:151 → self.graphs[padded_size].replay().
Root cause
vLLM's FULL-mode warmup opens an outer CUDA stream capture on the compute stream and then calls the model's forward pass inside that capture. Qwen3TTSCode2Wav internally invokes CUDAGraphDecoderWrapper.decode() for the speech tokenizer decoder, which (once warmed up) replays its inner captured graph. PyTorch does not allow replaying an inner captured graph while another capture is active on the same stream.
Neither #2376 (Qwen3-Omni-7B scope, different code path) nor #2868 (fires earlier at make_omni_output, tuple unpacking) addresses this specific failure. Both must be in place for the graph capture path to even reach decode() in some configs.
Fix
Add a torch.cuda.is_current_stream_capturing() guard at the top of decode(), after the existing early-return for enabled/warmed_up/shape. When an outer capture is active, fall back to eager; the outer capture then completes cleanly, and all future decode() calls (at inference time, outside any capture) hit the graph fast path as before.
Runtime impact
Zero at inference time. torch.cuda.is_current_stream_capturing() returns False whenever a stream capture is not active, which is the entire inference-time path. The fallback only activates during the startup warmup capture window, which is also the only window where it needs to activate.
Backwards compatibility
- No behaviour change when enforce_eager=true (today's default for Stage 1 in the shipped qwen3_tts_no_async_chunk.yaml / qwen3_tts_batch.yaml configs).
- Unblocks enforce_eager: false on Stage 1 for users who want FULL-mode graphs there.
Required config pairing (note for users, not part of this patch)
Users who enable enforce_eager: false on Stage 1 must also restrict CUDA graph bucket sizes to multiples of num_quantizers (16 for Qwen3-TTS-12Hz-0.6B). Qwen3TTSCode2Wav.forward skips inputs where input_ids.numel() % num_quantizers != 0; vLLM's default bucket list includes sizes like 4, 8, 24, 40, which the vocoder would skip, leaving the outer capture in an invalid state. Restricting the capture sizes on both stages (for example cudagraph_capture_sizes: [16, 32, 48, 64], the list used in the test plan) avoids the skip path entirely. This is a configuration-side fix only; no code change in this patch.
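As a rough illustration of the pairing rule above, the helper below filters a candidate bucket list down to sizes the vocoder would not skip. `graph_safe_sizes` and the sample list are hypothetical and not part of vLLM or vllm-omni; the sample only mimics the kind of sizes mentioned above and is not vLLM's actual default list.

```python
from typing import Iterable, List


def graph_safe_sizes(candidate_sizes: Iterable[int], num_quantizers: int) -> List[int]:
    """Keep only positive sizes that are exact multiples of num_quantizers."""
    if num_quantizers <= 0:
        raise ValueError("num_quantizers must be positive")
    # Mirrors the skip condition numel % num_quantizers != 0 described above.
    return sorted({s for s in candidate_sizes if s > 0 and s % num_quantizers == 0})


# Illustrative bucket list only (an assumption, not vLLM's real default).
sample_buckets = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]
print(graph_safe_sizes(sample_buckets, num_quantizers=16))  # [16, 32, 48, 64]
```

The result matches the capture-size list used in the test plan, so a check like this could serve as a quick sanity test before writing the list into a stage config.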
Measurements (FULL vs eager on Stage 1 at N=12)
Measured on both Ada and Blackwell, matching configs, 3 reps × 145-char DE sentence, x_vector_only_mode=true, async_chunk=false.
Blackwell measured +5 % throughput with Stage 1 FULL graphs vs eager. Ada gain is within noise because the GPU is already at ~77 % utilisation on Stage 0 + other stack components; Blackwell has more headroom for the vocoder graph speedup to propagate.
The larger value of this patch is architectural: it unblocks Stage 1 FULL graphs so downstream optimisations (FP8 Stage 0, larger Qwen3-TTS model variants, streaming async_chunk at higher concurrency) have a graph-accelerated Stage 1 to build on.
Test plan
- enforce_eager: false on Stage 1 + cudagraph_capture_sizes: [16, 32, 48, 64] (previously: RuntimeError during warmup).
- enforce_eager: true on Stage 1 (the existing default). No behaviour change.
- async_chunk=false; no regressions.
Related
- … qwen3_tts_code2wav.py::make_omni_output) to even reach the decode() path in some configs.
Credits
Discovered and characterised during a cross-server benchmark session against Nebius-hosted RTX 6000 Pro Blackwell, 2026-04-18. Reproduced and validated on RTX 6000 Ada against Hetzner GEX130 production stack.