[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture by michael-chipmates · Pull Request #2910 · vllm-project/vllm-omni

michael-chipmates · 2026-04-19T03:56:56Z

[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture

Summary

CUDAGraphDecoderWrapper.decode() fails when it is called from inside an outer CUDA stream capture, specifically during vLLM's cudagraph_mode: FULL warmup on Stage 1 (Code2Wav), because it tries to replay its inner graph while that capture is still active. This PR adds a small guard that falls back to eager execution when a stream capture is already active, unblocking enforce_eager: false on Stage 1 for Qwen3-TTS.

Repro

Model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
vllm-omni: 0.18.0 + vllm: 0.18.0

Stage config (Stage 1):

- stage_id: 1
  engine_args:
    model_stage: code2wav
    enforce_eager: false
    cudagraph_capture_sizes: [16, 32, 48, 64]
    # ...

Launch: vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-Base --omni --stage-configs-path <cfg>

Crash during startup:

RuntimeError: Cannot prepare for replay during capturing stage.
Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive

Stack ends at cuda_graph_decoder_wrapper.py:151 → self.graphs[padded_size].replay().

Root cause

vLLM's FULL-mode warmup opens an outer CUDA stream capture on the compute stream and then calls the model's forward pass inside that capture. Qwen3TTSCode2Wav internally invokes CUDAGraphDecoderWrapper.decode() for the speech tokenizer decoder, which (once warmed up) replays its inner captured graph. PyTorch does not allow replaying an inner captured graph while another capture is active on the same stream.

Neither #2376 (Qwen3-Omni-7B scope, different code path) nor #2868 (fires earlier at make_omni_output, tuple unpacking) addresses this specific failure — both must be in place for the graph capture path to even reach decode() in some configs.

Fix

Add a torch.cuda.is_current_stream_capturing() guard at the top of decode(), after the existing early-return for enabled/warmed_up/shape. When an outer capture is active, fall back to eager; the outer capture then completes cleanly, and all future decode() calls (at inference time, outside any capture) hit the graph fast path as before.

def decode(self, codes: torch.Tensor) -> torch.Tensor:
    if not self.enabled or not self._warmed_up or codes.shape[0] != 1:
        return self.decoder(codes)

    # Inner CUDA graph replay is illegal while an outer stream capture is
    # active (e.g. vLLM's cudagraph_mode=FULL warmup on Stage 1). Fall back
    # to eager in that case so the outer capture can complete. The guard is
    # a no-op at runtime: is_current_stream_capturing() returns False
    # outside the startup capture window, so normal inference still hits
    # the graph fast path.
    if torch.cuda.is_current_stream_capturing():
        return self.decoder(codes)

    actual_size = codes.shape[-1]
    # ...
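
The control flow of the guard can be sketched without a GPU. The snippet below is a toy model, not the vllm-omni implementation: `ToyStream`, `ToyGraph`, and `GuardedDecoder` are hypothetical stand-ins, and the `capturing` flag plays the role of `torch.cuda.is_current_stream_capturing()`.

```python
class ToyStream:
    """Stand-in for a CUDA stream; tracks only whether a capture is active."""

    def __init__(self):
        self.capturing = False


class ToyGraph:
    """Stand-in for a captured graph whose replay is illegal during capture."""

    def __init__(self, stream, fn):
        self.stream = stream
        self.fn = fn

    def replay(self, x):
        if self.stream.capturing:
            # Mirrors the error message quoted in the Repro section.
            raise RuntimeError("Cannot prepare for replay during capturing stage.")
        return ("graph", self.fn(x))


class GuardedDecoder:
    """Hypothetical wrapper: replay the graph unless a capture is active."""

    def __init__(self, stream, fn, guard=True):
        self.stream = stream
        self.fn = fn
        self.graph = ToyGraph(stream, fn)
        self.guard = guard

    def decode(self, x):
        if self.guard and self.stream.capturing:
            # Eager fallback so the outer capture can complete.
            return ("eager", self.fn(x))
        return self.graph.replay(x)


def double(x):
    return 2 * x


stream = ToyStream()
guarded = GuardedDecoder(stream, double)
unguarded = GuardedDecoder(stream, double, guard=False)

stream.capturing = True              # outer capture window (startup warmup)
during_capture = guarded.decode(3)   # takes the eager fallback
try:
    unguarded.decode(3)
    unguarded_failed = False
except RuntimeError:
    unguarded_failed = True          # the pre-patch behaviour

stream.capturing = False             # normal inference, no capture active
after_capture = guarded.decode(3)    # takes the graph fast path
```

In this model the guarded wrapper returns the same value on both paths; only the route (eager versus graph) differs, which is the property the patch relies on.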

Runtime impact

Effectively zero at inference time. The guard adds one torch.cuda.is_current_stream_capturing() check per graph-path decode() call; it returns False whenever no stream capture is active, which covers the entire inference-time path. The fallback only activates during the startup warmup capture window, which is also the only window where it is needed.

Backwards compatibility

No change when enforce_eager=true (today's default for Stage 1 in the shipped qwen3_tts_no_async_chunk.yaml / qwen3_tts_batch.yaml configs).
No change when no outer capture is active (all inference calls).
Unblocks enforce_eager: false on Stage 1 for users who want FULL-mode graphs there.

Required config pairing (note for users, not part of this patch)

Users who enable enforce_eager: false on Stage 1 must also restrict CUDA graph bucket sizes to multiples of num_quantizers (16 for Qwen3-TTS-12Hz-0.6B). Qwen3TTSCode2Wav.forward skips inputs where input_ids.numel() % num_quantizers != 0; vLLM's default bucket list includes sizes like 4, 8, 24, 40 which the vocoder would skip, leaving the outer capture in an invalid state. Restricting via:

cudagraph_capture_sizes: [16, 32, 48, 64]

on both stages avoids the skip path entirely. This is a configuration-side fix only; no code change in this patch.
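
As a rough illustration of the pairing rule, the helper below keeps only bucket sizes that are multiples of num_quantizers. `usable_capture_sizes` is a hypothetical name, and the candidate list is assembled from the sizes mentioned in this note rather than taken from vLLM's actual default list.

```python
def usable_capture_sizes(candidate_sizes, num_quantizers):
    """Return the candidate sizes the vocoder would not skip.

    A size is usable when it is a positive multiple of num_quantizers,
    matching the `numel() % num_quantizers != 0` skip condition above.
    """
    if num_quantizers <= 0:
        raise ValueError("num_quantizers must be positive")
    return [s for s in candidate_sizes if s > 0 and s % num_quantizers == 0]


# Illustrative candidates: the skipped sizes named above plus the valid ones.
candidates = [4, 8, 16, 24, 32, 40, 48, 64]
sizes = usable_capture_sizes(candidates, num_quantizers=16)
print(sizes)  # [16, 32, 48, 64]
```

The result matches the cudagraph_capture_sizes value used in the Repro config.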

Measurements (FULL vs eager on Stage 1 at N=12)

Measured on both Ada and Blackwell, matching configs, 3 reps × 145-char DE sentence, x_vector_only_mode=true, async_chunk=false.

GPU	Config	P90 latency (N=12)	Throughput (N=12)	Throughput vs eager
RTX 6000 Ada	Stage 1 eager (baseline)	2,505 ms	4.75 req/s	—
RTX 6000 Ada	Stage 1 FULL + this patch	2,485 ms	4.78 req/s	+0.6 % (noise)
RTX 6000 Pro Blackwell	Stage 1 eager	—	4.25 req/s	—
RTX 6000 Pro Blackwell	Stage 1 FULL + this patch	—	4.42 req/s	+5 %

Ada gain is within noise because the GPU is already at ~77 % utilisation on Stage 0 + other stack components; Blackwell has more headroom for the vocoder graph speedup to propagate.
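
For reference, the relative gains can be recomputed from the throughput column. This is a small sketch working from the rounded figures in the table, so small differences from the reported percentages are expected; the assumption here is that the reported values were derived from unrounded measurements.

```python
def relative_gain_percent(baseline, candidate):
    """Relative throughput change of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# Throughput values (req/s) copied from the table above.
ada = relative_gain_percent(4.75, 4.78)
blackwell = relative_gain_percent(4.25, 4.42)

print(f"Ada: {ada:+.1f} %")              # Ada: +0.6 %
print(f"Blackwell: {blackwell:+.1f} %")  # Blackwell: +4.0 %
```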

The larger value of this patch is architectural: it unblocks Stage 1 FULL graphs so downstream optimisations (FP8 Stage 0, larger Qwen3-TTS model variants, streaming async_chunk at higher concurrency) have a graph-accelerated Stage 1 to build on.

Test plan

Startup succeeds with enforce_eager: false on Stage 1 + cudagraph_capture_sizes: [16, 32, 48, 64] (previously: RuntimeError during warmup).
Startup still succeeds with enforce_eager: true on Stage 1 (the existing default). No behaviour change.
Inference correctness: audio output numerically identical between eager and FULL configs (24 kHz mono, RMS / peak / duration match within noise).
Tested at N=1, 4, 8, 12, 32 concurrent with async_chunk=false; no regressions.
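
A comparison along the lines of the correctness check above could look like the following. This is a minimal sketch, not the script used for this PR: `audio_stats` and `stats_match` are hypothetical helpers, the tolerance is an arbitrary choice, and the two signals are synthetic stand-ins for the eager and FULL outputs.

```python
import math


def audio_stats(samples, sample_rate=24000):
    """RMS, peak, and duration for a mono signal given as a list of floats."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return {"rms": rms, "peak": peak, "duration_s": len(samples) / sample_rate}


def stats_match(a, b, rel_tol=1e-3):
    """True when every statistic agrees within the relative tolerance."""
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol, abs_tol=1e-9) for k in a)


# Synthetic one-second 440 Hz tone at 24 kHz standing in for the eager output.
eager = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
# Same signal with a tiny offset standing in for numerical noise in FULL mode.
full = [s + 1e-7 for s in eager]
# A clearly different signal (10 % quieter) that should not match.
quieter = [0.9 * s for s in eager]

eager_stats = audio_stats(eager)
same = stats_match(eager_stats, audio_stats(full))
different = stats_match(eager_stats, audio_stats(quieter))
```

Here `same` is True and `different` is False, so the check tolerates numerical noise while still flagging an audible level change.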

Related

Requires #2868 ([BugFix]: Fix Qwen3-TTS code2wav fails when enforce_eager: false), which fixes OmniOutput-tuple handling in qwen3_tts_code2wav.py::make_omni_output and is needed to even reach the decode() path in some configs.
Adjacent to #2376 ([Feat][Qwen3-Omni] Add CUDA graph support for Code2Wav decoder), which covers Qwen3-Omni-7B and a different code path.

Credits

Discovered and characterised during a cross-server benchmark session against Nebius-hosted RTX 6000 Pro Blackwell, 2026-04-18. Reproduced and validated on RTX 6000 Ada against Hetzner GEX130 production stack.

linyueqian · 2026-04-19T13:53:53Z

fix dco please

Stage 1 (Code2Wav) fails startup with cudagraph_mode: FULL on Qwen3-TTS-12Hz-0.6B-Base:

RuntimeError: Cannot prepare for replay during capturing stage.
Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive

vLLM's FULL-mode warmup opens an outer CUDA stream capture and calls the model's forward pass; inside that forward, CUDAGraphDecoderWrapper attempts to replay its inner graph at line 151. PyTorch disallows replaying a captured graph while another capture is active on the stream.

Add a guard in decode() that falls back to eager when torch.cuda.is_current_stream_capturing() is true. is_current_stream_capturing() returns False outside warmup, so this is a zero-cost runtime change: the graph fast path is hit for all normal inference.

Tested against:
- RTX 6000 Ada (48 GB) + RTX 6000 Pro Blackwell (96 GB, Nebius)
- Qwen3-TTS-12Hz-0.6B-Base
- vllm-omni 0.18.0 + vllm 0.18.0
- Both async_chunk=true and async_chunk=false configs

Blackwell measured +5 % throughput at N=12 with Stage 1 FULL graphs vs eager; Ada gain is within noise (GPU already at 77 % util).

Backwards compatible: zero behaviour change when enforce_eager=true (existing default for Stage 1) or when the outer capture is not active.

Signed-off-by: ChipMates <strasserm@chipmates.ai>

linyueqian

lgtm

michael-chipmates requested a review from hsliuustc0106 as a code owner April 19, 2026 03:56

linyueqian added the ready label to trigger buildkite CI label Apr 19, 2026

michael-chipmates force-pushed the fix-qwen3tts-decoder-wrapper-inner-capture branch from a67e304 to b5561a3 Compare April 19, 2026 15:36

Sy0307 mentioned this pull request Apr 20, 2026

[BugFix]: Fix Qwen3-TTS code2wav fails when enforce_eager: false #2868

Merged

Gaohan123 added this to the v0.20.0 milestone Apr 20, 2026

Gaohan123 mentioned this pull request Apr 20, 2026

[Bug]: Qwen3-TTS code2wav fails when enforce_eager: false #2866

Closed

1 task

linyueqian enabled auto-merge (squash) April 22, 2026 03:02

linyueqian approved these changes Apr 22, 2026

View reviewed changes

linyueqian merged commit e6cf26e into vllm-project:main Apr 22, 2026
7 of 8 checks passed

qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026

[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture (…

6bb9d6b

…vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>

lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026

[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture (…

b6ca8d8

…vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture (…

bc03ec3

…vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>

daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request May 28, 2026

[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture (…

de15d33

…vllm-project#2910) Signed-off-by: ChipMates <strasserm@chipmates.ai>
