[Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture#2910

Merged
linyueqian merged 1 commit into
vllm-project:mainfrom
michael-chipmates:fix-qwen3tts-decoder-wrapper-inner-capture
Apr 22, 2026
@michael-chipmates
Contributor

Summary

CUDAGraphDecoderWrapper.decode() can fail to replay when called from inside an outer CUDA stream capture, specifically during vLLM's cudagraph_mode: FULL warmup on Stage 1 (Code2Wav). This PR adds a small guard (one check, one early return) that falls back to eager when a stream capture is already active, unblocking enforce_eager: false on Stage 1 for Qwen3-TTS.

Repro

  • Model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
  • vllm-omni: 0.18.0 + vllm: 0.18.0
  • Stage config (Stage 1):
    - stage_id: 1
      engine_args:
        model_stage: code2wav
        enforce_eager: false
        cudagraph_capture_sizes: [16, 32, 48, 64]
        # ...
  • Launch: vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-Base --omni --stage-configs-path <cfg>

Crash during startup:

RuntimeError: Cannot prepare for replay during capturing stage.
Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive

Stack ends at cuda_graph_decoder_wrapper.py:151, in self.graphs[padded_size].replay().

Root cause

vLLM's FULL-mode warmup opens an outer CUDA stream capture on the compute stream and then calls the model's forward pass inside that capture. Qwen3TTSCode2Wav internally invokes CUDAGraphDecoderWrapper.decode() for the speech tokenizer decoder, which (once warmed up) replays its inner captured graph. PyTorch does not allow replaying an inner captured graph while another capture is active on the same stream.

Neither #2376 (Qwen3-Omni-7B scope, different code path) nor #2868 (fires earlier, at the tuple unpacking in make_omni_output) addresses this specific failure. In some configs, both must be in place for the graph capture path to even reach decode().

Fix

Add a torch.cuda.is_current_stream_capturing() guard at the top of decode(), after the existing early-return for enabled/warmed_up/shape. When an outer capture is active, fall back to eager; the outer capture then completes cleanly, and all future decode() calls (at inference time, outside any capture) hit the graph fast path as before.

def decode(self, codes: torch.Tensor) -> torch.Tensor:
    if not self.enabled or not self._warmed_up or codes.shape[0] != 1:
        return self.decoder(codes)

    # Inner CUDA graph replay is illegal while an outer stream capture is
    # active (e.g. vLLM's cudagraph_mode=FULL warmup on Stage 1). Fall back
    # to eager in that case so the outer capture can complete. The guard is
    # a no-op at runtime: is_current_stream_capturing() returns False
    # outside the startup capture window, so normal inference still hits
    # the graph fast path.
    if torch.cuda.is_current_stream_capturing():
        return self.decoder(codes)

    actual_size = codes.shape[-1]
    # ...
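
The control flow of the guard can be exercised without a GPU. The following is a minimal sketch, not the real wrapper: GuardedDecoder, FakeStream, fake_replay and eager are hypothetical stand-ins, and the injected is_capturing callable plays the role that torch.cuda.is_current_stream_capturing() plays in the patch.

```python
class FakeStream:
    """Hypothetical stand-in for a CUDA stream's capture state."""

    def __init__(self):
        self.capturing = False


stream = FakeStream()


def eager(codes):
    # Stands in for self.decoder(codes): always legal, capture or not.
    return [c * 2 for c in codes]


def fake_replay(codes):
    # Stands in for graph.replay(): illegal while a capture is active.
    if stream.capturing:
        raise RuntimeError("Cannot prepare for replay during capturing stage.")
    return [c * 2 for c in codes]


class GuardedDecoder:
    """Toy wrapper that mirrors the guard's control flow only."""

    def __init__(self, eager_fn, replay_fn, is_capturing):
        self.eager_fn = eager_fn
        self.replay_fn = replay_fn
        # In the real patch this is torch.cuda.is_current_stream_capturing.
        self.is_capturing = is_capturing
        self.calls = []  # records which path each decode() call took

    def decode(self, codes):
        if self.is_capturing():
            # Outer capture active: fall back to eager instead of replaying.
            self.calls.append("eager")
            return self.eager_fn(codes)
        self.calls.append("graph")
        return self.replay_fn(codes)


wrapper = GuardedDecoder(eager, fake_replay, lambda: stream.capturing)

stream.capturing = True   # simulated warmup capture window
during_capture = wrapper.decode([1, 2, 3])

stream.capturing = False  # simulated normal inference
after_capture = wrapper.decode([1, 2, 3])
```

Both calls return the same values; only the path differs, which is the property the patch relies on.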

Runtime impact

Effectively zero at inference time. torch.cuda.is_current_stream_capturing() returns False whenever no stream capture is active, which holds for the entire inference-time path, so the only added work is one cheap status check per decode() call. The fallback only activates during the startup warmup capture window, which is also the only window where it needs to activate.

Backwards compatibility

  • No change when enforce_eager=true (today's default for Stage 1 in the shipped qwen3_tts_no_async_chunk.yaml / qwen3_tts_batch.yaml configs).
  • No change when no outer capture is active (all inference calls).
  • Unblocks enforce_eager: false on Stage 1 for users who want FULL-mode graphs there.

Required config pairing (note for users, not part of this patch)

Users who enable enforce_eager: false on Stage 1 must also restrict CUDA graph bucket sizes to multiples of num_quantizers (16 for Qwen3-TTS-12Hz-0.6B). Qwen3TTSCode2Wav.forward skips inputs where input_ids.numel() % num_quantizers != 0; vLLM's default bucket list includes sizes like 4, 8, 24, 40 which the vocoder would skip, leaving the outer capture in an invalid state. Restricting via:

cudagraph_capture_sizes: [16, 32, 48, 64]

on both stages avoids the skip path entirely. This is a configuration-side fix only; no code change in this patch.
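
As an illustration of the constraint above, the aligned bucket list can be derived by filtering candidate sizes down to multiples of num_quantizers. This is a sketch under assumptions: aligned_capture_sizes is a hypothetical helper, not part of vLLM or vllm-omni, and the candidate list is illustrative rather than vLLM's actual default.

```python
def aligned_capture_sizes(candidates, num_quantizers):
    """Keep only positive sizes that are whole multiples of num_quantizers."""
    if num_quantizers <= 0:
        raise ValueError("num_quantizers must be positive")
    return sorted({s for s in candidates if s > 0 and s % num_quantizers == 0})


# Illustrative candidates, including sizes the vocoder would skip (4, 8, 24, 40).
candidate_sizes = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]

# 16 is the num_quantizers value stated above for Qwen3-TTS-12Hz-0.6B.
sizes = aligned_capture_sizes(candidate_sizes, num_quantizers=16)
print(sizes)  # [16, 32, 48, 64]
```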

Measurements (FULL vs eager on Stage 1 at N=12)

Measured on both Ada and Blackwell with matching configs: 3 repetitions of a 145-character German (DE) sentence, x_vector_only_mode=true, async_chunk=false.

| GPU | Config | N=12 P90 | Throughput | vs eager |
|---|---|---|---|---|
| RTX 6000 Ada | Stage 1 eager (baseline) | 2,505 ms | 4.75 req/s | |
| RTX 6000 Ada | Stage 1 FULL + this patch | 2,485 ms | 4.78 req/s | +0.6 % (noise) |
| RTX 6000 Pro Blackwell | Stage 1 eager | | 4.25 req/s | |
| RTX 6000 Pro Blackwell | Stage 1 FULL + this patch | | 4.42 req/s | +5 % |

Ada gain is within noise because the GPU is already at ~77 % utilisation on Stage 0 + other stack components; Blackwell has more headroom for the vocoder graph speedup to propagate.
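
For reference, the relative changes can be recomputed from the rounded figures reported above. This is only a worked-arithmetic sketch; relative_gain_pct is a hypothetical helper. With the rounded inputs, the Ada throughput gain comes out at about +0.6 % and the Blackwell gain at about +4 %, so the reported +5 % presumably reflects unrounded measurements.

```python
def relative_gain_pct(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Throughput in req/s, taken from the rounded values reported above.
ada_throughput_gain = relative_gain_pct(4.75, 4.78)
blackwell_throughput_gain = relative_gain_pct(4.25, 4.42)

# P90 latency in ms (Ada only); a negative value means lower latency.
ada_p90_change = relative_gain_pct(2505, 2485)

print(f"Ada throughput:       {ada_throughput_gain:+.2f} %")
print(f"Blackwell throughput: {blackwell_throughput_gain:+.2f} %")
print(f"Ada P90 latency:      {ada_p90_change:+.2f} %")
```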

The larger value of this patch is architectural: it unblocks Stage 1 FULL graphs so downstream optimisations (FP8 Stage 0, larger Qwen3-TTS model variants, streaming async_chunk at higher concurrency) have a graph-accelerated Stage 1 to build on.

Test plan

  • Startup succeeds with enforce_eager: false on Stage 1 + cudagraph_capture_sizes: [16, 32, 48, 64] (previously: RuntimeError during warmup).
  • Startup still succeeds with enforce_eager: true on Stage 1 (the existing default). No behaviour change.
  • Inference correctness: audio output numerically identical between eager and FULL configs (24 kHz mono, RMS / peak / duration match within noise).
  • Tested at N=1, 4, 8, 12, 32 concurrent with async_chunk=false; no regressions.
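
The RMS / peak / duration comparison in the test plan could be scripted along these lines. This is a sketch under assumptions: audio_stats and stats_match are hypothetical helpers, the relative tolerance is an arbitrary choice rather than the threshold used for this PR, and the waveforms are synthetic stand-ins for real eager and FULL outputs.

```python
import math


def audio_stats(samples, sample_rate=24000):
    """RMS, peak, and duration of a mono waveform given as a list of floats."""
    if not samples:
        raise ValueError("samples must be non-empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    peak = max(abs(s) for s in samples)
    return {"rms": rms, "peak": peak, "duration_s": len(samples) / sample_rate}


def stats_match(a, b, rel_tol=1e-3):
    """True if every statistic agrees within rel_tol (assumed tolerance)."""
    return all(math.isclose(a[k], b[k], rel_tol=rel_tol, abs_tol=1e-9) for k in a)


# Synthetic 1-second, 440 Hz tone at 24 kHz standing in for the eager output.
eager_audio = [math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
# The "FULL" stand-in differs only by a tiny offset, mimicking numerical noise.
full_audio = [s + 1e-6 for s in eager_audio]

eager_stats = audio_stats(eager_audio)
full_stats = audio_stats(full_audio)
outputs_match = stats_match(eager_stats, full_stats)
```

A stricter check would compare the waveforms sample by sample; the summary statistics mirror what the test plan lists.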

Related

Credits

Discovered and characterised during a cross-server benchmark session against Nebius-hosted RTX 6000 Pro Blackwell, 2026-04-18. Reproduced and validated on RTX 6000 Ada against Hetzner GEX130 production stack.

@linyueqian
Collaborator

fix dco please

@linyueqian added the ready label (label to trigger buildkite CI) on Apr 19, 2026
Stage 1 (Code2Wav) fails startup with cudagraph_mode: FULL on
Qwen3-TTS-12Hz-0.6B-Base:

    RuntimeError: Cannot prepare for replay during capturing stage.
    Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive

vLLM's FULL-mode warmup opens an outer CUDA stream capture and calls
the model's forward pass; inside that forward, CUDAGraphDecoderWrapper
attempts to replay its inner graph at line 151. PyTorch disallows
replaying a captured graph while another capture is active on the
stream.

Add a guard in decode() that falls back to eager when
torch.cuda.is_current_stream_capturing() is true. is_current_stream_capturing()
returns False outside warmup, so this is a zero-cost runtime change — the
graph fast path is hit for all normal inference.

Tested against:
  - RTX 6000 Ada (48 GB) + RTX 6000 Pro Blackwell (96 GB, Nebius)
  - Qwen3-TTS-12Hz-0.6B-Base
  - vllm-omni 0.18.0 + vllm 0.18.0
  - Both async_chunk=true and async_chunk=false configs

Blackwell measured +5 % throughput at N=12 with Stage 1 FULL graphs vs
eager; Ada gain is within noise (GPU already at 77 % util).

Backwards compatible: zero behaviour change when enforce_eager=true
(existing default for Stage 1) or when the outer capture is not active.

Signed-off-by: ChipMates <strasserm@chipmates.ai>
Collaborator

@linyueqian linyueqian left a comment

lgtm

@linyueqian linyueqian merged commit e6cf26e into vllm-project:main Apr 22, 2026
7 of 8 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request May 28, 2026