From b5561a33fb9e669a8dfd6eb28b723cd230247def Mon Sep 17 00:00:00 2001
From: ChipMates <strasserm@chipmates.ai>
Date: Sun, 19 Apr 2026 05:32:25 +0200
Subject: [PATCH] [Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer
 capture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Stage 1 (Code2Wav) fails at startup with cudagraph_mode: FULL on
Qwen3-TTS-12Hz-0.6B-Base:

    RuntimeError: Cannot prepare for replay during capturing stage.
    Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive

vLLM's FULL-mode warmup opens an outer CUDA stream capture and calls
the model's forward pass; inside that forward, CUDAGraphDecoderWrapper
attempts to replay its inner graph at line 151. PyTorch disallows
replaying a captured graph while another capture is active on the
stream.

Add a guard in decode() that falls back to eager when
torch.cuda.is_current_stream_capturing() is true. The call returns False
whenever no capture is active, so outside the startup capture window the
guard costs only a cheap status check and normal inference still takes
the graph fast path.
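
For illustration only, the control flow of the guard can be sketched in
plain Python without a GPU. GuardedDecoder, is_capturing, _eager and
_replay are hypothetical names for this sketch and are not part of
vllm-omni; in the patch itself the predicate is
torch.cuda.is_current_stream_capturing() and the eager path is
self.decoder(codes).

```python
from typing import Callable, List


class GuardedDecoder:
    """Toy stand-in for a decoder wrapper that owns a graph fast path."""

    def __init__(self, is_capturing: Callable[[], bool]) -> None:
        # Hypothetical hook standing in for the stream-capture query.
        self._is_capturing = is_capturing
        self.calls: List[str] = []

    def _eager(self, codes: List[int]) -> List[int]:
        # Stand-in for the eager decoder call.
        self.calls.append("eager")
        return [c * 2 for c in codes]

    def _replay(self, codes: List[int]) -> List[int]:
        # Stand-in for replaying the inner captured graph.
        self.calls.append("replay")
        return [c * 2 for c in codes]

    def decode(self, codes: List[int]) -> List[int]:
        # An outer capture is active: replay is not allowed, so run eagerly.
        if self._is_capturing():
            return self._eager(codes)
        # No capture active: take the graph fast path.
        return self._replay(codes)
```

Both paths return the same values in this sketch; only the route taken
depends on whether a capture is in progress.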

Tested against:
  - RTX 6000 Ada (48 GB) + RTX 6000 Pro Blackwell (96 GB, Nebius)
  - Qwen3-TTS-12Hz-0.6B-Base
  - vllm-omni 0.18.0 + vllm 0.18.0
  - Both async_chunk=true and async_chunk=false configs

On Blackwell, Stage 1 FULL graphs measured +5 % throughput at N=12 vs
eager; the Ada gain is within noise (GPU already at 77 % util).

Backwards compatible: zero behaviour change when enforce_eager=true
(existing default for Stage 1) or when the outer capture is not active.

Signed-off-by: ChipMates <strasserm@chipmates.ai>
---
 .../models/qwen3_tts/cuda_graph_decoder_wrapper.py       | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py b/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py
index 96f8c799c1..0e1df2aa7d 100644
--- a/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py
+++ b/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py
@@ -140,6 +140,15 @@ def decode(self, codes: torch.Tensor) -> torch.Tensor:
         if not self.enabled or not self._warmed_up or codes.shape[0] != 1:
             return self.decoder(codes)
 
+        # Inner CUDA graph replay is illegal while an outer stream capture is
+        # active (e.g. vLLM's cudagraph_mode=FULL warmup on Stage 1). Fall back
+        # to eager in that case so the outer capture can complete. The guard is
+        # a no-op at runtime: is_current_stream_capturing() returns False
+        # outside the startup capture window, so normal inference still hits
+        # the graph fast path.
+        if torch.cuda.is_current_stream_capturing():
+            return self.decoder(codes)
+
         actual_size = codes.shape[-1]
         padded_size = self._get_padded_size(actual_size)