From b5561a33fb9e669a8dfd6eb28b723cd230247def Mon Sep 17 00:00:00 2001
From: ChipMates
Date: Sun, 19 Apr 2026 05:32:25 +0200
Subject: [PATCH] [Qwen3TTS][Bugfix] Guard inner CUDA graph replay during outer capture
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Stage 1 (Code2Wav) fails at startup with cudagraph_mode: FULL on
Qwen3-TTS-12Hz-0.6B-Base:

    RuntimeError: Cannot prepare for replay during capturing stage.
    Current cudaStreamCaptureStatus: cudaStreamCaptureStatusActive

vLLM's FULL-mode warmup opens an outer CUDA stream capture and calls the
model's forward pass; inside that forward, CUDAGraphDecoderWrapper
attempts to replay its inner graph at line 151. PyTorch disallows
replaying a captured graph while another capture is active on the
stream.

Add a guard in decode() that falls back to eager when
torch.cuda.is_current_stream_capturing() is true.
is_current_stream_capturing() returns False outside warmup, so at
runtime the guard costs one capture-status check and the graph fast
path is still taken for all normal inference.

Tested against:
- RTX 6000 Ada (48 GB) + RTX 6000 Pro Blackwell (96 GB, Nebius)
- Qwen3-TTS-12Hz-0.6B-Base
- vllm-omni 0.18.0 + vllm 0.18.0
- Both async_chunk=true and async_chunk=false configs

Blackwell measured +5% throughput at N=12 with Stage 1 FULL graphs vs
eager; the Ada gain is within noise (GPU already at 77% util).

Backwards compatible: zero behaviour change when enforce_eager=true
(existing default for Stage 1) or when the outer capture is not active.

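For illustration, the control flow of the guarded decode() can be
sketched in isolation. This is a minimal sketch and not the wrapper's
actual code: GuardedDecoder, eager_fn, replay_fn and is_capturing are
hypothetical names, and the capture check is injected so the example
runs without a GPU. In the real wrapper the check is
torch.cuda.is_current_stream_capturing().

```python
from typing import Any, Callable, Sequence


class GuardedDecoder:
    """Hypothetical stand-in for the wrapper; only the dispatch logic is modelled."""

    def __init__(
        self,
        eager_fn: Callable[[Sequence[Any]], Any],
        replay_fn: Callable[[Sequence[Any]], Any],
        is_capturing: Callable[[], bool],
    ) -> None:
        self.eager_fn = eager_fn          # plays the role of self.decoder(codes)
        self.replay_fn = replay_fn        # plays the role of the inner graph replay
        self.is_capturing = is_capturing  # injected capture-status check
        self.enabled = True
        self._warmed_up = True

    def decode(self, codes: Sequence[Any]) -> Any:
        # Existing fallback: graphs disabled, not warmed up, or batch size != 1.
        if not self.enabled or not self._warmed_up or len(codes) != 1:
            return self.eager_fn(codes)
        # New guard: never replay the inner graph while an outer capture is active.
        if self.is_capturing():
            return self.eager_fn(codes)
        return self.replay_fn(codes)


def make_demo_decoder(capturing: bool) -> GuardedDecoder:
    """Build a decoder whose two paths return a label instead of audio."""
    return GuardedDecoder(
        eager_fn=lambda codes: "eager",
        replay_fn=lambda codes: "graph",
        is_capturing=lambda: capturing,
    )
```

With capturing=False a batch of one takes the "graph" path; with
capturing=True the same call takes the "eager" path, which is what lets
the outer capture complete.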
Signed-off-by: ChipMates
---
 .../models/qwen3_tts/cuda_graph_decoder_wrapper.py | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py b/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py
index 96f8c799c1..0e1df2aa7d 100644
--- a/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py
+++ b/vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py
@@ -140,6 +140,15 @@ def decode(self, codes: torch.Tensor) -> torch.Tensor:
         if not self.enabled or not self._warmed_up or codes.shape[0] != 1:
             return self.decoder(codes)
 
+        # Inner CUDA graph replay is illegal while an outer stream capture is
+        # active (e.g. vLLM's cudagraph_mode=FULL warmup on Stage 1). Fall back
+        # to eager in that case so the outer capture can complete. The guard is
+        # a no-op at runtime: is_current_stream_capturing() returns False
+        # outside the startup capture window, so normal inference still hits
+        # the graph fast path.
+        if torch.cuda.is_current_stream_capturing():
+            return self.decoder(codes)
+
         actual_size = codes.shape[-1]
         padded_size = self._get_padded_size(actual_size)