Conversation
Signed-off-by: xjx <493337577@qq.com>
Code Review
This pull request introduces a change in vllm/model_executor/models/qwen3_next.py to skip the execution of the _forward_core method in Qwen3NextGatedDeltaNet when a CUDA graph is being captured. This is done by adding a condition that checks torch.cuda.is_current_stream_capturing(). This prevents Triton kernels within this method from being executed during graph capture, which is a necessary step as they are not typically graph-safe. The change is localized and addresses the stated purpose of the pull request.
Note: Security Review did not run due to the size of the PR.
@Isotr0py pls take a look, thanks
Which GPU and Triton version are you using? I can't reproduce this failure on RTX3090:
vllm serve /mnt/data0/LLM/Qwen3.5-0.8B/ --speculative_config '{"method": "mtp", "num_speculative_tokens":2}'
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.68it/s]
(EngineCore_DP0 pid=3874023)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.44 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [gpu_model_runner.py:4494] Loading drafter model...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.52it/s]
(EngineCore_DP0 pid=3874023)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.14 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1378] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1432] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:4553] Model loading took 1.76 GiB memory and 1.129761 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:5475] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:1035] Dynamo bytecode transform time: 1.02 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.632 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 1.80 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/e19c1dec00a3f53db9a332380dc52a844d914f12adb5f36ca2b695d46a90c876/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.01 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:1035] Dynamo bytecode transform time: 0.11 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.052 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 0.17 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/27be4f948aa418b8176df25af512a7572f985b510e99ee43fea23336385a7e5c/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.00 s
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:12 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:12 [gpu_model_runner.py:5596] Profiling CUDA graph memory: PIECEWISE=49 (largest=498), FULL=49 (largest=498)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:26 [gpu_model_runner.py:5675] Estimated CUDA graph memory: 0.45 GiB total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:456] Available KV cache memory: 17.3 GiB
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9190 to maintain the same effective KV cache size.
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:27 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1314] GPU KV cache size: 323,680 tokens
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 4.85x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:01<00:00, 31.31it/s]
Capturing CUDA graphs (decode, FULL): 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 46/49 [00:07<00:01, 1.57it/s]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 48/49 [00:11<00:01, 1.04s/it]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:11<00:00, 4.22it/s]
@flutist I cannot reproduce this. I think it's perhaps related to your environment. How did you install vllm? Also, these kernels should be captured for decoding.
Yes, it's a known issue for Qwen when MTP is enabled. But it has been fixed on main according to my test.
The error still occurs on the main branch on my side, and I have opened a new PR to solve it: #36634


When launching vllm serve Qwen/Qwen3.5-0.8B --speculative_config '{"method": "mtp", "num_speculative_tokens":2}', the console shows an error; after applying this change, everything works.
Fix: Skip _forward_core during CUDA Graph capture to avoid Triton kernel errors
Problem
When running the Qwen3Next model with speculative decoding (MTP method) in vLLM, CUDA Graph capture in FULL mode fails with:
RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
Root cause: During CUDA Graph capture, the _forward_core method of Qwen3NextGatedDeltaNet is invoked as a custom op. This method calls Triton JIT kernels — causal_conv1d_update and fused_sigmoid_gating_delta_rule_update — which, on their first invocation, trigger _init_handles() → load_binary() → cuModuleLoadData(). The CUDA driver forbids cuModuleLoadData while a stream is being captured, causing the runtime error.
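The failure mode can be modeled without a GPU. The sketch below uses hypothetical GraphCapture and LazyKernel classes (not real Triton or vLLM APIs) to show why a kernel's first invocation fails inside capture while a pre-warmed kernel replays fine — the real chain being _init_handles() → load_binary() → cuModuleLoadData():

```python
# Toy model of lazy kernel compilation colliding with graph capture.
# GraphCapture and LazyKernel are illustrative stand-ins only.

class GraphCapture:
    """Mimics a CUDA stream entering and leaving capture mode."""
    capturing = False

    def __enter__(self):
        GraphCapture.capturing = True
        return self

    def __exit__(self, *exc):
        GraphCapture.capturing = False


class LazyKernel:
    """First call compiles and loads the binary; later calls reuse it."""

    def __init__(self):
        self.loaded = False

    def __call__(self):
        if not self.loaded:
            # Models cuModuleLoadData, which the driver forbids
            # while a stream is being captured.
            if GraphCapture.capturing:
                raise RuntimeError(
                    "operation not permitted when stream is capturing")
            self.loaded = True  # module load succeeds outside capture
        return "ran"


cold = LazyKernel()
try:
    with GraphCapture():
        cold()              # cold kernel inside capture -> fails
    cold_failed = False
except RuntimeError:
    cold_failed = True

warm = LazyKernel()
warm()                      # warmed up before capture begins
with GraphCapture():
    result = warm()         # already loaded, safe to replay
```

This also explains why the error only surfaces for kernels that were never exercised before capture began, and why it can be environment-dependent.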
Solution
Added an early return guard in _forward_core (vllm/model_executor/models/qwen3_next.py) that detects CUDA Graph capture state via torch.cuda.is_current_stream_capturing() and skips the entire method body during capture.
if torch.cuda.is_current_stream_capturing():
    return
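The guard can be sketched in isolation. Here is_capturing and run_triton_kernels are hypothetical stand-ins for torch.cuda.is_current_stream_capturing() and the Triton calls inside _forward_core; this is a minimal illustration of the pattern, not the actual vLLM code:

```python
def forward_core(hidden_states, is_capturing, run_triton_kernels):
    # During the dummy pass that records the CUDA graph, the Triton
    # kernels must not run: their lazy compilation would call
    # cuModuleLoadData, which the driver forbids mid-capture.
    if is_capturing():
        return None
    return run_triton_kernels(hidden_states)


# Outside capture the kernels run normally; inside capture the whole
# body is skipped.
normal = forward_core("x", lambda: False, lambda h: h.upper())
skipped = forward_core("x", lambda: True, lambda h: h.upper())
```

One design note: an alternative to skipping is warming the kernels up with an eager dummy run before capture starts, so the modules are already loaded when capture begins; the early-return guard is the more localized fix.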