skip triton when graph capture #36587

Draft
flutist wants to merge 6 commits into vllm-project:main from flutist:skip_triton_when_graph_capture

Conversation

@flutist
Contributor

@flutist flutist commented Mar 10, 2026

When launching `vllm serve Qwen/Qwen3.5-0.8B --speculative_config '{"method": "mtp", "num_speculative_tokens": 2}'`, the console shows an error ([screenshot]). After applying this change, everything works ([screenshot]).

Fix: Skip _forward_core during CUDA Graph capture to avoid Triton kernel errors
Problem
When running the Qwen3Next model with speculative decoding (MTP method) in vLLM, CUDA Graph capture in FULL mode fails with:

RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
Root cause: During CUDA Graph capture, the _forward_core method of Qwen3NextGatedDeltaNet is invoked as a custom op. This method calls Triton JIT kernels — causal_conv1d_update and fused_sigmoid_gating_delta_rule_update — which internally trigger _init_handles() → load_binary() → cuModuleLoadData(). The CUDA driver forbids cuModuleLoadData while a stream is being captured, causing the runtime error.
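The usual workaround for this class of failure is to run the op once outside capture so lazy compilation and module loading happen eagerly, then capture. A minimal sketch of that warmup-then-capture pattern, following PyTorch's documented CUDA Graphs recipe (function and variable names here are illustrative, not vLLM's API):

```python
import torch

def warmup_then_capture(fn, n: int):
    """Run `fn` once on a side stream before capture so any lazy work
    (e.g. Triton JIT compile -> cuModuleLoadData) happens eagerly, then
    record the real launch into a CUDA graph. Returns None on CPU-only
    machines, since graph capture requires a CUDA device."""
    if not torch.cuda.is_available():
        return None
    x = torch.randn(n, device="cuda")
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        fn(x)  # warmup: module loads happen here, outside capture
    torch.cuda.current_stream().wait_stream(s)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        out = fn(x)  # now safe: no cuModuleLoadData inside capture
    return g, out
```

This is the generic pattern from PyTorch's CUDA Graphs documentation; it does not by itself explain why vLLM's own warmup pass did not pre-load these particular kernels before FULL-mode capture in this report.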

Solution
Added an early return guard in _forward_core (vllm/model_executor/models/qwen3_next.py) that detects CUDA Graph capture state via torch.cuda.is_current_stream_capturing() and skips the entire method body during capture.

if torch.cuda.is_current_stream_capturing():
    return
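A self-contained sketch of how such a guard behaves (the function names and the op body here are illustrative stand-ins, not the actual `_forward_core`):

```python
import torch

def _capturing() -> bool:
    # True only while the current CUDA stream is being captured into a
    # CUDA graph; False on CPU-only builds or outside capture.
    return torch.cuda.is_available() and torch.cuda.is_current_stream_capturing()

def forward_core_sketch(x: torch.Tensor):
    # Illustrative stand-in for a graph-unsafe op body: skip the
    # Triton-backed work entirely while a graph is being captured.
    if _capturing():
        return None
    return x * 2  # placeholder for the real conv/gating kernels
```

Worth noting: returning early records no work into the captured graph for this op, so correctness at replay time depends on the op being executed outside the captured region.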

Signed-off-by: xjx <493337577@qq.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a change in vllm/model_executor/models/qwen3_next.py to skip the execution of the _forward_core method in Qwen3NextGatedDeltaNet when a CUDA graph is being captured. This is done by adding a condition that checks torch.cuda.is_current_stream_capturing(). This prevents Triton kernels within this method from being executed during graph capture, which is a necessary step as they are not typically graph-safe. The change is localized and addresses the stated purpose of the pull request.

Note: Security Review did not run due to the size of the PR.

@flutist
Contributor Author

flutist commented Mar 10, 2026

@Isotr0py please take a look, thanks.

Member

@Isotr0py Isotr0py left a comment


Which GPU and Triton version are you using? I can't reproduce this failure on RTX3090:

vllm serve /mnt/data0/LLM/Qwen3.5-0.8B/ --speculative_config '{"method": "mtp", "num_speculative_tokens":2}'
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.68it/s]
(EngineCore_DP0 pid=3874023) 
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.44 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [gpu_model_runner.py:4494] Loading drafter model...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  8.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  8.52it/s]
(EngineCore_DP0 pid=3874023) 
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.14 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1378] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1432] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:4553] Model loading took 1.76 GiB memory and 1.129761 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:5475] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:1035] Dynamo bytecode transform time: 1.02 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.632 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 1.80 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/e19c1dec00a3f53db9a332380dc52a844d914f12adb5f36ca2b695d46a90c876/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.01 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:1035] Dynamo bytecode transform time: 0.11 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.052 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 0.17 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/27be4f948aa418b8176df25af512a7572f985b510e99ee43fea23336385a7e5c/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.00 s
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:12 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:12 [gpu_model_runner.py:5596] Profiling CUDA graph memory: PIECEWISE=49 (largest=498), FULL=49 (largest=498)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:26 [gpu_model_runner.py:5675] Estimated CUDA graph memory: 0.45 GiB total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:456] Available KV cache memory: 17.3 GiB
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9190 to maintain the same effective KV cache size.
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:27 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1314] GPU KV cache size: 323,680 tokens
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 4.85x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:01<00:00, 31.31it/s]
Capturing CUDA graphs (decode, FULL):  94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏       | 46/49 [00:07<00:01,  1.57it/s]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL):  98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍  | 48/49 [00:11<00:01,  1.04s/it]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:11<00:00,  4.22it/s]

flutist added 2 commits March 10, 2026 14:21
Signed-off-by: xjx <493337577@qq.com>
@flutist
Contributor Author

flutist commented Mar 10, 2026

Which GPU and Triton version are you using? I can't reproduce this failure on RTX3090:

vllm serve /mnt/data0/LLM/Qwen3.5-0.8B/ --speculative_config '{"method": "mtp", "num_speculative_tokens":2}'
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.68it/s]
(EngineCore_DP0 pid=3874023) 
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.44 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [gpu_model_runner.py:4494] Loading drafter model...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  8.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  8.52it/s]
(EngineCore_DP0 pid=3874023) 
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.14 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1378] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1432] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:4553] Model loading took 1.76 GiB memory and 1.129761 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:5475] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:1035] Dynamo bytecode transform time: 1.02 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.632 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 1.80 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/e19c1dec00a3f53db9a332380dc52a844d914f12adb5f36ca2b695d46a90c876/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.01 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:1035] Dynamo bytecode transform time: 0.11 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.052 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 0.17 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/27be4f948aa418b8176df25af512a7572f985b510e99ee43fea23336385a7e5c/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.00 s
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:12 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:12 [gpu_model_runner.py:5596] Profiling CUDA graph memory: PIECEWISE=49 (largest=498), FULL=49 (largest=498)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:26 [gpu_model_runner.py:5675] Estimated CUDA graph memory: 0.45 GiB total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:456] Available KV cache memory: 17.3 GiB
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9190 to maintain the same effective KV cache size.
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:27 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1314] GPU KV cache size: 323,680 tokens
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 4.85x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:01<00:00, 31.31it/s]
Capturing CUDA graphs (decode, FULL):  94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏       | 46/49 [00:07<00:01,  1.57it/s]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL):  98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍  | 48/49 [00:11<00:01,  1.04s/it]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023)   return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:11<00:00,  4.22it/s]

==============================
CUDA / GPU Info

Is CUDA available : True
CUDA runtime version : 12.8.61
CUDA_MODULE_LOADING set to :
GPU models and configuration :
GPU 0: NVIDIA L20
GPU 1: NVIDIA L20
[conda] triton 3.6.0 pypi_0 pypi

These are the error logs:

(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]   █▄█▀ █     █     █     █  model   Qwen/Qwen3.5-0.8B
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302] 
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:238] non-default args: {'model_tag': 'Qwen/Qwen3.5-0.8B', 'model': 'Qwen/Qwen3.5-0.8B', 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1416898) INFO 03-10 14:42:00 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1416898) INFO 03-10 14:42:00 [model.py:1554] Using max model len 262144
(APIServer pid=1416898) INFO 03-10 14:42:08 [model.py:531] Resolved architecture: Qwen3_5MTP
(APIServer pid=1416898) INFO 03-10 14:42:08 [model.py:1554] Using max model len 262144
(APIServer pid=1416898) WARNING 03-10 14:42:08 [speculative.py:487] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1416898) INFO 03-10 14:42:08 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1416898) INFO 03-10 14:42:08 [config.py:544] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1416898) INFO 03-10 14:42:08 [config.py:575] Padding mamba page size by 0.37% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1416898) INFO 03-10 14:42:08 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:32 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='Qwen/Qwen3.5-0.8B', speculative_config=SpeculativeConfig(method='mtp', model='Qwen/Qwen3.5-0.8B', num_spec_tokens=2), tokenizer='Qwen/Qwen3.5-0.8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-0.8B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:35 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://33.1.35.33:36557 backend=nccl
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:35 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=1418231) WARNING 03-10 14:42:36 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:48 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:48 [gpu_model_runner.py:4255] Starting to load model Qwen/Qwen3.5-0.8B...
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:49 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:49 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:50 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:50 [flash_attn.py:587] Using FlashAttention version 2
(EngineCore_DP0 pid=1418231) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=1418231) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:27<00:00, 27.50s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:27<00:00, 27.50s/it]
(EngineCore_DP0 pid=1418231) 
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:19 [default_loader.py:293] Loading weights took 27.55 seconds
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:19 [gpu_model_runner.py:4279] Loading drafter model...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.49it/s]
(EngineCore_DP0 pid=1418231) 
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:20 [default_loader.py:293] Loading weights took 0.71 seconds
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:20 [eagle.py:1381] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:20 [eagle.py:1435] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:21 [gpu_model_runner.py:4338] Model loading took 1.76 GiB memory and 31.185971 seconds
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:21 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:28 [backends.py:916] Using cache directory: /home/admin/.cache/vllm/torch_compile_cache/179c7b3119/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:28 [backends.py:976] Dynamo bytecode transform time: 3.96 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:29 [backends.py:350] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [backends.py:366] Compiling a graph for compile range (1, 2048) takes 1.01 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [monitor.py:35] torch.compile takes 6.05 s in total
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [decorators.py:580] saving AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/1dd8e784de27f218399d872f85173023ec01f602ef1672dbf6fc5585654dacf2/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [decorators.py:588] saved AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/1dd8e784de27f218399d872f85173023ec01f602ef1672dbf6fc5585654dacf2/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [backends.py:916] Using cache directory: /home/admin/.cache/vllm/torch_compile_cache/179c7b3119/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [backends.py:976] Dynamo bytecode transform time: 0.58 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [backends.py:366] Compiling a graph for compile range (1, 2048) takes 0.12 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [monitor.py:35] torch.compile takes 0.79 s in total
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [decorators.py:580] saving AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/03438994f4c39ed4b1b0fa536801cf6f5dfeaedec23ca91b73efe97592f57cf8/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [decorators.py:588] saved AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/03438994f4c39ed4b1b0fa536801cf6f5dfeaedec23ca91b73efe97592f57cf8/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:32 [gpu_worker.py:424] Available KV cache memory: 36.18 GiB
(EngineCore_DP0 pid=1418231) WARNING 03-10 14:43:32 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:32 [kv_cache_utils.py:1314] GPU KV cache size: 677,280 tokens
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:32 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 10.14x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████| 49/49 [00:01<00:00, 34.29it/s]
Capturing CUDA graphs (decode, FULL):   0%|                                                                    | 0/49 [00:11<?, ?it/s]
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 281, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     output = self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_5.py", line 738, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     hidden_states = self.language_model.model(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 402, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self.fn(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1132, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     def forward(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/caching.py", line 198, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     raise e
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "<eval_with_key>.51", line 208, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     submod_1 = self.submod_1(getitem, s59, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     raise e
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "<eval_with_key>.53", line 5, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     gdn_attention_core = torch.ops.vllm.gdn_attention_core(mixed_qkv, b_1, a_1, core_attn_out, 'language_model.model.layers.0.linear_attn');  mixed_qkv = b_1 = a_1 = core_attn_out = gdn_attention_core = None
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/_ops.py", line 1209, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1451, in gdn_attention_core
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self._forward_core(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 683, in _forward_core
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     mixed_qkv_spec = causal_conv1d_update(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                      ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 1196, in causal_conv1d_update
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     _causal_conv1d_update_kernel[grid](
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/runtime/jit.py", line 370, in <lambda>
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/runtime/jit.py", line 743, in run
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/compiler/compiler.py", line 482, in launch_metadata
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self._init_handles()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/compiler/compiler.py", line 465, in _init_handles
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] 
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] 
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 279, in _initialize_kv_caches
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 76, in collective_rpc
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 522, in compile_or_warm_up_model
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5337, in capture_model
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self._capture_cudagraphs(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5438, in _capture_cudagraphs
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     dummy_run(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4976, in _dummy_run
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     outputs = self.model(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]               ^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 275, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     with torch.cuda.graph(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]          ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 268, in __exit__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self.cuda_graph.capture_end()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 130, in capture_end
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     super().capture_end()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] 
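The traceback above shows Triton's lazy `_init_handles()` calling `load_binary()` (which performs `cuModuleLoadData`) the first time a JIT kernel is launched; the CUDA driver forbids module loading while a stream is capturing, which invalidates the capture. The fix guards `_forward_core` with `torch.cuda.is_current_stream_capturing()`. A minimal sketch of that guard pattern, with `is_capturing` as a hypothetical stand-in for the torch call and the arithmetic standing in for the real Triton kernels:

```python
def forward_core(x, is_capturing):
    """Sketch of the early-return guard added to _forward_core."""
    if is_capturing:
        # During CUDA Graph capture, skip the Triton path entirely so no
        # lazy cuModuleLoadData can fire while the stream is capturing.
        return None
    # Placeholder for causal_conv1d_update / fused_sigmoid_gating_delta_rule_update.
    return x * 2

print(forward_core(3, is_capturing=False))  # kernels run normally: 6
print(forward_core(3, is_capturing=True))   # body skipped during capture: None
```

Note the trade-off: skipping the body means the captured graph records no work for this op, so the Triton kernels must be warmed up (compiled and loaded) outside capture for replay to be correct.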
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 522, in compile_or_warm_up_model
(EngineCore_DP0 pid=1418231)     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=1418231)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5337, in capture_model
(EngineCore_DP0 pid=1418231)     self._capture_cudagraphs(
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5438, in _capture_cudagraphs
(EngineCore_DP0 pid=1418231)     dummy_run(
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4976, in _dummy_run
(EngineCore_DP0 pid=1418231)     outputs = self.model(
(EngineCore_DP0 pid=1418231)               ^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 275, in __call__
(EngineCore_DP0 pid=1418231)     with torch.cuda.graph(
(EngineCore_DP0 pid=1418231)          ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 268, in __exit__
(EngineCore_DP0 pid=1418231)     self.cuda_graph.capture_end()
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 130, in capture_end
(EngineCore_DP0 pid=1418231)     super().capture_end()
(EngineCore_DP0 pid=1418231) torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
(EngineCore_DP0 pid=1418231) Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=1418231) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=1418231) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=1418231) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=1418231) 
[rank0]:[W310 14:43:47.999000372 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1416898) Traceback (most recent call last):
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/bin/vllm", line 6, in <module>
(APIServer pid=1416898)     sys.exit(main())
(APIServer pid=1416898)              ^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1416898)     args.dispatch_function(args)
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1416898)     uvloop.run(run_server(args))
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1416898)     return __asyncio.run(
(APIServer pid=1416898)            ^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1416898)     return runner.run(main)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1416898)     return self._loop.run_until_complete(task)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1416898)     return await main
(APIServer pid=1416898)            ^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1416898)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1416898)     async with build_async_engine_client(
(APIServer pid=1416898)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1416898)     return await anext(self.gen)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1416898)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1416898)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1416898)     return await anext(self.gen)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1416898)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1416898)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1416898)     return cls(
(APIServer pid=1416898)            ^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1416898)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1416898)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1416898)     return func(*args, **kwargs)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=1416898)     return AsyncMPClient(*client_args)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1416898)     return func(*args, **kwargs)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=1416898)     super().__init__(
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=1416898)     with launch_core_engines(
(APIServer pid=1416898)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1416898)     next(self.gen)
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=1416898)     wait_for_engine_startup(
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=1416898)     raise RuntimeError(
(APIServer pid=1416898) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
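The traceback above shows Triton's lazy `load_binary` path being rejected while the stream is capturing. The fix this PR proposes is an early-return guard in `_forward_core`. Below is a minimal, standalone sketch of that guard; the helper name and the injectable `is_capturing` parameter are illustrative additions so the logic can be exercised without a GPU, not part of the actual patch:

```python
def should_skip_during_capture(is_capturing=None) -> bool:
    """Return True when Triton-backed work must be skipped.

    The CUDA driver forbids cuModuleLoadData (triggered by Triton's lazy
    kernel loading) while a stream is being captured, so callers bail out
    early in that state. `is_capturing` can be injected for testing; by
    default the real CUDA capture state is queried.
    """
    if is_capturing is None:
        # Deferred import so the helper stays importable without torch.
        import torch

        is_capturing = (
            torch.cuda.is_available()
            and torch.cuda.is_current_stream_capturing()
        )
    return is_capturing
```

In the actual patch this collapses to `if torch.cuda.is_current_stream_capturing(): return` at the top of `_forward_core`, as shown in the PR description, so the op becomes a no-op during capture and nothing is recorded into the graph for it.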

@ZJY0516
Member

ZJY0516 commented Mar 10, 2026

@flutist I cannot reproduce this. I think it's perhaps related to your environment. How did you install vllm?

And these should be captured for decoding

@flutist
Contributor Author

flutist commented Mar 10, 2026

> @flutist I cannot reproduce this. I think it's perhaps related to your environment. How did you install vllm?
>
> And these should be captured for decoding

The problem is reported in #36498 and only occurs in version 0.17.0. I installed it with `pip install vllm`.

@flutist flutist marked this pull request as draft March 10, 2026 09:46
@ZJY0516
Member

ZJY0516 commented Mar 10, 2026

> > @flutist I cannot reproduce this. I think it's perhaps related to your environment. How did you install vllm?
> >
> > And these should be captured for decoding
>
> The problem is reported in #36498 and only occurs in version 0.17.0. I installed it with `pip install vllm`.

Yes, it's a known issue for qwen ima when MTP is enabled. But it has been fixed on main according to my test.

@flutist
Contributor Author

flutist commented Mar 10, 2026

> > @flutist I cannot reproduce this. I think it's perhaps related to your environment. How did you install vllm?
> >
> > And these should be captured for decoding
>
> The problem is reported in #36498 and only occurs in version 0.17.0. I installed it with `pip install vllm`.
>
> Yes, it's a known issue for qwen ima when MTP is enabled. But it has been fixed on main according to my test.

It still errors on the main branch on my side, so I opened a new PR to solve it: #36634


Labels

qwen Related to Qwen models
