Conversation
Signed-off-by: xjx <493337577@qq.com>
Code Review
This pull request introduces a change in vllm/model_executor/models/qwen3_next.py to skip the execution of the _forward_core method in Qwen3NextGatedDeltaNet when a CUDA graph is being captured. This is done by adding a condition that checks torch.cuda.is_current_stream_capturing(). This prevents Triton kernels within this method from being executed during graph capture, which is a necessary step as they are not typically graph-safe. The change is localized and addresses the stated purpose of the pull request.
Note: Security Review did not run due to the size of the PR.
@Isotr0py pls take a look, thanks
Which GPU and Triton version are you using? I can't reproduce this failure on RTX3090:
vllm serve /mnt/data0/LLM/Qwen3.5-0.8B/ --speculative_config '{"method": "mtp", "num_speculative_tokens":2}'
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=3874023) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.68it/s]
(EngineCore_DP0 pid=3874023)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.44 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [gpu_model_runner.py:4494] Loading drafter model...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 8.52it/s]
(EngineCore_DP0 pid=3874023)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [default_loader.py:293] Loading weights took 0.14 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1378] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:04 [eagle.py:1432] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:4553] Model loading took 1.76 GiB memory and 1.129761 seconds
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:05 [gpu_model_runner.py:5475] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:09 [backends.py:1035] Dynamo bytecode transform time: 1.02 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.632 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 1.80 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/e19c1dec00a3f53db9a332380dc52a844d914f12adb5f36ca2b695d46a90c876/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.01 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:975] Using cache directory: /home/mozf/.cache/vllm/torch_compile_cache/9093e5f81b/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:1035] Dynamo bytecode transform time: 0.11 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.052 s
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:48] torch.compile took 0.17 s in total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [decorators.py:477] Directly load AOT compilation from path /home/mozf/.cache/vllm/torch_compile_cache/torch_aot_compile/27be4f948aa418b8176df25af512a7572f985b510e99ee43fea23336385a7e5c/rank_0_0/model
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:10 [monitor.py:76] Initial profiling/warmup run took 0.00 s
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:12 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:12 [gpu_model_runner.py:5596] Profiling CUDA graph memory: PIECEWISE=49 (largest=498), FULL=49 (largest=498)
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:26 [gpu_model_runner.py:5675] Estimated CUDA graph memory: 0.45 GiB total
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:456] Available KV cache memory: 17.3 GiB
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9190 to maintain the same effective KV cache size.
(EngineCore_DP0 pid=3874023) WARNING 03-10 13:58:27 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1314] GPU KV cache size: 323,680 tokens
(EngineCore_DP0 pid=3874023) INFO 03-10 13:58:27 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 4.85x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:01<00:00, 31.31it/s]
Capturing CUDA graphs (decode, FULL): 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 46/49 [00:07<00:01, 1.57it/s]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (9) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (6) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 48/49 [00:11<00:01, 1.04s/it]/home/mozf/develop-projects/vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*args, **kwargs)
(EngineCore_DP0 pid=3874023) /home/mozf/develop-projects/vllm/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (3) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=3874023) return fn(*contiguous_args, **contiguous_kwargs)
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 49/49 [00:11<00:00, 4.22it/s]
@flutist I cannot reproduce this. I think it's perhaps related to your environment. How did you install vllm? Also, these kernels should be captured for decoding.
Yes, it's a known issue for Qwen when MTP is enabled. But it has been fixed on main according to my test.
The error still occurs on the main branch on my side, and I have opened a new PR to solve it: #36634


When launching vllm serve Qwen/Qwen3.5-0.8B --speculative_config '{"method": "mtp", "num_speculative_tokens":2}', the console shows an error; after applying this change, everything works.
Fix: Skip _forward_core during CUDA Graph capture to avoid Triton kernel errors
Problem
When running the Qwen3Next model with speculative decoding (MTP method) in vLLM, CUDA Graph capture in FULL mode fails with:
RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
Root cause: During CUDA Graph capture, the _forward_core method of Qwen3NextGatedDeltaNet is invoked as a custom op. This method calls Triton JIT kernels — causal_conv1d_update and fused_sigmoid_gating_delta_rule_update — which, on their first invocation, trigger _init_handles() → load_binary() → cuModuleLoadData(). The CUDA driver forbids cuModuleLoadData while a stream is being captured, causing the runtime error.
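The failure mode can be modeled without a GPU. The sketch below uses hypothetical GraphCapture and LazyKernel classes (not real Triton or vLLM APIs) to show why a kernel's first invocation fails inside capture while a pre-warmed kernel replays fine — the real chain being _init_handles() → load_binary() → cuModuleLoadData():

```python
# Toy model of lazy kernel compilation colliding with graph capture.
# GraphCapture and LazyKernel are illustrative stand-ins only.

class GraphCapture:
    """Mimics a CUDA stream entering and leaving capture mode."""
    capturing = False

    def __enter__(self):
        GraphCapture.capturing = True
        return self

    def __exit__(self, *exc):
        GraphCapture.capturing = False


class LazyKernel:
    """First call compiles and loads the binary; later calls reuse it."""

    def __init__(self):
        self.loaded = False

    def __call__(self):
        if not self.loaded:
            # Models cuModuleLoadData, which the driver forbids
            # while a stream is being captured.
            if GraphCapture.capturing:
                raise RuntimeError(
                    "operation not permitted when stream is capturing")
            self.loaded = True  # module load succeeds outside capture
        return "ran"


cold = LazyKernel()
try:
    with GraphCapture():
        cold()              # cold kernel inside capture -> fails
    cold_failed = False
except RuntimeError:
    cold_failed = True

warm = LazyKernel()
warm()                      # warmed up before capture begins
with GraphCapture():
    result = warm()         # already loaded, safe to replay
```

This also explains why the error only surfaces for kernels that were never exercised before capture began, and why it can be environment-dependent.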
Solution
Added an early return guard in _forward_core (vllm/model_executor/models/qwen3_next.py) that detects CUDA Graph capture state via torch.cuda.is_current_stream_capturing() and skips the entire method body during capture.
if torch.cuda.is_current_stream_capturing():
    return
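The guard can be sketched in isolation. Here is_capturing and run_triton_kernels are hypothetical stand-ins for torch.cuda.is_current_stream_capturing() and the Triton calls inside _forward_core; this is a minimal illustration of the pattern, not the actual vLLM code:

```python
def forward_core(hidden_states, is_capturing, run_triton_kernels):
    # During the dummy pass that records the CUDA graph, the Triton
    # kernels must not run: their lazy compilation would call
    # cuModuleLoadData, which the driver forbids mid-capture.
    if is_capturing():
        return None
    return run_triton_kernels(hidden_states)


# Outside capture the kernels run normally; inside capture the whole
# body is skipped.
normal = forward_core("x", lambda: False, lambda h: h.upper())
skipped = forward_core("x", lambda: True, lambda h: h.upper())
```

One design note: an alternative to skipping is warming the kernels up with an eager dummy run before capture starts, so the modules are already loaded when capture begins; the early-return guard is the more localized fix.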