
fix mtp launch error in vllm-0.17.1-rc, about cuda graph during memory profile#36634

Open
flutist wants to merge 12 commits into vllm-project:main from flutist:warm_up_spec_before_capture

Conversation

@flutist
Contributor

@flutist flutist commented Mar 10, 2026

Populate the speculative-decoding buffers so that GDN attention triggers JIT compilation of the spec-decode kernels before CUDA graph capture. I believe the bug was introduced by #30515.
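The idea above can be sketched without any real CUDA or Triton dependency. In the sketch below, a class that "JIT-compiles" on first call stands in for the Triton `causal_conv1d_update` kernel, and a boolean stands in for the stream being in graph-capture mode; `FakeTritonKernel` and `capture_graph` are illustrative names, not vLLM APIs.

```python
class FakeTritonKernel:
    """First call triggers a one-time 'JIT compile'. Compiling while the
    stream is capturing raises, like Triton's load_binary under capture."""

    def __init__(self):
        self.compiled = False

    def __call__(self, capturing: bool) -> str:
        if not self.compiled:
            if capturing:
                # Mirrors: "Triton Error [CUDA]: operation not permitted
                # when stream is capturing"
                raise RuntimeError("operation not permitted when stream is capturing")
            self.compiled = True  # JIT compile happens eagerly, once
        return "ok"


def capture_graph(kernel: FakeTritonKernel, warm_up_first: bool) -> str:
    """The pattern of this PR: run a dummy spec-decode forward (with the
    buffers GDN attention needs populated) BEFORE entering capture, so the
    JIT compile lands outside the captured region."""
    if warm_up_first:
        kernel(capturing=False)   # eager warm-up run compiles the kernel
    return kernel(capturing=True)  # capture replays the cached binary


# With warm-up, capture succeeds.
assert capture_graph(FakeTritonKernel(), warm_up_first=True) == "ok"

# Without warm-up, first-call compilation happens inside capture and fails.
try:
    capture_graph(FakeTritonKernel(), warm_up_first=False)
except RuntimeError as e:
    print(f"without warm-up: {e}")
```

This is why the log below shows the mixed prefill-decode PIECEWISE graphs capturing fine but the decode FULL capture failing at 0%: the spec-decode-only path of `causal_conv1d_update` is first exercised inside the full-graph capture.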

Purpose

When launching vllm serve Qwen/Qwen3.5-0.8B --speculative_config '{"method": "mtp", "num_speculative_tokens": 2}'
with vLLM 0.17.0, the console shows the following error:

(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]   █▄█▀ █     █     █     █  model   Qwen/Qwen3.5-0.8B
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:302] 
(APIServer pid=1416898) INFO 03-10 14:41:51 [utils.py:238] non-default args: {'model_tag': 'Qwen/Qwen3.5-0.8B', 'model': 'Qwen/Qwen3.5-0.8B', 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2}}
(APIServer pid=1416898) INFO 03-10 14:42:00 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1416898) INFO 03-10 14:42:00 [model.py:1554] Using max model len 262144
(APIServer pid=1416898) INFO 03-10 14:42:08 [model.py:531] Resolved architecture: Qwen3_5MTP
(APIServer pid=1416898) INFO 03-10 14:42:08 [model.py:1554] Using max model len 262144
(APIServer pid=1416898) WARNING 03-10 14:42:08 [speculative.py:487] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1416898) INFO 03-10 14:42:08 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1416898) INFO 03-10 14:42:08 [config.py:544] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1416898) INFO 03-10 14:42:08 [config.py:575] Padding mamba page size by 0.37% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1416898) INFO 03-10 14:42:08 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:32 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='Qwen/Qwen3.5-0.8B', speculative_config=SpeculativeConfig(method='mtp', model='Qwen/Qwen3.5-0.8B', num_spec_tokens=2), tokenizer='Qwen/Qwen3.5-0.8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-0.8B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:35 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://33.1.35.33:36557 backend=nccl
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:35 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=1418231) WARNING 03-10 14:42:36 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:48 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:48 [gpu_model_runner.py:4255] Starting to load model Qwen/Qwen3.5-0.8B...
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:49 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:49 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:50 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=1418231) INFO 03-10 14:42:50 [flash_attn.py:587] Using FlashAttention version 2
(EngineCore_DP0 pid=1418231) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=1418231) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:27<00:00, 27.50s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:27<00:00, 27.50s/it]
(EngineCore_DP0 pid=1418231) 
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:19 [default_loader.py:293] Loading weights took 27.55 seconds
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:19 [gpu_model_runner.py:4279] Loading drafter model...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.49it/s]
(EngineCore_DP0 pid=1418231) 
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:20 [default_loader.py:293] Loading weights took 0.71 seconds
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:20 [eagle.py:1381] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:20 [eagle.py:1435] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:21 [gpu_model_runner.py:4338] Model loading took 1.76 GiB memory and 31.185971 seconds
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:21 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:28 [backends.py:916] Using cache directory: /home/admin/.cache/vllm/torch_compile_cache/179c7b3119/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:28 [backends.py:976] Dynamo bytecode transform time: 3.96 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:29 [backends.py:350] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [backends.py:366] Compiling a graph for compile range (1, 2048) takes 1.01 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [monitor.py:35] torch.compile takes 6.05 s in total
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [decorators.py:580] saving AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/1dd8e784de27f218399d872f85173023ec01f602ef1672dbf6fc5585654dacf2/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:30 [decorators.py:588] saved AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/1dd8e784de27f218399d872f85173023ec01f602ef1672dbf6fc5585654dacf2/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [backends.py:916] Using cache directory: /home/admin/.cache/vllm/torch_compile_cache/179c7b3119/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [backends.py:976] Dynamo bytecode transform time: 0.58 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [backends.py:366] Compiling a graph for compile range (1, 2048) takes 0.12 s
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [monitor.py:35] torch.compile takes 0.79 s in total
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [decorators.py:580] saving AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/03438994f4c39ed4b1b0fa536801cf6f5dfeaedec23ca91b73efe97592f57cf8/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:31 [decorators.py:588] saved AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/03438994f4c39ed4b1b0fa536801cf6f5dfeaedec23ca91b73efe97592f57cf8/rank_0_0/model
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:32 [gpu_worker.py:424] Available KV cache memory: 36.18 GiB
(EngineCore_DP0 pid=1418231) WARNING 03-10 14:43:32 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:32 [kv_cache_utils.py:1314] GPU KV cache size: 677,280 tokens
(EngineCore_DP0 pid=1418231) INFO 03-10 14:43:32 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 10.14x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████| 49/49 [00:01<00:00, 34.29it/s]
Capturing CUDA graphs (decode, FULL):   0%|                                                                    | 0/49 [00:11<?, ?it/s]
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 281, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     output = self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_5.py", line 738, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     hidden_states = self.language_model.model(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 402, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self.fn(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1132, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     def forward(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/caching.py", line 198, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     raise e
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "<eval_with_key>.51", line 208, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     submod_1 = self.submod_1(getitem, s59, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     raise e
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "<eval_with_key>.53", line 5, in forward
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     gdn_attention_core = torch.ops.vllm.gdn_attention_core(mixed_qkv, b_1, a_1, core_attn_out, 'language_model.model.layers.0.linear_attn');  mixed_qkv = b_1 = a_1 = core_attn_out = gdn_attention_core = None
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/_ops.py", line 1209, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1451, in gdn_attention_core
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self._forward_core(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 683, in _forward_core
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     mixed_qkv_spec = causal_conv1d_update(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                      ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 1196, in causal_conv1d_update
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     _causal_conv1d_update_kernel[grid](
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/runtime/jit.py", line 370, in <lambda>
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/runtime/jit.py", line 743, in run
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/compiler/compiler.py", line 482, in launch_metadata
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self._init_handles()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/compiler/compiler.py", line 465, in _init_handles
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] 
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] 
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 279, in _initialize_kv_caches
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 76, in collective_rpc
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 522, in compile_or_warm_up_model
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5337, in capture_model
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self._capture_cudagraphs(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5438, in _capture_cudagraphs
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     dummy_run(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4976, in _dummy_run
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     outputs = self.model(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]               ^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 275, in __call__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     with torch.cuda.graph(
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]          ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 268, in __exit__
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     self.cuda_graph.capture_end()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 130, in capture_end
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100]     super().capture_end()
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=1418231) ERROR 03-10 14:43:46 [core.py:1100] 
(EngineCore_DP0 pid=1418231) Process EngineCore_DP0:
(EngineCore_DP0 pid=1418231) Traceback (most recent call last):
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 281, in __call__
(EngineCore_DP0 pid=1418231)     output = self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=1418231)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_5.py", line 738, in forward
(EngineCore_DP0 pid=1418231)     hidden_states = self.language_model.model(
(EngineCore_DP0 pid=1418231)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 402, in __call__
(EngineCore_DP0 pid=1418231)     return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore_DP0 pid=1418231)     return self.fn(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1132, in forward
(EngineCore_DP0 pid=1418231)     def forward(
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/caching.py", line 198, in __call__
(EngineCore_DP0 pid=1418231)     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=1418231)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=1418231)     raise e
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=1418231)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "<eval_with_key>.51", line 208, in forward
(EngineCore_DP0 pid=1418231)     submod_1 = self.submod_1(getitem, s59, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=1418231)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=1418231)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=1418231)     raise e
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=1418231)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=1418231)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=1418231)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "<eval_with_key>.53", line 5, in forward
(EngineCore_DP0 pid=1418231)     gdn_attention_core = torch.ops.vllm.gdn_attention_core(mixed_qkv, b_1, a_1, core_attn_out, 'language_model.model.layers.0.linear_attn');  mixed_qkv = b_1 = a_1 = core_attn_out = gdn_attention_core = None
(EngineCore_DP0 pid=1418231)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/_ops.py", line 1209, in __call__
(EngineCore_DP0 pid=1418231)     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1451, in gdn_attention_core
(EngineCore_DP0 pid=1418231)     self._forward_core(
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 683, in _forward_core
(EngineCore_DP0 pid=1418231)     mixed_qkv_spec = causal_conv1d_update(
(EngineCore_DP0 pid=1418231)                      ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 1196, in causal_conv1d_update
(EngineCore_DP0 pid=1418231)     _causal_conv1d_update_kernel[grid](
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/runtime/jit.py", line 370, in <lambda>
(EngineCore_DP0 pid=1418231)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=1418231)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/runtime/jit.py", line 743, in run
(EngineCore_DP0 pid=1418231)     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
(EngineCore_DP0 pid=1418231)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/compiler/compiler.py", line 482, in launch_metadata
(EngineCore_DP0 pid=1418231)     self._init_handles()
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/triton/compiler/compiler.py", line 465, in _init_handles
(EngineCore_DP0 pid=1418231)     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=1418231)                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231) RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
(EngineCore_DP0 pid=1418231) 
(EngineCore_DP0 pid=1418231) During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=1418231) 
(EngineCore_DP0 pid=1418231) Traceback (most recent call last):
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1418231)     self.run()
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1418231)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=1418231)     raise e
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=1418231)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1418231)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=1418231)     super().__init__(
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=1418231)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=1418231)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 279, in _initialize_kv_caches
(EngineCore_DP0 pid=1418231)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore_DP0 pid=1418231)     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=1418231)                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 76, in collective_rpc
(EngineCore_DP0 pid=1418231)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=1418231)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 522, in compile_or_warm_up_model
(EngineCore_DP0 pid=1418231)     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore_DP0 pid=1418231)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5337, in capture_model
(EngineCore_DP0 pid=1418231)     self._capture_cudagraphs(
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5438, in _capture_cudagraphs
(EngineCore_DP0 pid=1418231)     dummy_run(
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=1418231)     return func(*args, **kwargs)
(EngineCore_DP0 pid=1418231)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4976, in _dummy_run
(EngineCore_DP0 pid=1418231)     outputs = self.model(
(EngineCore_DP0 pid=1418231)               ^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 275, in __call__
(EngineCore_DP0 pid=1418231)     with torch.cuda.graph(
(EngineCore_DP0 pid=1418231)          ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 268, in __exit__
(EngineCore_DP0 pid=1418231)     self.cuda_graph.capture_end()
(EngineCore_DP0 pid=1418231)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/torch/cuda/graphs.py", line 130, in capture_end
(EngineCore_DP0 pid=1418231)     super().capture_end()
(EngineCore_DP0 pid=1418231) torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
(EngineCore_DP0 pid=1418231) Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=1418231) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=1418231) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=1418231) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=1418231) 
[rank0]:[W310 14:43:47.999000372 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1416898) Traceback (most recent call last):
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/bin/vllm", line 6, in <module>
(APIServer pid=1416898)     sys.exit(main())
(APIServer pid=1416898)              ^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1416898)     args.dispatch_function(args)
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=1416898)     uvloop.run(run_server(args))
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1416898)     return __asyncio.run(
(APIServer pid=1416898)            ^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1416898)     return runner.run(main)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1416898)     return self._loop.run_until_complete(task)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1416898)     return await main
(APIServer pid=1416898)            ^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=1416898)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=1416898)     async with build_async_engine_client(
(APIServer pid=1416898)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1416898)     return await anext(self.gen)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=1416898)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1416898)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1416898)     return await anext(self.gen)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=1416898)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1416898)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=1416898)     return cls(
(APIServer pid=1416898)            ^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=1416898)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1416898)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1416898)     return func(*args, **kwargs)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=1416898)     return AsyncMPClient(*client_args)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1416898)     return func(*args, **kwargs)
(APIServer pid=1416898)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=1416898)     super().__init__(
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=1416898)     with launch_core_engines(
(APIServer pid=1416898)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1416898)     next(self.gen)
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=1416898)     wait_for_engine_startup(
(APIServer pid=1416898)   File "/home/admin/miniconda3/envs/official_deploy/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=1416898)     raise RuntimeError(
(APIServer pid=1416898) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

The root cause is that the _dummy_run warmup never initialized the spec-decode draft-token buffers, so the GDN attention metadata builder always saw num_decode_draft_tokens == 0 and skipped the spec-decode code path. As a result, the IS_SPEC_DECODING=True Triton kernel variants were left uncompiled until CUDA graph capture, where JIT compilation is forbidden.
The fix populates num_decode_draft_tokens > 0 in the dummy-run buffers before calling _build_attention_metadata. The GDN builder then produces a non-None spec_sequence_masks, which makes the model forward pass take the spec-decode code path and JIT-compile the IS_SPEC_DECODING=True Triton kernel variants during warmup, outside CUDA graph capture, so they are already compiled by the time graph capture begins.
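The gist of the fix can be sketched in plain Python (hypothetical helper and buffer names; the real logic lives in the GDN attention metadata builder and gpu_model_runner). The point is only that a mask of all zeros makes the builder return None and skip the branch that would trigger Triton JIT:

```python
# Minimal sketch (hypothetical names) of why the warmup missed the
# spec-decode branch: the builder only emits a spec_sequence_masks
# value when at least one request carries draft tokens.

def build_spec_sequence_masks(num_decode_draft_tokens: list[int]):
    """Return a per-request spec-decode mask, or None if no request
    has draft tokens (the branch that JIT-compiles the
    IS_SPEC_DECODING=True kernels is then never executed)."""
    if all(n <= 0 for n in num_decode_draft_tokens):
        return None
    return [n > 0 for n in num_decode_draft_tokens]

# Before the fix: dummy-run buffers were all zeros, so the mask was
# None and the spec-decode kernels stayed uncompiled until capture.
assert build_spec_sequence_masks([0, 0, 0]) is None

# After the fix: warmup populates draft-token counts per request,
# forcing the spec-decode branch (and its JIT) to run during warmup.
assert build_spec_sequence_masks([2, 2, 2]) == [True, True, True]
```

With a non-None mask during the dummy run, the forward pass reaches causal_conv1d_update's spec-decode variant before any torch.cuda.graph capture starts, which is exactly the ordering the Triton runtime requires.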

Test Plan

(screenshot attached)

Test Result

Everything works as expected after deploying the revised code.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…pce-decode before graph capture

Signed-off-by: xjx <493337577@qq.com>
@flutist flutist requested a review from njhill as a code owner March 10, 2026 10:23
@mergify mergify bot added the v1 label Mar 10, 2026
@flutist flutist changed the title populate buffers so that GDN attention triggers JIT complication of s… fix mtp launch error in vllm-0.17.0 Mar 10, 2026
@flutist
Copy link
Contributor Author

flutist commented Mar 10, 2026

@mgoin @benchislett @LucasWilkinson @NickLucche PTAL, thanks

@flutist flutist changed the title fix mtp launch error in vllm-0.17.0 fix mtp launch error in vllm-0.17.0, about cuda graph during memory profile Mar 10, 2026
@mergify mergify bot added the nvidia label Mar 10, 2026
@flutist
Copy link
Contributor Author

flutist commented Mar 10, 2026

@Isotr0py I changed the implementation, PTAL, thanks

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a crash that occurs during CUDA graph capture when using MTP speculative decoding. The root cause was that certain Triton kernels required for speculative decoding were not being JIT-compiled during the warmup phase. The fix correctly populates the necessary draft token buffers during the dummy run, ensuring these kernels are compiled before graph capture begins. The change is well-targeted and effectively resolves the issue.

Note: Security Review did not run due to the size of the PR.

@MatthewBonanni
Copy link
Collaborator

#30515 is not included in v0.17.0, so it could not have caused this

@flutist
Copy link
Contributor Author

flutist commented Mar 11, 2026

#30515 is not included in v0.17.0, so it could not have caused this

Anyway, could you please take a look at this PR and see if it can solve the issue? Thanks.

@flutist
Copy link
Contributor Author

flutist commented Mar 11, 2026

It still happens in the v0.17.1-rc version:

(APIServer pid=60641) INFO 03-11 18:16:33 [utils.py:292] 
(APIServer pid=60641) INFO 03-11 18:16:33 [utils.py:292]        █     █     █▄   ▄█
(APIServer pid=60641) INFO 03-11 18:16:33 [utils.py:292]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1rc1.dev32+ge661b9ee8
(APIServer pid=60641) INFO 03-11 18:16:33 [utils.py:292]   █▄█▀ █     █     █     █  model   Qwen/Qwen3.5-0.8B
(APIServer pid=60641) INFO 03-11 18:16:33 [utils.py:292]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=60641) INFO 03-11 18:16:33 [utils.py:292] 
(APIServer pid=60641) INFO 03-11 18:16:33 [utils.py:228] non-default args: {'model_tag': 'Qwen/Qwen3.5-0.8B', 'model': 'Qwen/Qwen3.5-0.8B', 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 2}}










(APIServer pid=60641) INFO 03-11 18:17:11 [model.py:532] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=60641) INFO 03-11 18:17:11 [model.py:1562] Using max model len 262144
(APIServer pid=60641) INFO 03-11 18:17:25 [model.py:532] Resolved architecture: Qwen3_5MTP
(APIServer pid=60641) INFO 03-11 18:17:25 [model.py:1562] Using max model len 262144
(APIServer pid=60641) WARNING 03-11 18:17:25 [speculative.py:492] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=60641) INFO 03-11 18:17:25 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=60641) INFO 03-11 18:17:25 [config.py:224] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=60641) INFO 03-11 18:17:25 [config.py:255] Padding mamba page size by 0.37% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=60641) INFO 03-11 18:17:25 [vllm.py:748] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=67019) INFO 03-11 18:17:58 [core.py:101] Initializing a V1 LLM engine (v0.17.1rc1.dev32+ge661b9ee8) with config: model='Qwen/Qwen3.5-0.8B', speculative_config=SpeculativeConfig(method='mtp', model='Qwen/Qwen3.5-0.8B', num_spec_tokens=2), tokenizer='Qwen/Qwen3.5-0.8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3.5-0.8B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=67019) INFO 03-11 18:18:05 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://33.1.35.33:33735 backend=nccl
(EngineCore_DP0 pid=67019) INFO 03-11 18:18:06 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=67019) WARNING 03-11 18:18:08 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore_DP0 pid=67019) INFO 03-11 18:18:21 [gpu_model_runner.py:4496] Starting to load model Qwen/Qwen3.5-0.8B...
(EngineCore_DP0 pid=67019) INFO 03-11 18:18:22 [cuda.py:373] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=67019) INFO 03-11 18:18:22 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=67019) INFO 03-11 18:18:39 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=67019) INFO 03-11 18:18:39 [flash_attn.py:593] Using FlashAttention version 2
(EngineCore_DP0 pid=67019) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=67019) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:30<00:00, 30.08s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:30<00:00, 30.08s/it]
(EngineCore_DP0 pid=67019) 
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:12 [default_loader.py:293] Loading weights took 30.15 seconds
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:12 [gpu_model_runner.py:4520] Loading drafter model...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.31it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.31it/s]
(EngineCore_DP0 pid=67019) 
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:14 [default_loader.py:293] Loading weights took 0.48 seconds
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:14 [eagle.py:1354] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:14 [eagle.py:1408] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:15 [gpu_model_runner.py:4579] Model loading took 1.76 GiB memory and 52.580014 seconds
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:15 [gpu_model_runner.py:5501] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:49 [backends.py:988] Using cache directory: /home/admin/.cache/vllm/torch_compile_cache/5deb6febe8/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:49 [backends.py:1048] Dynamo bytecode transform time: 23.81 s
(EngineCore_DP0 pid=67019) INFO 03-11 18:19:52 [backends.py:371] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:09 [backends.py:387] Compiling a graph for compile range (1, 2048) takes 20.22 s
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:10 [decorators.py:611] saved AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/78ca6b65b9d11d63a8c5fb57593768f7d456e04a33ee7fcc05da5c011813d5c6/rank_0_0/model
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:10 [monitor.py:48] torch.compile took 45.04 s in total
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:11 [monitor.py:76] Initial profiling/warmup run took 1.12 s
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:12 [backends.py:988] Using cache directory: /home/admin/.cache/vllm/torch_compile_cache/5deb6febe8/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:12 [backends.py:1048] Dynamo bytecode transform time: 0.55 s
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:17 [backends.py:387] Compiling a graph for compile range (1, 2048) takes 5.62 s
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:17 [decorators.py:611] saved AOT compiled function to /home/admin/.cache/vllm/torch_compile_cache/torch_aot_compile/ba18043d7a509ea12be2d3277782a45a9b52b248f27339b50ac9477e04ac053c/rank_0_0/model
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:17 [monitor.py:48] torch.compile took 6.30 s in total
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:18 [monitor.py:76] Initial profiling/warmup run took 0.35 s
(EngineCore_DP0 pid=67019) WARNING 03-11 18:20:27 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 16.67% KV cache memory
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:27 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore_DP0 pid=67019) INFO 03-11 18:20:27 [gpu_model_runner.py:5620] Profiling CUDA graph memory: PIECEWISE=49 (largest=498), FULL=49 (largest=498)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] EngineCore failed to start.
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] Traceback (most recent call last):
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 301, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     output = self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_5.py", line 754, in forward
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     hidden_states = self.language_model.model(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 407, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self.fn(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1154, in forward
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     def forward(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/compilation/caching.py", line 206, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     raise e
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "<eval_with_key>.51", line 208, in forward
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     submod_1 = self.submod_1(getitem, s59, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     raise e
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "<eval_with_key>.53", line 5, in forward
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     gdn_attention_core = torch.ops.vllm.gdn_attention_core(mixed_qkv, b_1, a_1, core_attn_out, 'language_model.model.layers.0.linear_attn');  mixed_qkv = b_1 = a_1 = core_attn_out = gdn_attention_core = None
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1209, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1473, in gdn_attention_core
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     self._forward_core(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 704, in _forward_core
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     mixed_qkv_spec = causal_conv1d_update(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                      ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/mamba/ops/causal_conv1d.py", line 1196, in causal_conv1d_update
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     _causal_conv1d_update_kernel[grid](
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 370, in <lambda>
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/triton/runtime/jit.py", line 743, in run
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     launch_metadata = kernel.launch_metadata(grid, stream, *bound_args.values())
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/triton/compiler/compiler.py", line 482, in launch_metadata
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     self._init_handles()
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/triton/compiler/compiler.py", line 465, in _init_handles
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] 
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] 
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] Traceback (most recent call last):
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1088, in run_engine_core
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 832, in __init__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     super().__init__(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 243, in _initialize_kv_caches
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 397, in determine_available_memory
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5649, in profile_cudagraph_memory
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     self._warmup_and_capture(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5789, in _warmup_and_capture
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     self._dummy_run(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5223, in _dummy_run
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     outputs = self.model(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]               ^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 295, in __call__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     with torch.cuda.graph(
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]          ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 268, in __exit__
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     self.cuda_graph.capture_end()
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 130, in capture_end
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098]     super().capture_end()
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=67019) ERROR 03-11 18:21:28 [core.py:1098] 
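The root cause visible in the trace: `causal_conv1d_update` launches a Triton kernel that JIT-compiles on first use, and with speculative decoding enabled that first launch happens inside `torch.cuda.graph` capture, which CUDA forbids (`operation not permitted when stream is capturing`). A minimal GPU-free sketch of the pattern and the fix, warming the kernel up before capture (class and function names here are illustrative, not vLLM or Triton APIs):

```python
class JitKernel:
    """Compiles lazily on first launch, like a Triton kernel."""

    def __init__(self):
        self.compiled = False

    def launch(self, capturing: bool) -> str:
        if not self.compiled:
            if capturing:
                # Mirrors: "Triton Error [CUDA]: operation not permitted
                # when stream is capturing"
                raise RuntimeError(
                    "operation not permitted when stream is capturing"
                )
            self.compiled = True  # JIT compilation happens on first launch
        return "ok"


# Without a warmup run, the first launch lands inside graph capture and fails:
kernel = JitKernel()
try:
    kernel.launch(capturing=True)
except RuntimeError as exc:
    print("capture failed:", exc)

# The fix this PR takes: run the spec-decode path once eagerly so the
# kernel is already compiled, then graph capture succeeds.
kernel = JitKernel()
kernel.launch(capturing=False)        # warmup populates the JIT cache
print(kernel.launch(capturing=True))  # → ok
```

This is why populating the spec-decode buffers in the warmup `_dummy_run` is enough: once the Triton binary is cached, replaying the same launch under capture records only the kernel launch, not the compile.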
(EngineCore_DP0 pid=67019)     self.module, self.function, self.n_regs, self.n_spills, self.n_max_threads = driver.active.utils.load_binary(
(EngineCore_DP0 pid=67019)                                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019) RuntimeError: Triton Error [CUDA]: operation not permitted when stream is capturing
(EngineCore_DP0 pid=67019) 
(EngineCore_DP0 pid=67019) During handling of the above exception, another exception occurred:
(EngineCore_DP0 pid=67019) 
(EngineCore_DP0 pid=67019) Traceback (most recent call last):
(EngineCore_DP0 pid=67019)   File "/home/admin/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=67019)     self.run()
(EngineCore_DP0 pid=67019)   File "/home/admin/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=67019)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1102, in run_engine_core
(EngineCore_DP0 pid=67019)     raise e
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1088, in run_engine_core
(EngineCore_DP0 pid=67019)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=67019)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=67019)     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 832, in __init__
(EngineCore_DP0 pid=67019)     super().__init__(
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=67019)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=67019)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=67019)     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 243, in _initialize_kv_caches
(EngineCore_DP0 pid=67019)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=67019)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore_DP0 pid=67019)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=67019)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore_DP0 pid=67019)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=67019)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=67019)     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=67019)     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 397, in determine_available_memory
(EngineCore_DP0 pid=67019)     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(EngineCore_DP0 pid=67019)                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=67019)     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5649, in profile_cudagraph_memory
(EngineCore_DP0 pid=67019)     self._warmup_and_capture(
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5789, in _warmup_and_capture
(EngineCore_DP0 pid=67019)     self._dummy_run(
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=67019)     return func(*args, **kwargs)
(EngineCore_DP0 pid=67019)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5223, in _dummy_run
(EngineCore_DP0 pid=67019)     outputs = self.model(
(EngineCore_DP0 pid=67019)               ^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 295, in __call__
(EngineCore_DP0 pid=67019)     with torch.cuda.graph(
(EngineCore_DP0 pid=67019)          ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 268, in __exit__
(EngineCore_DP0 pid=67019)     self.cuda_graph.capture_end()
(EngineCore_DP0 pid=67019)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/torch/cuda/graphs.py", line 130, in capture_end
(EngineCore_DP0 pid=67019)     super().capture_end()
(EngineCore_DP0 pid=67019) torch.AcceleratorError: CUDA error: operation failed due to a previous error during capture
(EngineCore_DP0 pid=67019) Search for `cudaErrorStreamCaptureInvalidated' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=67019) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=67019) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=67019) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=67019) 
[rank0]:[W311 18:21:29.449043988 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=60641) Traceback (most recent call last):
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=60641)     sys.exit(main())
(APIServer pid=60641)              ^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=60641)     args.dispatch_function(args)
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=60641)     uvloop.run(run_server(args))
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=60641)     return __asyncio.run(
(APIServer pid=60641)            ^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=60641)     return runner.run(main)
(APIServer pid=60641)            ^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=60641)     return self._loop.run_until_complete(task)
(APIServer pid=60641)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=60641)     return await main
(APIServer pid=60641)            ^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 642, in run_server
(APIServer pid=60641)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 656, in run_server_worker
(APIServer pid=60641)     async with build_async_engine_client(
(APIServer pid=60641)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=60641)     return await anext(self.gen)
(APIServer pid=60641)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 101, in build_async_engine_client
(APIServer pid=60641)     async with build_async_engine_client_from_engine_args(
(APIServer pid=60641)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=60641)     return await anext(self.gen)
(APIServer pid=60641)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 142, in build_async_engine_client_from_engine_args
(APIServer pid=60641)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=60641)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=60641)     return cls(
(APIServer pid=60641)            ^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=60641)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=60641)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=60641)     return func(*args, **kwargs)
(APIServer pid=60641)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=60641)     return AsyncMPClient(*client_args)
(APIServer pid=60641)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=60641)     return func(*args, **kwargs)
(APIServer pid=60641)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 912, in __init__
(APIServer pid=60641)     super().__init__(
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 570, in __init__
(APIServer pid=60641)     with launch_core_engines(
(APIServer pid=60641)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=60641)   File "/home/admin/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=60641)     next(self.gen)
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=60641)     wait_for_engine_startup(
(APIServer pid=60641)   File "/home/admin/workspace/aop_lab/app_source/test_vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=60641)     raise RuntimeError(
(APIServer pid=60641) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
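For context, the failure mode in the trace above can be sketched with simplified pure-Python stand-ins (`LazyKernel` and `capture_graph` below are illustrative names, not actual Triton or vLLM APIs): a Triton kernel JIT-compiles and loads its binary on first launch, which CUDA forbids while a stream is being captured, so a dummy run that exercises the spec-decode path before graph capture sidesteps the error.

```python
class LazyKernel:
    """Stand-in for a Triton @jit kernel: compiles lazily on first call."""

    def __init__(self):
        self.compiled = False

    def __call__(self, capturing: bool) -> str:
        if not self.compiled:
            if capturing:
                # Mirrors: "Triton Error [CUDA]: operation not permitted
                # when stream is capturing" — module loading is forbidden
                # inside an active CUDA graph capture.
                raise RuntimeError(
                    "operation not permitted when stream is capturing")
            self.compiled = True  # JIT/binary load happens here, outside capture
        return "launched"


def capture_graph(warm_up_first: bool) -> str:
    kernel = LazyKernel()
    if warm_up_first:
        # Dummy run with populated buffers triggers the JIT ahead of time.
        kernel(capturing=False)
    # During capture, an already-compiled kernel only records its launch.
    return kernel(capturing=True)
```

Calling `capture_graph(warm_up_first=False)` raises the capture-time error, while `capture_graph(warm_up_first=True)` succeeds; this PR applies the same idea by warming up the GDN spec-decode kernels before `torch.cuda.graph` capture in the memory-profiling path.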

@flutist flutist changed the title fix mtp launch error in vllm-0.17.0, about cuda graph during memory profile fix mtp launch error in vllm-0.17.1-rc, about cuda graph during memory profile Mar 11, 2026
@ZJY0516
Member

ZJY0516 commented Mar 13, 2026

We have fixed the warmup issue in #36599

@JaheimLee

It works for me.

@JaheimLee

Found an assertion error after some time.

/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 03-13 17:29:36 [model.py:533] Resolved architecture: Qwen3_5ForConditionalGeneration
INFO 03-13 17:29:36 [model.py:1580] Using max model len 20000
INFO 03-13 17:29:38 [model.py:533] Resolved architecture: Qwen3_5MTP
INFO 03-13 17:29:38 [model.py:1580] Using max model len 262144
WARNING 03-13 17:29:38 [speculative.py:499] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
INFO 03-13 17:29:38 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 03-13 17:29:38 [config.py:384] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
INFO 03-13 17:29:38 [config.py:404] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
INFO 03-13 17:29:38 [config.py:224] Setting attention block size to 800 tokens to ensure that attention page size is >= mamba page size.
INFO 03-13 17:29:38 [config.py:255] Padding mamba page size by 0.25% to ensure that mamba page size and attention page size are exactly equal.
INFO 03-13 17:29:38 [vllm.py:748] Asynchronous scheduling is enabled.
WARNING 03-13 17:29:38 [vllm.py:1236] Batch sizes [1] are removed because they are not multiple of tp_size 2 when sequence parallelism is enabled
INFO 03-13 17:29:38 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant, attn_quant, gemm_comms
INFO 03-13 17:29:40 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
(EngineCore pid=3216204) INFO 03-13 17:29:50 [core.py:101] Initializing a V1 LLM engine (v0.17.1rc1.dev123+g10f08dedf) with config: model='/data/pretrained_models/Qwen3.5-27B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='/data/pretrained_models/Qwen3.5-27B-FP8', num_spec_tokens=3), tokenizer='/data/pretrained_models/Qwen3.5-27B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/pretrained_models/Qwen3.5-27B-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+rms_norm', '+quant_fp8', '+rotary_embedding', '+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 
'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': True, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': True, 'enable_sp': True, 'fuse_gemm_comms': True, 'fuse_allreduce_rms': False, 'enable_qk_norm_rope_fusion': True, 'sp_min_token_num': 4096}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}
(EngineCore pid=3216204) INFO 03-13 17:29:50 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.8.57 (local), world_size=2, local_world_size=2
/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/cuda/__init__.py:65: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 03-13 17:30:01 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=3216417) INFO 03-13 17:30:01 [parallel_state.py:1395] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:54271 backend=nccl
INFO 03-13 17:30:01 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(Worker pid=3216418) INFO 03-13 17:30:04 [parallel_state.py:1395] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:54271 backend=nccl
(Worker pid=3216417) <frozen importlib._bootstrap_external>:1328: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=3216417) <frozen importlib._bootstrap_external>:1328: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=3216417) INFO 03-13 17:30:05 [pynccl.py:111] vLLM is using nccl==2.29.7
(Worker pid=3216418) <frozen importlib._bootstrap_external>:1328: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=3216418) <frozen importlib._bootstrap_external>:1328: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=3216417) WARNING 03-13 17:30:08 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=3216418) WARNING 03-13 17:30:08 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(Worker pid=3216417) WARNING 03-13 17:30:08 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=3216418) WARNING 03-13 17:30:08 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(Worker pid=3216417) INFO 03-13 17:30:08 [parallel_state.py:1717] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=3216418) INFO 03-13 17:30:08 [parallel_state.py:1717] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank N/A, EPLB rank N/A
(Worker pid=3216417) INFO 03-13 17:30:09 [topk_topp_sampler.py:51] Using FlashInfer for top-p & top-k sampling.
(Worker pid=3216417) WARNING 03-13 17:30:09 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0 pid=3216417) INFO 03-13 17:30:09 [gpu_model_runner.py:4501] Starting to load model /data/pretrained_models/Qwen3.5-27B-FP8...
(Worker pid=3216418) WARNING 03-13 17:30:09 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(Worker_TP0 pid=3216417) INFO 03-13 17:30:09 [cuda.py:373] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=3216417) INFO 03-13 17:30:09 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP1 pid=3216418) INFO 03-13 17:30:09 [cuda.py:373] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP1 pid=3216418) INFO 03-13 17:30:09 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=3216417) INFO 03-13 17:30:09 [cuda.py:257] Using AttentionBackendEnum.FLASH_ATTN backend.
(Worker_TP0 pid=3216417) INFO 03-13 17:30:09 [flash_attn.py:593] Using FlashAttention version 2
(Worker_TP1 pid=3216418) INFO 03-13 17:30:09 [cuda.py:257] Using AttentionBackendEnum.FLASH_ATTN backend.
(Worker_TP0 pid=3216417) WARNING 03-13 17:30:10 [compilation.py:1136] Op 'rms_norm' not present in model, enabling with '+rms_norm' has no effect
(Worker_TP1 pid=3216418) WARNING 03-13 17:30:10 [compilation.py:1136] Op 'rms_norm' not present in model, enabling with '+rms_norm' has no effect
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:   0% Completed | 0/11 [00:00<?, ?it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:   9% Completed | 1/11 [00:00<00:05,  1.73it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  18% Completed | 2/11 [00:01<00:05,  1.67it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  27% Completed | 3/11 [00:01<00:04,  1.62it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  36% Completed | 4/11 [00:02<00:04,  1.59it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  45% Completed | 5/11 [00:03<00:03,  1.57it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  55% Completed | 6/11 [00:03<00:03,  1.55it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  64% Completed | 7/11 [00:04<00:02,  1.56it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  73% Completed | 8/11 [00:04<00:01,  1.68it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  82% Completed | 9/11 [00:05<00:01,  1.85it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  91% Completed | 10/11 [00:05<00:00,  1.85it/s]
(Worker_TP1 pid=3216418) WARNING 03-13 17:30:16 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00,  2.19it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:06<00:00,  1.79it/s]
(Worker_TP0 pid=3216417) 
(Worker_TP0 pid=3216417) INFO 03-13 17:30:16 [default_loader.py:293] Loading weights took 6.15 seconds
(Worker_TP0 pid=3216417) WARNING 03-13 17:30:16 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(Worker_TP1 pid=3216418) INFO 03-13 17:30:16 [gpu_model_runner.py:4525] Loading drafter model...
(Worker_TP0 pid=3216417) INFO 03-13 17:30:16 [gpu_model_runner.py:4525] Loading drafter model...
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:   0% Completed | 0/11 [00:00<?, ?it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:   9% Completed | 1/11 [00:00<00:05,  1.92it/s]
(Worker_TP1 pid=3216418) INFO 03-13 17:30:17 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP1 pid=3216418) INFO 03-13 17:30:17 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards:  18% Completed | 2/11 [00:00<00:02,  3.54it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:00<00:00, 21.84it/s]
(Worker_TP0 pid=3216417) 
Loading safetensors checkpoint shards: 100% Completed | 11/11 [00:00<00:00, 14.34it/s]
(Worker_TP0 pid=3216417) 
(Worker_TP0 pid=3216417) INFO 03-13 17:30:17 [default_loader.py:293] Loading weights took 0.77 seconds
(Worker_TP0 pid=3216417) INFO 03-13 17:30:17 [eagle.py:1365] Detected MTP model. Sharing target model embedding weights with the draft model.
(Worker_TP0 pid=3216417) INFO 03-13 17:30:17 [eagle.py:1419] Detected MTP model. Sharing target model lm_head weights with the draft model.
(Worker_TP0 pid=3216417) INFO 03-13 17:30:18 [gpu_model_runner.py:4584] Model loading took 14.39 GiB memory and 7.954239 seconds
(Worker_TP0 pid=3216417) INFO 03-13 17:30:25 [backends.py:988] Using cache directory: /home/mosh/.cache/vllm/torch_compile_cache/6922666318/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=3216417) INFO 03-13 17:30:25 [backends.py:1048] Dynamo bytecode transform time: 6.71 s
(Worker_TP1 pid=3216418) INFO 03-13 17:30:27 [decorators.py:296] Directly load AOT compilation from path /home/mosh/.cache/vllm/torch_compile_cache/torch_aot_compile/b4ee554bf18c035e549938c994fc1ec663a9f5d5520cb602335225806ff84f95/rank_1_0/model
(Worker_TP0 pid=3216417) INFO 03-13 17:30:27 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.475 s
(Worker_TP0 pid=3216417) INFO 03-13 17:30:27 [monitor.py:48] torch.compile took 9.31 s in total
(Worker_TP0 pid=3216417) INFO 03-13 17:30:27 [decorators.py:296] Directly load AOT compilation from path /home/mosh/.cache/vllm/torch_compile_cache/torch_aot_compile/b4ee554bf18c035e549938c994fc1ec663a9f5d5520cb602335225806ff84f95/rank_0_0/model
(Worker_TP1 pid=3216418) /data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP1 pid=3216418)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=3216417) /data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (16) < num_heads (24). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(Worker_TP0 pid=3216417)   return fn(*contiguous_args, **contiguous_kwargs)
(Worker_TP0 pid=3216417) INFO 03-13 17:30:49 [monitor.py:76] Initial profiling/warmup run took 21.91 s
(Worker_TP0 pid=3216417) INFO 03-13 17:30:50 [backends.py:988] Using cache directory: /home/mosh/.cache/vllm/torch_compile_cache/6922666318/rank_0_0/eagle_head for vLLM's torch.compile
(Worker_TP0 pid=3216417) INFO 03-13 17:30:50 [backends.py:1048] Dynamo bytecode transform time: 0.22 s
(Worker_TP1 pid=3216418) INFO 03-13 17:30:51 [decorators.py:296] Directly load AOT compilation from path /home/mosh/.cache/vllm/torch_compile_cache/torch_aot_compile/ec3d2674d11c2f084a8d854ef7f6a45aa36788a9276db3d57e7c67e247925f88/rank_1_0/model
(Worker_TP0 pid=3216417) INFO 03-13 17:30:51 [backends.py:284] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.111 s
(Worker_TP0 pid=3216417) INFO 03-13 17:30:51 [monitor.py:48] torch.compile took 1.71 s in total
(Worker_TP0 pid=3216417) INFO 03-13 17:30:51 [decorators.py:296] Directly load AOT compilation from path /home/mosh/.cache/vllm/torch_compile_cache/torch_aot_compile/ec3d2674d11c2f084a8d854ef7f6a45aa36788a9276db3d57e7c67e247925f88/rank_0_0/model
(Worker_TP0 pid=3216417) INFO 03-13 17:30:51 [monitor.py:76] Initial profiling/warmup run took 0.06 s
(Worker_TP0 pid=3216417) WARNING 03-13 17:30:52 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP0 pid=3216417) INFO 03-13 17:30:52 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=16
(Worker_TP0 pid=3216417) INFO 03-13 17:30:52 [gpu_model_runner.py:5637] Profiling CUDA graph memory: PIECEWISE=3 (largest=16), FULL=3 (largest=16)
(Worker_TP1 pid=3216418) WARNING 03-13 17:30:52 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 6.25% KV cache memory
(Worker_TP1 pid=3216418) INFO 03-13 17:30:52 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=16
(Worker_TP1 pid=3216418) INFO 03-13 17:30:52 [gpu_model_runner.py:5637] Profiling CUDA graph memory: PIECEWISE=3 (largest=16), FULL=3 (largest=16)
(Worker_TP0 pid=3216417) INFO 03-13 17:30:54 [gpu_model_runner.py:5716] Estimated CUDA graph memory: 0.46 GiB total
(Worker_TP1 pid=3216418) INFO 03-13 17:30:54 [gpu_model_runner.py:5716] Estimated CUDA graph memory: 0.46 GiB total
(Worker_TP0 pid=3216417) INFO 03-13 17:30:55 [gpu_worker.py:452] Available KV cache memory: 5.36 GiB
(Worker_TP0 pid=3216417) INFO 03-13 17:30:55 [gpu_worker.py:468] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9000 is equivalent to --gpu-memory-utilization=0.8806 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9194.
(Worker_TP1 pid=3216418) INFO 03-13 17:30:55 [gpu_worker.py:468] CUDA graph memory profiling is enabled (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1). This will become the default in v0.19. The current --gpu-memory-utilization=0.9000 is equivalent to --gpu-memory-utilization=0.8806 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9194.
(EngineCore pid=3216204) WARNING 03-13 17:30:55 [kv_cache_utils.py:1054] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=3216204) INFO 03-13 17:30:55 [kv_cache_utils.py:1314] GPU KV cache size: 40,800 tokens
(EngineCore pid=3216204) INFO 03-13 17:30:55 [kv_cache_utils.py:1319] Maximum concurrency for 20,000 tokens per request: 5.15x
(Worker_TP0 pid=3216417) 
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 3/3 [00:00<00:00,  6.63it/s]
(Worker_TP0 pid=3216417) 
Capturing CUDA graphs (decode, FULL): 100%|██████████| 3/3 [00:00<00:00,  4.76it/s]
(Worker_TP0 pid=3216417) INFO 03-13 17:30:57 [gpu_model_runner.py:5776] Graph capturing finished in 2 secs, took 0.21 GiB
(Worker_TP0 pid=3216417) INFO 03-13 17:30:57 [gpu_worker.py:614] CUDA graph pool memory: 0.21 GiB (actual), 0.46 GiB (estimated), difference: 0.24 GiB (113.6%).
(Worker_TP1 pid=3216418) INFO 03-13 17:30:57 [gpu_worker.py:614] CUDA graph pool memory: 0.21 GiB (actual), 0.46 GiB (estimated), difference: 0.24 GiB (113.6%).
(EngineCore pid=3216204) INFO 03-13 17:30:57 [core.py:279] init engine (profile, create kv cache, warmup model) took 39.01 seconds
(EngineCore pid=3216204) INFO 03-13 17:31:00 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=3216204) INFO 03-13 17:31:00 [vllm.py:748] Asynchronous scheduling is enabled.
(EngineCore pid=3216204) INFO 03-13 17:31:00 [compilation.py:289] Enabled custom fusions: norm_quant, act_quant, attn_quant, gemm_comms
WARNING 03-13 17:31:00 [input_processor.py:227] Passing raw prompts to InputProcessor is deprecated and will be removed in v0.18. You should instead pass the outputs of Renderer.render_cmpl() or Renderer.render_chat().
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932] WorkerProc hit an exception.
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932] Traceback (most recent call last):
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/executor/multiproc_executor.py", line 927, in worker_busy_loop
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     output = func(*args, **kwargs)
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     return self.worker.execute_model(scheduler_output)
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]            ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 819, in execute_model
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     output = self.model_runner.execute_model(
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         scheduler_output, intermediate_tensors
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     )
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3723, in execute_model
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     self._build_attention_metadata(
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         num_tokens=num_tokens_unpadded,
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ...<9 lines>...
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         slot_mappings=slot_mappings_by_group,
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     )
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ^
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2102, in _build_attention_metadata
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2053, in _build_attn_group_metadata
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     attn_metadata_i = builder.build(
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         common_prefix_len=cascade_attn_prefix_len,
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         common_attn_metadata=common_attn_metadata,
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         **extra_attn_metadata_args,
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     )
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/attention/backends/gdn_attn.py", line 310, in build
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     assert not (num_decodes > 0 and num_spec_decodes > 0), (
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932] AssertionError: num_decodes: 1, num_spec_decodes: 8
(Worker_TP1 pid=3216418) ERROR 03-14 04:24:54 [multiproc_executor.py:932] 
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932] WorkerProc hit an exception.
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932] Traceback (most recent call last):
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/executor/multiproc_executor.py", line 927, in worker_busy_loop
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     output = func(*args, **kwargs)
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     return self.worker.execute_model(scheduler_output)
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]            ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 819, in execute_model
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     output = self.model_runner.execute_model(
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         scheduler_output, intermediate_tensors
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     )
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3723, in execute_model
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     self._build_attention_metadata(
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         num_tokens=num_tokens_unpadded,
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ...<9 lines>...
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         slot_mappings=slot_mappings_by_group,
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     )
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ^
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2102, in _build_attention_metadata
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2053, in _build_attn_group_metadata
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     attn_metadata_i = builder.build(
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         common_prefix_len=cascade_attn_prefix_len,
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         common_attn_metadata=common_attn_metadata,
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]         **extra_attn_metadata_args,
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     )
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]   File "/data/lijinghui/uv_projects/.venv/lib/python3.13/site-packages/vllm/v1/attention/backends/gdn_attn.py", line 310, in build
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]     assert not (num_decodes > 0 and num_spec_decodes > 0), (
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932] AssertionError: num_decodes: 1, num_spec_decodes: 8
(Worker_TP0 pid=3216417) ERROR 03-14 04:24:54 [multiproc_executor.py:932] 
(EngineCore pid=3216204) ERROR 03-14 04:24:54 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1rc1.dev123+g10f08dedf) with config: model='/data/pretrained_models/Qwen3.5-27B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='/data/pretrained_models/Qwen3.5-27B-FP8', num_spec_tokens=3), tokenizer='/data/pretrained_models/Qwen3.5-27B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/pretrained_models/Qwen3.5-27B-FP8, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+rms_norm', '+quant_fp8', '+rotary_embedding', '+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 
'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': True, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': True, 'enable_sp': True, 'fuse_gemm_comms': True, 'fuse_allreduce_rms': False, 'enable_qk_norm_rope_fusion': True, 'sp_min_token_num': 4096}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, 
(EngineCore pid=3216204) ERROR 03-14 04:24:54 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['9a57663d1cafeb321773432905.889063-ad8e8cb5', '8c909591f98296871773433396.7792103-ae5c8d94', 'b11fe3897e83042e1773433424.5964267-899b383a', '9cb8db609a3f355d1773433431.0161006-b4eeaa62', '97cb35a4c5a4335b1773433441.9809551-9ae80e12', 'b22de8b0966403441773433445.1945488-aeb427f8', '8c2183ffc71568031773433453.1854048-84158659', '816b4afcb646fc9c1773433470.2651858-a3139cf1', '81f127132d613c781773433479.5284767-95942297'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[None, None, None, None, None, None, None, None, None],num_computed_tokens=[19998, 3097, 2308, 2121, 1865, 2262, 1637, 1172, 992],num_output_tokens=[19366, 2321, 1570, 1451, 1116, 1141, 975, 543, 381]), num_scheduled_tokens={816b4afcb646fc9c1773433470.2651858-a3139cf1: 4, 8c2183ffc71568031773433453.1854048-84158659: 4, 9a57663d1cafeb321773432905.889063-ad8e8cb5: 1, 81f127132d613c781773433479.5284767-95942297: 4, b11fe3897e83042e1773433424.5964267-899b383a: 4, b22de8b0966403441773433445.1945488-aeb427f8: 4, 8c909591f98296871773433396.7792103-ae5c8d94: 4, 97cb35a4c5a4335b1773433441.9809551-9ae80e12: 4, 9cb8db609a3f355d1773433431.0161006-b4eeaa62: 4}, total_num_scheduled_tokens=33, scheduled_spec_decode_tokens={b22de8b0966403441773433445.1945488-aeb427f8: [-1, -1, -1], 8c2183ffc71568031773433453.1854048-84158659: [-1, -1, -1], 9cb8db609a3f355d1773433431.0161006-b4eeaa62: [-1, -1, -1], 81f127132d613c781773433479.5284767-95942297: [-1, -1, -1], 97cb35a4c5a4335b1773433441.9809551-9ae80e12: [-1, -1, -1], 816b4afcb646fc9c1773433470.2651858-a3139cf1: [-1, -1, -1], 8c909591f98296871773433396.7792103-ae5c8d94: [-1, -1, -1], b11fe3897e83042e1773433424.5964267-899b383a: [-1, -1, -1]}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], 
free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)

@benchislett
Collaborator

@vadiklyutiy didn't we fix this?

@benchislett
Collaborator

That assertion fail looks like a known issue, I thought we merged a fix

@flutist
Contributor Author

flutist commented Mar 15, 2026

That assertion fail looks like a known issue, I thought we merged a fix

Could you help merge the PR?

@flutist flutist closed this Mar 16, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 16, 2026
@flutist flutist reopened this Mar 16, 2026
@flutist
Contributor Author

flutist commented Mar 16, 2026

That assertion fail looks like a known issue, I thought we merged a fix

@benchislett Sorry to bother you, but could you please help merge this PR? It solves the problem. If there is anything else I can do, I will keep working on it. I would be very happy to hear your response.

@flutist
Contributor Author

flutist commented Mar 16, 2026

@DarkLight1337 @Isotr0py Could you help take a look? Thanks.

@benchislett benchislett requested a review from tdoublep March 16, 2026 20:44
Comment on lines +5148 to +5150
self.input_batch.num_accepted_tokens_cpu[:num_reqs] = max_query_len
self.num_decode_draft_tokens.np[:num_reqs] = max_query_len - 1
self.num_decode_draft_tokens.np[num_reqs:].fill(-1)
Collaborator


Can you explain the choice of the values in each of these cases? Is the .fill(-1) the correct convention?

Contributor Author

@flutist flutist Mar 17, 2026


num_accepted_tokens_cpu[:num_reqs] = max_query_len
In spec decode, each request carries 1 original token plus (max_query_len - 1) draft tokens. Setting this to max_query_len means "all drafts accepted": a dummy value whose only purpose is to trigger the spec-decode Triton JIT warmup.

num_decode_draft_tokens.np[:num_reqs] = max_query_len - 1
The draft count is the total minus 1. This lets GDN attention see IS_SPEC_DECODING=True.

num_decode_draft_tokens.np[num_reqs:].fill(-1)
-1 means "not a decode request". The consumer uses >= 0 as the mask:
spec_sequence_masks_cpu = num_decode_draft_tokens_cpu >= 0  # True = decode, False = prefill/unused
Filling with 0 would be wrong, since it would be indistinguishable from "decode with 0 drafts". Unused padding slots must be -1.
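The padding convention above can be sketched as follows. This is an illustrative reconstruction, not the actual vLLM buffer code: it uses a plain Python list in place of the real NumPy-backed buffer, and the sizes (max_num_reqs, num_reqs, max_query_len) are made-up example values.

```python
max_num_reqs = 8      # capacity of the padded buffer (hypothetical)
num_reqs = 3          # dummy requests active during warmup
max_query_len = 4     # 1 original token + 3 draft tokens

# Active slots hold the draft count; unused padding slots hold -1.
num_decode_draft_tokens = (
    [max_query_len - 1] * num_reqs + [-1] * (max_num_reqs - num_reqs)
)

# The consumer derives the spec-decode mask with `>= 0`:
# True = decode request, False = prefill/unused slot.
spec_sequence_masks = [n >= 0 for n in num_decode_draft_tokens]
print(spec_sequence_masks)
# [True, True, True, False, False, False, False, False]

# Padding with 0 instead of -1 would be wrong: a 0 slot would pass the
# `>= 0` test and be mistaken for "decode request with 0 draft tokens".
```

This is why .fill(-1) rather than .fill(0) is the correct convention for the tail of the buffer.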

Contributor Author


@benchislett Sorry to bother you during your busy schedule. Could you please take a look and let me know if there is anything else I need to modify? If everything looks okay, could you approve this PR when you have a moment?

Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
Signed-off-by: xjx <30485581+flutist@users.noreply.github.com>
@ZJY0516
Member

ZJY0516 commented Mar 17, 2026

assert not (num_decodes > 0 and num_spec_decodes > 0)

This has been fixed in #34871

@flutist flutist requested a review from benchislett March 18, 2026 05:40