feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support #31740
seli-equinix wants to merge 13 commits into vllm-project:main from
Conversation
Code Review
This pull request adds support for NVIDIA GB10 (SM121) GPUs by introducing an is_blackwell_class() method and replacing hardcoded checks for SM100. The changes are well structured and cover multiple parts of the codebase, including attention backends and quantization layers.
I've identified a critical issue regarding an inconsistency between a comment and code for FlashInfer autotuning, which could impact performance on new hardware. Additionally, there are a couple of instances of code duplication for the new is_blackwell_class logic that should be addressed to improve maintainability.
Overall, this is a good contribution to extend hardware support. Addressing the identified issues will make the changes more robust and easier to maintain.
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
|
Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
Pull request overview
This PR extends vLLM's Blackwell architecture support to include the SM121/GB10 GPU found in DGX Spark devices, moving from device-specific checks to a unified Blackwell-class detection approach.
Key changes:
- Introduced an is_blackwell_class() method to detect SM10x, SM11x, and SM12x GPUs as a unified Blackwell family
- Replaced scattered is_device_capability_family(100) checks with the new class-based detection throughout the codebase
- Added MoE configuration files optimized for GB10 hardware
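As a rough sketch of the difference between the two detection styles (the class and method names mirror the PR, but the bodies below are hypothetical stand-ins, not the actual vLLM implementation):

```python
class CudaPlatform:
    """Toy stand-in for a platform object that exposes the device compute capability."""

    def __init__(self, major: int, minor: int) -> None:
        self.major, self.minor = major, minor

    def is_device_capability_family(self, family: int) -> bool:
        # Old style: match one specific family, e.g. 100 matches SM10x only.
        return self.major * 10 == family

    def is_blackwell_class(self) -> bool:
        # New style: SM10x, SM11x, and SM12x all count as Blackwell-class.
        return self.major in (10, 11, 12)


# SM121 (GB10) fails the old SM100-family check but passes the class-based one.
gb10 = CudaPlatform(12, 1)
print(gb10.is_device_capability_family(100))  # False
print(gb10.is_blackwell_class())              # True
```

The point of the class-based check is that new Blackwell-family parts like GB10 no longer need a new hardcoded capability value at every call site.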
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| vllm/platforms/interface.py | Added is_blackwell_class() method to platform interface with documentation for SM10x/11x/12x detection |
| vllm/platforms/cuda.py | Implemented Blackwell-class detection helper and integrated it into backend priority selection |
| vllm/attention/utils/fa_utils.py | Extended Flash Attention v3 fallback logic to include SM12x devices |
| vllm/v1/attention/backends/flashinfer.py | Updated HND layout detection and head dimension validation for Blackwell-class |
| vllm/v1/attention/backends/mla/*.py | Extended MLA backend compute capability checks to support SM12x |
| vllm/utils/*.py | Updated FlashInfer and DeepGemm utility functions to use Blackwell-class detection |
| vllm/model_executor/layers/quantization/*.py | Updated quantization backend selection logic for Blackwell-class |
| vllm/model_executor/layers/fused_moe/configs/*.json | Added GB10-specific MoE configuration files with Triton kernel parameters |
| vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py | Updated DeepGemm packed activation scale support check |
| vllm/model_executor/layers/batch_invariant.py | Extended batch-invariant mode enablement to Blackwell-class |
| vllm/model_executor/models/config.py | Updated kernel block alignment check for Blackwell-class |
| vllm/model_executor/warmup/kernel_warmup.py | Updated comment to reflect Blackwell-class architecture support |
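The GB10 MoE config files listed above map batch-size buckets to Triton kernel tile parameters. A hypothetical fragment of such a config, shown as a Python dict (the key names follow the fused-MoE Triton convention; the values here are illustrative only, not the tuned GB10 numbers):

```python
# Maps a batch-size bucket ("1") to Triton kernel tile parameters (illustrative values).
gb10_moe_config = {
    "1": {
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 1,
        "num_warps": 4,
        "num_stages": 3,
    },
}
print(gb10_moe_config["1"]["BLOCK_SIZE_M"])  # 16
```

Without a matching per-device file, the kernel falls back to a default config, which is what the "Performance might be sub-optimal" warning in the logs further down refers to.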
...ed_moe/configs/E=128,N=768,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
|
Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
vllm/model_executor/layers/quantization/utils/flashinfer_utils.py
|
@seli-equinix - what were your "before" numbers? I'm seeing about the same (or even better) performance without this change on my Spark with FP8 models. Have you tested any FP4 (NVFP4/MXFP4) models?

Also, what do you mean by "context" here? Is it an allocated KV cache size or inference numbers for request context of this size? If the latter, it's great, but I'd like to see the measurement methodology. If the former, then this is from a few weeks ago on main branch vLLM and a single DGX Spark:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888

Jan: 44 t/s

vllm bench serve \
    --backend vllm \
    --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    --endpoint /v1/completions \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --port 8000 \
    --num-prompts 1
|
@copilot open a new pull request to apply changes based on the comments in this thread |
|
Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
Force-pushed 2434173 to 642709a
|
Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically.
|
Force-pushed d8ae522 to 9fbcb4c
|
@eugr please reach out to me at hellohal2064@gmail.com, or you can give me a call at 971-708-9761. I'd be happy to collaborate on vLLM running on the DGX Spark :)
|
@eugr thanks for the detailed questions! I ran benchmarks on my Spark using the same methodology:

My Setup:
Model: Qwen3-Next-80B-A3B-FP8

============ Serving Benchmark Result ============
|
@seli-equinix - it's never been an issue for me on the main branch (after the initial Spark woes were addressed). Here is the model running on vLLM compiled from the main branch yesterday using my Docker build:

There are a few warnings during inference that I hadn't seen a month ago when I last ran the model, but they don't seem to affect anything:

Benchmark from today's run:
|
@eugr That is really strange; I could not get the Docker (29.1.x) container to build without the changes to the code in this PR. This is my setup from inside the container:
|
It could be PyTorch. I had to revert to the release version of PyTorch, as pre-release ones were giving me weird errors.

CUDA 13.1 (but it worked with 13.0.2 just as well)

Having said that, I'll try to apply your patch post-build and see if it improves anything.
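The PyTorch UserWarning that shows up in the logs further down ("Found GPU0 NVIDIA GB10 which is of cuda capability 12.1 ... (8.0) - (12.0)") boils down to a simple capability-range comparison. A hedged sketch of that check (the window bounds are taken from the warning text; the function name is made up for illustration):

```python
def capability_supported(
    cap: tuple[int, int],
    min_cap: tuple[int, int] = (8, 0),
    max_cap: tuple[int, int] = (12, 0),
) -> bool:
    # Python tuple comparison is lexicographic: (12, 1) > (12, 0),
    # so GB10's 12.1 falls just outside an 8.0-12.0 support window.
    return min_cap <= cap <= max_cap


print(capability_supported((12, 1)))  # False -> triggers the warning
print(capability_supported((12, 0)))  # True
```

This is why a PyTorch build whose supported range tops out at 12.0 warns on GB10 (12.1) even though the code may still run.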
|
@seli-equinix - Tried to run with your patches applied (since they are all Python, I just applied them post-install), and it fails on inference with trtllm - architecture is not supported:

root@spark:/workspace/vllm# vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888 --enable-prefix-caching
(APIServer pid=4548) INFO 01-06 23:25:32 [api_server.py:1278] vLLM API server version 0.14.0rc1.dev265+g951302989.d20260105
(APIServer pid=4548) INFO 01-06 23:25:32 [utils.py:253] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'host': '0.0.0.0', 'port': 8888, 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'max_model_len': 131072, 'load_format': 'fastsafetensors', 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': True}
(APIServer pid=4548) INFO 01-06 23:25:33 [model.py:522] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=4548) INFO 01-06 23:25:33 [model.py:1508] Using max model len 131072
(APIServer pid=4548) WARNING 01-06 23:25:33 [vllm.py:1447] Current vLLM config is not set.
(APIServer pid=4548) INFO 01-06 23:25:33 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=4548) INFO 01-06 23:25:33 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=4548) INFO 01-06 23:25:33 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=4548) INFO 01-06 23:25:33 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=4548) INFO 01-06 23:25:33 [config.py:338] Hybrid or mamba-based model detected without support for prefix caching: disabling.
(APIServer pid=4548) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(APIServer pid=4548) Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=4548) Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=4548) (8.0) - (12.0)
(APIServer pid=4548)
(APIServer pid=4548) warnings.warn(
(APIServer pid=4548) INFO 01-06 23:25:34 [config.py:469] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=4548) INFO 01-06 23:25:34 [config.py:493] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [core.py:96] Initializing a V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 
'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(EngineCore_DP0 pid=4599) Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=4599) Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=4599) (8.0) - (12.0)
(EngineCore_DP0 pid=4599)
(EngineCore_DP0 pid=4599) warnings.warn(
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.24.104:46275 backend=nccl
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [gpu_model_runner.py:3758] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [fp8.py:190] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [fp8.py:209] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:39 [cuda.py:381] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:39 [selector.py:112] Using HND KV cache layout for FLASHINFER backend.
Loading safetensors using Fastsafetensor loader: 0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=4599) warnings.warn(
Loading safetensors using Fastsafetensor loader: 12% Completed | 1/8 [00:03<00:25, 3.60s/it]
Loading safetensors using Fastsafetensor loader: 25% Completed | 2/8 [00:05<00:15, 2.51s/it]
Loading safetensors using Fastsafetensor loader: 38% Completed | 3/8 [00:08<00:13, 2.61s/it]
Loading safetensors using Fastsafetensor loader: 50% Completed | 4/8 [00:10<00:10, 2.69s/it]
Loading safetensors using Fastsafetensor loader: 62% Completed | 5/8 [00:13<00:08, 2.69s/it]
Loading safetensors using Fastsafetensor loader: 75% Completed | 6/8 [00:16<00:05, 2.72s/it]
Loading safetensors using Fastsafetensor loader: 88% Completed | 7/8 [00:19<00:02, 2.76s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:21<00:00, 2.76s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:21<00:00, 2.74s/it]
(EngineCore_DP0 pid=4599)
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:04 [default_loader.py:308] Loading weights took 21.96 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:05 [gpu_model_runner.py:3855] Model loading took 74.8851 GiB memory and 26.018481 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:10 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/d7e56f8d20/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:10 [backends.py:704] Dynamo bytecode transform time: 5.06 s
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:15 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=4599) WARNING 01-06 23:26:15 [fused_moe.py:1054] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:19 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 3.95 s
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:19 [monitor.py:34] torch.compile takes 9.01 s in total
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [gpu_worker.py:361] Available KV cache memory: 5.20 GiB
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [kv_cache_utils.py:1305] GPU KV cache size: 56,576 tokens
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [kv_cache_utils.py:1310] Maximum concurrency for 131,072 tokens per request: 1.71x
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [utils.py:465] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore_DP0 pid=4599) 2026-01-06 23:26:20,690 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=4599) 2026-01-06 23:26:20,805 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 12.32it/s]
Capturing CUDA graphs (decode, FULL): 0%| | 0/35 [00:00<?, ?it/s](EngineCore_DP0 pid=4599) WARNING 01-06 23:26:25 [flashinfer.py:398] Using TRTLLM prefill attention (auto-detected).
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:04<00:00, 7.35it/s]
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:30 [gpu_model_runner.py:4806] Graph capturing finished in 10 secs, took 2.07 GiB
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:30 [core.py:273] init engine (profile, create kv cache, warmup model) took 25.38 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:31 [core.py:185] Batch queue is enabled with size 2
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:32 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=4548) INFO 01-06 23:26:32 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=4548) WARNING 01-06 23:26:32 [model.py:1329] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_responses.py:201] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_chat.py:180] Warming up chat template processing...
(APIServer pid=4548) INFO 01-06 23:26:34 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_chat.py:216] Chat template warmup completed in 1351.7ms
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_completion.py:78] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:34 [api_server.py:1352] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:38] Available routes are:
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=4548) INFO: Started server process [4548]
(APIServer pid=4548) INFO: Waiting for application startup.
(APIServer pid=4548) INFO: Application startup complete.
(APIServer pid=4548) INFO: 192.168.24.115:41154 - "POST /v1/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=4599) return fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=4599) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/d7e56f8d20', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 
'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/d7e56f8d20/rank_0_0/backbone'},
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-8a1ba3eb074cb7e0-0-a09f774b,prompt_token_ids_len=12,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=119, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={cmpl-8a1ba3eb074cb7e0-0-a09f774b: 12}, total_num_scheduled_tokens=12, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.009615384615384581, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] Traceback (most recent call last):
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 881, in run_engine_core
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] engine_core.run_busy_loop()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 908, in run_busy_loop
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] self._process_engine_step()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 941, in _process_engine_step
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 455, in step_with_batch_queue
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] exec_model_fut.result()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self.__get_result()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] raise self._exception
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 369, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 622, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] output = self.model_runner.execute_model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] model_output = self._model_forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2905, in _model_forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self.model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 220, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1232, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] hidden_states = self.model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 442, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 223, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self._call_with_optional_nvtx_range(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 109, in _call_with_optional_nvtx_range
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return callable_fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 998, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] def forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 57, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] raise e
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "<eval_with_key>.98", line 429, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] submod_7 = self.submod_7(getitem_21, s72, getitem_22, getitem_23, getitem_24); getitem_21 = getitem_22 = getitem_23 = submod_7 = None
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] raise e
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "<eval_with_key>.8", line 5, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_8, key_8, value_9, output_10, 'model.layers.3.self_attn.attn'); query_8 = key_8 = value_9 = output_10 = unified_attention_with_output = None
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/utils/kv_transfer_utils.py", line 39, in wrapper
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 807, in unified_attention_with_output
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] self.impl.forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 1430, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] trtllm_batch_context_with_kv_cache(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 3644, in trtllm_batch_context_with_kv_cache
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] run_func(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] RuntimeError: Error in function 'TllmGenFmhaRunner' at /workspace/include/flashinfer/trtllm/fmha/fmhaRunner.cuh:30: Unsupported architecture
(EngineCore_DP0 pid=4599) Process EngineCore_DP0:
(EngineCore_DP0 pid=4599) Traceback (most recent call last):
(EngineCore_DP0 pid=4599) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=4599) self.run()
(EngineCore_DP0 pid=4599) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=4599) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 892, in run_engine_core
(EngineCore_DP0 pid=4599) raise e
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 881, in run_engine_core
(EngineCore_DP0 pid=4599) engine_core.run_busy_loop()
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 908, in run_busy_loop
(EngineCore_DP0 pid=4599) self._process_engine_step()
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 941, in _process_engine_step
(EngineCore_DP0 pid=4599) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 455, in step_with_batch_queue
(EngineCore_DP0 pid=4599) exec_model_fut.result()
(EngineCore_DP0 pid=4599) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=4599) return self.__get_result()
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=4599) raise self._exception
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=4599) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=4599) return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 369, in execute_model
(EngineCore_DP0 pid=4599) return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 622, in execute_model
(EngineCore_DP0 pid=4599) output = self.model_runner.execute_model(
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in execute_model
(EngineCore_DP0 pid=4599) model_output = self._model_forward(
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2905, in _model_forward
(EngineCore_DP0 pid=4599) return self.model(
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 220, in __call__
(EngineCore_DP0 pid=4599) return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1232, in forward
(EngineCore_DP0 pid=4599) hidden_states = self.model(
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 442, in __call__
(EngineCore_DP0 pid=4599) return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 223, in __call__
(EngineCore_DP0 pid=4599) return self._call_with_optional_nvtx_range(
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 109, in _call_with_optional_nvtx_range
(EngineCore_DP0 pid=4599) return callable_fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 998, in forward
(EngineCore_DP0 pid=4599) def forward(
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=4599) return fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 57, in __call__
(EngineCore_DP0 pid=4599) return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) raise e
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "<eval_with_key>.98", line 429, in forward
(EngineCore_DP0 pid=4599) submod_7 = self.submod_7(getitem_21, s72, getitem_22, getitem_23, getitem_24); getitem_21 = getitem_22 = getitem_23 = submod_7 = None
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) raise e
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "<eval_with_key>.8", line 5, in forward
(EngineCore_DP0 pid=4599) unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_8, key_8, value_9, output_10, 'model.layers.3.self_attn.attn'); query_8 = key_8 = value_9 = output_10 = unified_attention_with_output = None
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=4599) return self._op(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/attention/utils/kv_transfer_utils.py", line 39, in wrapper
(EngineCore_DP0 pid=4599) return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 807, in unified_attention_with_output
(EngineCore_DP0 pid=4599) self.impl.forward(
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 1430, in forward
(EngineCore_DP0 pid=4599) trtllm_batch_context_with_kv_cache(
(EngineCore_DP0 pid=4599) File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 3644, in trtllm_batch_context_with_kv_cache
(EngineCore_DP0 pid=4599) run_func(
(EngineCore_DP0 pid=4599) File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
(EngineCore_DP0 pid=4599) RuntimeError: Error in function 'TllmGenFmhaRunner' at /workspace/include/flashinfer/trtllm/fmha/fmhaRunner.cuh:30: Unsupported architecture
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] AsyncLLM output_handler failed.
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] Traceback (most recent call last):
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] outputs = await engine_core.get_output_async()
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] raise self._format_exception(outputs) from None
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] Error in completion stream generator.
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] Traceback (most recent call last):
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 352, in completion_stream_generator
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] async for prompt_idx, res in result_generator:
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/async_utils.py", line 278, in merge_async_iterators
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] async for item in iterators[0]:
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 439, in generate
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] out = q.get_nowait() or await q.get()
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] ^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 73, in get
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] raise output
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] outputs = await engine_core.get_output_async()
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] raise self._format_exception(outputs) from None
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W106 23:27:37.808671670 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=4548) INFO: Shutting down
(APIServer pid=4548) INFO: Waiting for application shutdown.
(APIServer pid=4548) INFO: Application shutdown complete.
(APIServer pid=4548) INFO: Finished server process [4548] |
|
Same error with flashinfer built from source. |
|
@eugr I think you need the latest PyTorch 2.11; that is what I have running. I think the reason yours is failing is that I am still working on getting flashinfer-cubin to work, so my build still uses the TRITON_ATTN backend. Yours should have switched to that instead of trying FlashInfer. I am hoping to get FlashInfer working soon. |
|
This pull request has merge conflicts that must be resolved before it can be merged. |
|
Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
|
|
Building this on a DGX Spark (GB10, CUDA 13.0, Ubuntu 24.04 aarch64) right now — documenting the full process here: https://gist.github.com/Borg2025/8254034a2dfabab380758a76db8e111b Key finding: Related: #31588 (ohsono's GB10 bug report) Thanks @seli-equinix for the GB10 patches — exactly what we needed. 🙏 |
|
Do you use the vLLM Slack, or are you on the NVIDIA community? That way we can talk. |
|
This PR adds support for NVIDIA GB10 GPUs found in DGX Spark devices. The GB10 reports compute capability 12.1 (SM121), which is part of the Blackwell architecture family but uses a different major version than the B100/B200 data center GPUs (SM10x). Changes: - Added is_blackwell_class() method to Platform interface and CudaPlatformBase - Updated _get_backend_priorities() to handle SM10x, SM11x, and SM12x - Replaced all is_device_capability_family(100) checks with is_blackwell_class() - Updated attention backend compute capability checks - Added docstrings explaining SM121 cuDNN/FlashInfer compatibility Blackwell architecture family now includes: - SM100/SM101: B100, B200 data center GPUs (major=10) - SM120/SM121: GB10 DGX Spark, Thor edge devices (major=12) - SM11x: Reserved for future Blackwell variants cuDNN prefill support for SM121: The cuDNN SDPA cubins (named cudnn_sm100_*) are architecture-family binaries that support all Blackwell variants. FlashInfer explicitly supports SM121 (beta) and dispatches SM100, SM110, SM120, and SM121 to the same gen_fmha_cutlass_sm100a_module. The has_nvidia_artifactory() check ensures cubins are available before enabling this feature. Tested on: NVIDIA GB10 (DGX Spark 2) with CUDA 13.0 and PyTorch 2.11 Signed-off-by: seli-equinix <seli@equinix.com>
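The family grouping described in this commit (SM10x, SM11x, and SM12x all treated as Blackwell-class) can be sketched as a standalone predicate. This is an illustration, not the actual vLLM method: the real `is_blackwell_class()` lives on the Platform interface and reads the capability from the device, while this version takes `(major, minor)` directly so it runs without a GPU.

```python
# Sketch of the Blackwell-class detection described in the commit above.
# SM10x = B100/B200 data center GPUs, SM11x = reserved future variants,
# SM12x = GB10 (DGX Spark). Hopper (SM90) and earlier are excluded.
BLACKWELL_MAJORS = {10, 11, 12}


def is_blackwell_class(major: int, minor: int) -> bool:
    """True for any Blackwell-family compute capability (SM100..SM12x)."""
    return major in BLACKWELL_MAJORS


if __name__ == "__main__":
    for cap in [(9, 0), (10, 0), (12, 1)]:
        print(cap, is_blackwell_class(*cap))
```

Replacing scattered `is_device_capability_family(100)` checks with a single predicate like this is what lets GB10 (12.1) take the same code paths as B200 (10.0) without per-device special cases.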
Cherry-pick from fix/tool-call-empty-arguments branch. Prevents JSONDecodeError with Continue VSCode extension. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: seli-equinix <seli@equinix.com>
Cherry-pick from PR vllm-project#32704 - auto-detects GPU arch >= 110 and configures TRITON_PTXAS_PATH to use system CUDA toolkit's ptxas instead of Triton's bundled version (CUDA 12.8) which doesn't support sm_121a. This ensures Triton kernels compile correctly on DGX Spark GB10. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: seli-equinix <seli@equinix.com>
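The ptxas fallback this commit describes can be sketched roughly as follows. The `>= 110` threshold and the reasoning (Triton's bundled ptxas ships with CUDA 12.8 and lacks `sm_121a`) come from the commit message; the fallback path `/usr/local/cuda/bin/ptxas` is an assumed default install location, not something the commit specifies.

```python
# Sketch of the auto-detection in the commit above: for architectures the
# bundled Triton ptxas cannot target, point TRITON_PTXAS_PATH (an env var
# Triton honors) at the system CUDA toolkit's ptxas instead.
import shutil


def configure_ptxas(arch: int, env: dict) -> dict:
    """Return env with TRITON_PTXAS_PATH set when arch needs a newer ptxas."""
    env = dict(env)
    if arch >= 110:  # sm_110+ unsupported by Triton's bundled ptxas (CUDA 12.8)
        # Assumed fallback path; adjust for your CUDA install.
        env["TRITON_PTXAS_PATH"] = (
            shutil.which("ptxas") or "/usr/local/cuda/bin/ptxas"
        )
    return env
```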
The CMakeLists.txt SM120 kernel checks only included 12.0a/12.0f but not 12.1a/12.1f. This caused builds targeting SM121 (DGX Spark GB10) to miss the CUTLASS scaled_mm, FP4, MLA, and MOE kernels. Updated checks: - SCALED_MM_ARCHS: 12.0f;12.1f (CUDA 13+) and 12.0a;12.1a (CUDA <13) - FP4_ARCHS: 12.0f;12.1f (CUDA 13+) and 12.0a;12.1a (CUDA <13) - MLA_ARCHS: Added 12.1f (CUDA 13+) - CUTLASS_MOE_DATA_ARCHS: Added 12.1f (CUDA 13+) This fixes: NotImplementedError: No compiled cutlass_scaled_mm for a compute capability less than CUDA device capability: 121 Signed-off-by: seli-equinix <seli@equinix.com>
Add optimized fused MoE kernel configuration for NVIDIA GB10 (SM121/DGX Spark) with FP8 w8a8 quantization. Config parameters adjusted for GB10's 48 SMs: - Reduced GROUP_SIZE_M (16-32 vs 64) for better SM utilization - Based on B200 config with SM-count-aware adjustments This eliminates the "Using default MoE config" warning and provides tuned block sizes for Qwen3-Next-80B-A3B-FP8 and similar MoE models with E=512, N=512. Signed-off-by: seli-equinix <seli@equinix.com> Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: seli-equinix <seli@equinix.com>
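For readers unfamiliar with these config files, the fused-MoE JSON format this commit adds an entry to maps token-count buckets to Triton kernel parameters. The sketch below shows the shape of such a config and a simplified bucket lookup; the key names match the fused_moe config format, but the numeric values are placeholders, not the tuned GB10 numbers from the commit.

```python
# Illustrative shape of a vLLM fused-MoE kernel config: string token-count
# buckets mapping to Triton launch parameters. Values are placeholders.
EXAMPLE_MOE_CONFIG = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 32, "num_warps": 8, "num_stages": 4},
}


def lookup_config(num_tokens: int, config: dict) -> dict:
    """Pick the bucket with the largest key <= num_tokens (simplified;
    vLLM's actual selection logic is more involved)."""
    keys = sorted(int(k) for k in config)
    best = keys[0]
    for k in keys:
        if k <= num_tokens:
            best = k
    return config[str(best)]
```

Shipping a tuned file for a device keyed by its SM count is what suppresses the "Using default MoE config" warning the commit mentions.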
1. FlashMLA Sparse detection (flashmla.py): - Changed is_blackwell_class() to is_device_capability_family(100) - FlashMLA Sparse only supports SM90/SM100, NOT SM12x - Updated error message to be more specific 2. CMakeLists.txt MLA_ARCHS: - Removed SM12x (12.0f, 12.1f, 12.0a, 12.1a) from MLA_ARCHS - CUTLASS MLA only supports SM10x, SM12x uses TRITON_MLA - Added clarifying comment 3. cuda.py documentation: - Removed unverified "Thor" device reference - Only tested hardware (GB10 DGX Spark) now mentioned Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> Signed-off-by: seli-equinix <seli@equinix.com>
- Collect tokens/sec, latency, TPS benchmarks
- Use plot plugin for trend graphs over builds
- CSV output for metrics tracking

Signed-off-by: seli-equinix <seli@equinix.com>
The _supports_quant_scheme() check restricted FP8 block-scale (kFp8Static128BlockSym) to exact SM90 only. SM121 (GB10 DGX Spark) has working FP8 block-scale support via FlashInfer v0.6.3's native group_gemm_fp8_nt_groupwise path. Add is_device_capability_family(120) to the FP8 block-scale check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
Separate FP4-specific utility checks (fp4_quantize, nvfp4_block_scale_interleave) into has_flashinfer_nvfp4() so that has_flashinfer_cutlass_fused_moe() only checks for the core CUTLASS MoE entry point. This allows FP8 CUTLASS MoE to work on SM121 (GB10), which has cutlass_fused_moe but lacks FP4 utilities.

Also gate the nvfp4 quant scheme on has_flashinfer_nvfp4() in FlashInferExperts._supports_quant_scheme().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
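The general shape of such split capability probes (an assumption about the pattern, not vLLM's exact implementation): each helper succeeds only if the module imports and exposes every required symbol, so the FP4-only utilities can be probed independently of the core MoE entry point.

```python
import importlib


def has_symbols(module_name: str, names: tuple[str, ...]) -> bool:
    """True iff `module_name` imports and exposes all of `names`.

    Splitting one monolithic probe into two calls of this shape -
    one for the core entry point, one for the FP4 utilities - lets
    FP8 MoE proceed on builds that lack the FP4 symbols.
    """
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return False
    return all(hasattr(mod, n) for n in names)
```

E.g. `has_symbols("flashinfer", ("cutlass_fused_moe",))` versus `has_symbols("flashinfer", ("fp4_quantize", "nvfp4_block_scale_interleave"))` would be the two gates (module and symbol names taken from the commit text).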
GraphPickler raises 'Unexpected raw Node during pickling' when node.meta contains raw Node objects in keys beyond the default filter list (source_fn_stack, nn_module_stack, fwd_source_fn_stack). This occurs on PyTorch nightly 2.11.0+, where additional FX passes inject Node refs into metadata fields like from_node.

Fix: Walk all node.meta values recursively and strip any key whose value tree contains a torch.fx.Node reference before calling GraphPickler.dumps(). Also add a fallback Node handler in the custom reducer_override as a safety net for any references that slip through complex nested structures.

This fixes the AOT cache being written as 0 bytes on every startup, which caused 'Ran out of input' errors on subsequent loads and forced a full torch.compile recompilation (~11s) on every restart.

Signed-off-by: seli-equinix <seli@equinix.com>
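The recursive walk described in the fix can be sketched like this - framework-free, with the Node test abstracted behind a predicate so the sketch stands alone (the real code would use `isinstance(obj, torch.fx.Node)`):

```python
def _contains_node(obj, is_node) -> bool:
    # Recursively search dicts / lists / tuples / sets for a raw Node.
    if is_node(obj):
        return True
    if isinstance(obj, dict):
        return any(_contains_node(v, is_node) for v in obj.values())
    if isinstance(obj, (list, tuple, set)):
        return any(_contains_node(v, is_node) for v in obj)
    return False


def strip_node_meta(meta: dict, is_node) -> dict:
    # Drop every meta key whose value tree references a Node, so
    # GraphPickler.dumps() never sees an unpicklable raw Node.
    return {k: v for k, v in meta.items()
            if not _contains_node(v, is_node)}
```

Keys like `from_node` holding `[Node]` are dropped wholesale; plain metadata (shapes, dtypes, stack strings) survives untouched, which is what keeps the AOT cache entry non-empty.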
…on PTX lacks SM121)
This pull request has merge conflicts that must be resolved before it can be merged.
Summary
With lots of help from others, this PR adds support for the NVIDIA GB10 GPU (SM121) found in DGX Spark devices. It extends Blackwell architecture support beyond SM100/SM103 to include the SM12x family.
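The unifying idea is a single Blackwell-class predicate instead of scattered SM100 checks. A sketch of the semantics described here, flattening the (major, minor) capability into an int like 121 (vLLM's real method works on its DeviceCapability type, so this is illustrative only):

```python
def is_blackwell_class(cap: int) -> bool:
    # Blackwell-class per this PR: the SM10x, SM11x, and SM12x families,
    # i.e. compute capability 100 (inclusive) up to 130 (exclusive).
    return 100 <= cap < 130
```

Call sites that previously tested `cap == 100` can then admit SM103, SM110, SM120, and SM121 with one check, while stricter paths (e.g. FlashMLA Sparse) keep a narrower family test.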
Commits in This PR
(commit hash table; the `is_blackwell_class()` commit and its follow-ups are summarized in the commit messages above)

Key Changes
Platform Detection
- Added `is_blackwell_class()` method to detect SM10x, SM11x, and SM12x GPUs

CUTLASS Kernel Support
- Extended `SCALED_MM_ARCHS` to include `12.1f` (CUDA 13+) and `12.1a` (CUDA <13)
- Updated `FP4_ARCHS`, `MLA_ARCHS`, `CUTLASS_MOE_DATA_ARCHS` for SM121
- Fixes `NotImplementedError: No compiled cutlass_scaled_mm for compute capability: 121`

Triton Compilation
- Sets `TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` when SM121 is detected

MoE Kernel Tuning
Hardware Tested
Performance Results
Backend Support Matrix for SM121
Files Changed
Platform/Detection
- `vllm/platforms/interface.py` - Added `is_blackwell_class()` abstract method
- `vllm/platforms/cuda.py` - Implemented Blackwell-class detection

Build System
- `CMakeLists.txt` - Added SM121 (12.1a/12.1f) to CUTLASS kernel arch lists

Attention/MLA
- `vllm/attention/utils/fa_utils.py` - Extended FA fallback to SM12x
- `vllm/v1/attention/backends/flashinfer.py` - Added SM12x to HND layout support

MoE/Quantization
- `vllm/model_executor/layers/fused_moe/configs/` - Added GB10 tuning configs

Auto-generated
- `docs/design/attention_backends.md` - Regenerated by pre-commit hook
- `tools/pre_commit/generate_attention_backend_docs.py` - From upstream

Notes