
feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support #31740

Open
seli-equinix wants to merge 13 commits into vllm-project:main from seli-equinix:feature/sm121-gb10-support

Conversation

@seli-equinix

@seli-equinix seli-equinix commented Jan 5, 2026

Summary

With lots of help from others, this PR adds support for the NVIDIA GB10 GPU (SM121) found in DGX Spark devices. It extends Blackwell architecture support beyond SM100/SM103 to include the SM12x family.

Commits in This PR

| Commit | Description |
| --- | --- |
| e3554dab0 | feat: Add SM121/GB10 Blackwell-class GPU support - Core platform detection with is_blackwell_class() |
| 06a6e576e | fix: Handle empty/whitespace tool_call arguments - Chat preprocessing fix |
| 8086e764f | feat: Auto-configure TRITON_PTXAS_PATH for SM121/GB10 - Ensures Triton kernel compilation works |
| 64b34ee80 | fix: Add SM121 (12.1) arch support for CUTLASS SM120 kernels - CMakeLists.txt fixes for 12.1a/12.1f |
| 117a69efc | feat: Add NVIDIA GB10 MoE tuning config for FP8 - Optimized kernel configs for Qwen3 MoE |
| 397a80eff | chore: Add attention backend documentation generator - Pre-commit hook from upstream |

Key Changes

Platform Detection

  • Added is_blackwell_class() method to detect SM10x, SM11x, and SM12x GPUs
  • Updated backend selection to properly handle GB10/SM121
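
A minimal sketch of the idea, not the exact vLLM implementation: the detection boils down to comparing the CUDA compute-capability major version against the Blackwell-class family.

```python
# Hedged sketch: treat SM10x, SM11x, and SM12x as one Blackwell-class family.
# The function name matches the PR description; the signature is simplified
# (vLLM's platform method takes no capability arguments).
def is_blackwell_class(major: int) -> bool:
    return major in (10, 11, 12)

assert is_blackwell_class(10)       # SM100
assert is_blackwell_class(12)       # SM121 (GB10 / DGX Spark)
assert not is_blackwell_class(9)    # SM90 (Hopper)
```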

CUTLASS Kernel Support

  • Extended SCALED_MM_ARCHS to include 12.1f (CUDA 13+) and 12.1a (CUDA <13)
  • Extended FP4_ARCHS, MLA_ARCHS, CUTLASS_MOE_DATA_ARCHS for SM121
  • Fixes `NotImplementedError: No compiled cutlass_scaled_mm for compute capability: 121`
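
For context, that error comes from a runtime capability gate. The sketch below only illustrates the failure mode, with a hypothetical capability set rather than vLLM's actual dispatcher:

```python
# Illustrative only: a hypothetical pre-PR set of compiled capabilities,
# not vLLM's real dispatcher. Without 12.1 in the CMake arch lists, no
# SM121 kernel exists and the lookup raises.
COMPILED_SCALED_MM_CAPABILITIES = {90, 100, 120}

def check_scaled_mm_support(capability: int) -> None:
    if capability not in COMPILED_SCALED_MM_CAPABILITIES:
        raise NotImplementedError(
            f"No compiled cutlass_scaled_mm for compute capability: {capability}"
        )

check_scaled_mm_support(121)  # raises unless SM121 kernels were compiled in
```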

Triton Compilation

  • Auto-configures TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas when SM121 detected
  • Required because Triton doesn't recognize SM121 natively
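
A minimal sketch of this auto-configuration, assuming the SM121 check has already happened upstream; the exact hook point and helper name in vLLM may differ:

```python
import os

# Hedged sketch: point Triton at the CUDA toolkit's ptxas, which understands
# SM121, instead of the ptxas bundled with Triton. setdefault() keeps any
# value the user already exported.
def maybe_set_triton_ptxas(is_sm121: bool) -> None:
    if is_sm121:
        os.environ.setdefault("TRITON_PTXAS_PATH", "/usr/local/cuda/bin/ptxas")
```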

MoE Kernel Tuning

  • Added optimized FP8 w8a8 MoE configs for GB10 with Qwen3-Next-80B-A3B-FP8
  • 18 batch sizes tuned: 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536, 2048, 3072, 4096
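
For illustration, vLLM's fused-MoE tuning files are JSON keyed by batch size; the sketch below shows the shape of such a config with placeholder values, not the actual numbers tuned in this PR:

```python
# Placeholder values only - the shape follows vLLM's fused-MoE config schema
# (one Triton kernel config per tuned batch size), but these are not the
# tuned numbers shipped in this PR.
gb10_fp8_moe_config = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "4096": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
    # ...entries for the remaining tuned batch sizes...
}
```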

Hardware Tested

| Property | Value |
| --- | --- |
| Device | NVIDIA DGX Spark with GB10 GPU |
| Compute Capability | SM121 (12.1) |
| CUDA Version | 13.1.0 |
| PyTorch Version | 2.11.0+cu130 |
| Architecture | ARM64 (aarch64) |
| Memory | 119.6 GB unified |
| SMs | 48 |

Performance Results

| Model | Tokens/sec | Context Length |
| --- | --- | --- |
| Qwen3-Coder-30B-A3B-FP8 | 44 tok/s | 128K |
| Qwen3-Next-80B-A3B-FP8 | 45 tok/s | 256K |
| qwen3-embedding | | |
| Qwen3-vision | 45 tok/s | 256K |

Backend Support Matrix for SM121

| Backend | Status | Notes |
| --- | --- | --- |
| TRITON_ATTN | ✅ Works | Primary attention backend |
| TRITON_MLA | ✅ Works | Primary MLA backend |
| FlashInfer attention | ✅ Works | Supports SM75-SM121 |
| FlashInfer MLA | ❌ SM100 only | Falls back to TRITON_MLA |
| CUTLASS MLA | ❌ SM100 only | Falls back to TRITON_MLA |
| FlashMLA Sparse | ❌ SM90/100 only | Not supported |
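
Put as pseudocode, the fallback in the matrix above amounts to something like the following sketch (function and constant names are assumptions, not vLLM's exact selection code in vllm/platforms/cuda.py):

```python
# Hedged sketch of the MLA fallback described in the matrix above; the real
# backend-priority logic is more involved.
def select_mla_backend(is_family_100: bool) -> str:
    if is_family_100:
        return "FLASHINFER_MLA"  # SM100-only (CUTLASS_MLA likewise)
    return "TRITON_MLA"          # primary MLA backend on SM121
```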

Files Changed

Platform/Detection

  • vllm/platforms/interface.py - Added is_blackwell_class() abstract method
  • vllm/platforms/cuda.py - Implemented Blackwell-class detection

Build System

  • CMakeLists.txt - Added SM121 (12.1a/12.1f) to CUTLASS kernel arch lists

Attention/MLA

  • vllm/attention/utils/fa_utils.py - Extended FA fallback to SM12x
  • vllm/v1/attention/backends/flashinfer.py - Added SM12x to HND layout support

MoE/Quantization

  • vllm/model_executor/layers/fused_moe/configs/ - Added GB10 tuning configs

Auto-generated

  • docs/design/attention_backends.md - Regenerated by pre-commit hook
  • tools/pre_commit/generate_attention_backend_docs.py - From upstream

Notes

  • SM121 uses TRITON backends as primary since FlashInfer MLA/CUTLASS MLA only support SM100
  • GB10 has unified memory architecture which affects memory utilization differently than discrete GPUs
  • The MoE tuning configs provide significant speedup for Mixture-of-Experts models

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for NVIDIA GB10 (SM121) GPUs by introducing an is_blackwell_class() method and replacing hardcoded checks for SM100. The changes are well-structured and cover multiple parts of the codebase, including attention backends and quantization layers.

I've identified a critical issue: an inconsistency between a comment and the code for FlashInfer autotuning, which could impact performance on new hardware. Additionally, there are a couple of instances of duplicated is_blackwell_class logic that should be addressed to improve maintainability.

Overall, this is a good contribution to extend hardware support. Addressing the identified issues will make the changes more robust and easier to maintain.

@github-actions

github-actions bot commented Jan 5, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Jan 5, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Contributor

Copilot AI left a comment


Pull request overview

This PR extends vLLM's Blackwell architecture support to include the SM121/GB10 GPU found in DGX Spark devices, moving from device-specific checks to a unified Blackwell-class detection approach.

Key changes:

  • Introduced is_blackwell_class() method to detect SM10x, SM11x, and SM12x GPUs as a unified Blackwell family
  • Replaced scattered is_device_capability_family(100) checks with the new class-based detection throughout the codebase
  • Added MoE configuration files optimized for GB10 hardware

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| vllm/platforms/interface.py | Added is_blackwell_class() method to platform interface with documentation for SM10x/11x/12x detection |
| vllm/platforms/cuda.py | Implemented Blackwell-class detection helper and integrated it into backend priority selection |
| vllm/attention/utils/fa_utils.py | Extended Flash Attention v3 fallback logic to include SM12x devices |
| vllm/v1/attention/backends/flashinfer.py | Updated HND layout detection and head dimension validation for Blackwell-class |
| vllm/v1/attention/backends/mla/*.py | Extended MLA backend compute capability checks to support SM12x |
| vllm/utils/*.py | Updated FlashInfer and DeepGemm utility functions to use Blackwell-class detection |
| vllm/model_executor/layers/quantization/*.py | Updated quantization backend selection logic for Blackwell-class |
| vllm/model_executor/layers/fused_moe/configs/*.json | Added GB10-specific MoE configuration files with Triton kernel parameters |
| vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py | Updated DeepGemm packed activation scale support check |
| vllm/model_executor/layers/batch_invariant.py | Extended batch-invariant mode enablement to Blackwell-class |
| vllm/model_executor/models/config.py | Updated kernel block alignment check for Blackwell-class |
| vllm/model_executor/warmup/kernel_warmup.py | Updated comment to reflect Blackwell-class architecture support |


@mergify

mergify bot commented Jan 5, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@eugr

eugr commented Jan 6, 2026

@seli-equinix - what were your "before" numbers? I'm seeing about the same (or even better) performance without this change on my Spark with FP8 models. Have you tested any FP4 (NVFP4/MXFP4) models? Also, what do you mean by "context" here - the allocated KV cache size, or inference numbers for a request context of that size? If the latter, that's great, but I'd like to see the measurement methodology. If the former, here is a run from a few weeks ago on main-branch vLLM and a single DGX Spark:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888

Jan: 44 t/s

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8000 \
  --num-prompts 1
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.79
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.36
Output token throughput (tok/s):         42.61
Peak output token throughput (tok/s):    44.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          46.90
---------------Time to First Token----------------
Mean TTFT (ms):                          131.13
Median TTFT (ms):                        131.13
P99 TTFT (ms):                           131.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.55
Median TPOT (ms):                        22.55
P99 TPOT (ms):                           22.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.55
Median ITL (ms):                         22.38
P99 ITL (ms):                            25.39
==================================================

@seli-equinix
Author

@copilot open a new pull request to apply changes based on the comments in this thread

@mergify

mergify bot commented Jan 6, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@seli-equinix force-pushed the feature/sm121-gb10-support branch from 2434173 to 642709a on January 6, 2026 at 19:33
@mergify

mergify bot commented Jan 6, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@seli-equinix force-pushed the feature/sm121-gb10-support branch from d8ae522 to 9fbcb4c on January 6, 2026 at 19:43
@seli-equinix
Author

@eugr please reach out to me at hellohal2064@gmail.com, or you can give me a call at 971-708-9761. Would be happy to collaborate on vLLM running on the DGX Spark :)

@seli-equinix
Author

@eugr thanks for the detailed questions! I ran benchmarks on my Spark using the same methodology.

My setup:
  • Model: Qwen3-Next-80B-A3B-FP8
  • vLLM: built from this branch with SM121 detection
  • Input: ~12 tokens/request, output: 120 tokens max

============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 2.70
Total input tokens: 12
Total generated tokens: 120
Request throughput (req/s): 0.37
Output token throughput (tok/s): 44.43
Peak output token throughput (tok/s): 44.43
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 48.88
---------------Time to First Token----------------
Mean TTFT (ms): 95.63
Median TTFT (ms): 95.63
P99 TTFT (ms): 95.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 21.89
Median TPOT (ms): 21.94
P99 TPOT (ms): 24.22
---------------Inter-token Latency----------------
Mean ITL (ms): 21.89
Median ITL (ms): 21.94
P99 ITL (ms): 24.22

What this PR actually fixes:
The PR ensures SM121 (GB10/Spark) is recognized as a Blackwell-class architecture for attention backend selection. Without it, the is_device_capability_family(100) checks fail for SM121, since its major version is 12, not 10.
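
To make that concrete, here is a two-line illustration of why a family-100 check misses GB10 (capability values taken from the PyTorch warning later in this thread; the broader check mirrors the PR's approach):

```python
# GB10 reports compute capability 12.1, so a family-100 check (major == 10)
# evaluates False and the SM100-only code paths are skipped on SM121.
major, minor = 12, 1
is_family_100 = major == 10               # False on GB10/SM121
is_blackwell_class = major in (10, 11, 12)  # True - the check this PR adds
```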

Your question about "before":
I should clarify - I don't have before/after performance numbers because my goal was compatibility, not optimization. What specific error (if any) were you seeing before this landed in main?

You're running on main branch successfully?
If you're running Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 on main without issues, that's interesting. Can you share:

  1. What vLLM version/commit?
  2. Any startup warnings about unsupported architecture?
  3. Which attention backend is being selected?

My "context" reference:
I meant the model's configured context window (--max-model-len), not actual inference benchmarks with long prompts.

FP4:
Haven't tested NVFP4/MXFP4; my needs are focused on the smart Qwen models. I am building a custom MCP server that runs on my second Spark - an LLM-backed learning and memory system.

Would love to compare notes - sounds like we're both working on Spark optimization!

@eugr

eugr commented Jan 6, 2026

@seli-equinix - it's never been an issue for me on the main branch (after the initial Spark woes were addressed).
There is a startup warning from PyTorch, but it can be ignored as it doesn't seem to affect performance (I tried a patched version of PyTorch too).

Here is the model running on vLLM compiled from the main branch yesterday using my Docker build:

root@spark:/workspace/vllm# vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888
(APIServer pid=799) INFO 01-06 21:04:06 [api_server.py:1278] vLLM API server version 0.14.0rc1.dev265+g951302989.d20260105
(APIServer pid=799) INFO 01-06 21:04:06 [utils.py:253] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'host': '0.0.0.0', 'port': 8888, 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'max_model_len': 131072, 'load_format': 'fastsafetensors', 'gpu_memory_utilization': 0.7}
(APIServer pid=799) INFO 01-06 21:04:11 [model.py:522] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=799) INFO 01-06 21:04:11 [model.py:1508] Using max model len 131072
(APIServer pid=799) WARNING 01-06 21:04:11 [vllm.py:1447] Current vLLM config is not set.
(APIServer pid=799) INFO 01-06 21:04:11 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=799) INFO 01-06 21:04:11 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=799) INFO 01-06 21:04:11 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=799) INFO 01-06 21:04:11 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=799) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(APIServer pid=799)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=799)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=799)     (8.0) - (12.0)
(APIServer pid=799)
(APIServer pid=799)   warnings.warn(
(APIServer pid=799) INFO 01-06 21:04:11 [config.py:469] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=799) INFO 01-06 21:04:11 [config.py:493] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=892) INFO 01-06 21:04:15 [core.py:96] Initializing a V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(EngineCore_DP0 pid=892)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=892)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=892)     (8.0) - (12.0)
(EngineCore_DP0 pid=892)
(EngineCore_DP0 pid=892)   warnings.warn(
(EngineCore_DP0 pid=892) INFO 01-06 21:04:15 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.24.104:59097 backend=nccl
(EngineCore_DP0 pid=892) INFO 01-06 21:04:15 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=892) INFO 01-06 21:04:16 [gpu_model_runner.py:3758] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(EngineCore_DP0 pid=892) INFO 01-06 21:04:16 [fp8.py:190] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=892) INFO 01-06 21:04:16 [fp8.py:209] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=892) INFO 01-06 21:04:17 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=892)   warnings.warn(
Loading safetensors using Fastsafetensor loader:  12% Completed | 1/8 [00:03<00:24,  3.50s/it]
Loading safetensors using Fastsafetensor loader:  25% Completed | 2/8 [00:05<00:14,  2.42s/it]
Loading safetensors using Fastsafetensor loader:  38% Completed | 3/8 [00:08<00:13,  2.65s/it]
Loading safetensors using Fastsafetensor loader:  50% Completed | 4/8 [00:10<00:10,  2.64s/it]
Loading safetensors using Fastsafetensor loader:  62% Completed | 5/8 [00:13<00:08,  2.83s/it]
Loading safetensors using Fastsafetensor loader:  75% Completed | 6/8 [00:16<00:05,  2.81s/it]
Loading safetensors using Fastsafetensor loader:  88% Completed | 7/8 [00:19<00:02,  2.90s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:22<00:00,  2.91s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:22<00:00,  2.84s/it]
(EngineCore_DP0 pid=892)
(EngineCore_DP0 pid=892) INFO 01-06 21:04:42 [default_loader.py:308] Loading weights took 22.69 seconds
(EngineCore_DP0 pid=892) INFO 01-06 21:04:43 [gpu_model_runner.py:3855] Model loading took 74.8851 GiB memory and 26.704041 seconds
(EngineCore_DP0 pid=892) INFO 01-06 21:04:48 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/412fb0c8f0/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=892) INFO 01-06 21:04:48 [backends.py:704] Dynamo bytecode transform time: 5.09 s
(EngineCore_DP0 pid=892) [rank0]:W0106 21:04:53.838000 892 torch/_inductor/utils.py:1613] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=892) INFO 01-06 21:04:56 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=892) WARNING 01-06 21:04:59 [fused_moe.py:1054] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=892) INFO 01-06 21:06:22 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 88.76 s
(EngineCore_DP0 pid=892) INFO 01-06 21:06:22 [monitor.py:34] torch.compile takes 93.85 s in total
(EngineCore_DP0 pid=892) INFO 01-06 21:06:23 [gpu_worker.py:361] Available KV cache memory: 4.65 GiB
(EngineCore_DP0 pid=892) INFO 01-06 21:06:23 [kv_cache_utils.py:1305] GPU KV cache size: 50,592 tokens
(EngineCore_DP0 pid=892) INFO 01-06 21:06:23 [kv_cache_utils.py:1310] Maximum concurrency for 131,072 tokens per request: 1.53x
(EngineCore_DP0 pid=892) 2026-01-06 21:06:23,957 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=892) 2026-01-06 21:06:24,066 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 10.88it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:52<00:00,  1.50s/it]
(EngineCore_DP0 pid=892) INFO 01-06 21:07:22 [gpu_model_runner.py:4806] Graph capturing finished in 58 secs, took 2.52 GiB
(EngineCore_DP0 pid=892) INFO 01-06 21:07:22 [core.py:273] init engine (profile, create kv cache, warmup model) took 158.76 seconds
(EngineCore_DP0 pid=892) INFO 01-06 21:07:23 [core.py:185] Batch queue is enabled with size 2
(EngineCore_DP0 pid=892) INFO 01-06 21:07:23 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=799) INFO 01-06 21:07:23 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=799) WARNING 01-06 21:07:24 [model.py:1329] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=799) INFO 01-06 21:07:24 [serving_responses.py:201] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:24 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:24 [serving_chat.py:180] Warming up chat template processing...
(APIServer pid=799) INFO 01-06 21:07:25 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=799) INFO 01-06 21:07:25 [serving_chat.py:216] Chat template warmup completed in 1305.3ms
(APIServer pid=799) INFO 01-06 21:07:25 [serving_completion.py:78] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:25 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:25 [api_server.py:1352] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:38] Available routes are:
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=799) INFO:     Started server process [799]
(APIServer pid=799) INFO:     Waiting for application startup.
(APIServer pid=799) INFO:     Application startup complete.

There are a few warnings during inference that I haven't seen a month ago when I ran the model last time, but they don't seem to affect anything:

(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*contiguous_args, **contiguous_kwargs)

Benchmark on today run:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.67
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.37
Output token throughput (tok/s):         44.57
Peak output token throughput (tok/s):    46.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          49.06
---------------Time to First Token----------------
Mean TTFT (ms):                          93.94
Median TTFT (ms):                        93.94
P99 TTFT (ms):                           93.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.83
Median TPOT (ms):                        21.83
P99 TPOT (ms):                           21.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.83
Median ITL (ms):                         21.77
P99 ITL (ms):                            23.79
==================================================

@seli-equinix
Author

@eugr That is really strange - I could not get the Docker (29.1.x) container to build without the changes in this PR. This is my setup from inside the container:

  • OS: Ubuntu 24.04.3 LTS (Noble Numbat)
  • Kernel: 6.14.0-1015-nvidia (aarch64)
  • PyTorch: 2.11.0.dev20260103+cu130 (nightly)
  • CUDA: 13.0
  • cuDNN: 9.15.01

@eugr

eugr commented Jan 6, 2026

It could be PyTorch. I had to revert to the release version of PyTorch, as pre-release ones were giving me weird errors.
What Dockerfile are you using - the one from the vLLM repo? It never really worked for me anyway. But cu130 wheels or just building from source works fine. You can try my build: https://github.com/eugr/spark-vllm-docker

CUDA 13.1 (but it worked with 13.0.2 just as well)
torch==2.9.1+cu130
cuDNN 9.17.0
flashinfer-cubin==0.6.0rc2
flashinfer-jit-cache==0.6.0rc2+cu130
flashinfer-python==0.6.0rc2
triton @ file:///workspace/wheels/triton-3.5.1-cp312-cp312-linux_aarch64.whl
triton-kernels @ file:///workspace/wheels/triton_kernels-1.0.0-py3-none-any.whl

Having said that, I'll try to apply your patch post-build and see if it improves anything.

@eugr

eugr commented Jan 6, 2026

@seli-equinix - Tried to run with your patches applied (since they are all Python, I just applied post-install), and it fails on inference with trtllm - architecture is not supported:

root@spark:/workspace/vllm# vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888 --enable-prefix-caching
(APIServer pid=4548) INFO 01-06 23:25:32 [api_server.py:1278] vLLM API server version 0.14.0rc1.dev265+g951302989.d20260105
(APIServer pid=4548) INFO 01-06 23:25:32 [utils.py:253] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'host': '0.0.0.0', 'port': 8888, 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'max_model_len': 131072, 'load_format': 'fastsafetensors', 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': True}
(APIServer pid=4548) INFO 01-06 23:25:33 [model.py:522] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=4548) INFO 01-06 23:25:33 [model.py:1508] Using max model len 131072
(APIServer pid=4548) WARNING 01-06 23:25:33 [vllm.py:1447] Current vLLM config is not set.
(APIServer pid=4548) INFO 01-06 23:25:33 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=4548) INFO 01-06 23:25:33 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=4548) INFO 01-06 23:25:33 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=4548) INFO 01-06 23:25:33 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=4548) INFO 01-06 23:25:33 [config.py:338] Hybrid or mamba-based model detected without support for prefix caching: disabling.
(APIServer pid=4548) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(APIServer pid=4548)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=4548)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=4548)     (8.0) - (12.0)
(APIServer pid=4548)
(APIServer pid=4548)   warnings.warn(
(APIServer pid=4548) INFO 01-06 23:25:34 [config.py:469] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=4548) INFO 01-06 23:25:34 [config.py:493] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [core.py:96] Initializing a V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(EngineCore_DP0 pid=4599)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=4599)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=4599)     (8.0) - (12.0)
(EngineCore_DP0 pid=4599)
(EngineCore_DP0 pid=4599)   warnings.warn(
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.24.104:46275 backend=nccl
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [gpu_model_runner.py:3758] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [fp8.py:190] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [fp8.py:209] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:39 [cuda.py:381] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:39 [selector.py:112] Using HND KV cache layout for FLASHINFER backend.
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=4599)   warnings.warn(
Loading safetensors using Fastsafetensor loader:  12% Completed | 1/8 [00:03<00:25,  3.60s/it]
Loading safetensors using Fastsafetensor loader:  25% Completed | 2/8 [00:05<00:15,  2.51s/it]
Loading safetensors using Fastsafetensor loader:  38% Completed | 3/8 [00:08<00:13,  2.61s/it]
Loading safetensors using Fastsafetensor loader:  50% Completed | 4/8 [00:10<00:10,  2.69s/it]
Loading safetensors using Fastsafetensor loader:  62% Completed | 5/8 [00:13<00:08,  2.69s/it]
Loading safetensors using Fastsafetensor loader:  75% Completed | 6/8 [00:16<00:05,  2.72s/it]
Loading safetensors using Fastsafetensor loader:  88% Completed | 7/8 [00:19<00:02,  2.76s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:21<00:00,  2.76s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:21<00:00,  2.74s/it]
(EngineCore_DP0 pid=4599)
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:04 [default_loader.py:308] Loading weights took 21.96 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:05 [gpu_model_runner.py:3855] Model loading took 74.8851 GiB memory and 26.018481 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:10 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/d7e56f8d20/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:10 [backends.py:704] Dynamo bytecode transform time: 5.06 s
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:15 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=4599) WARNING 01-06 23:26:15 [fused_moe.py:1054] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:19 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 3.95 s
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:19 [monitor.py:34] torch.compile takes 9.01 s in total
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [gpu_worker.py:361] Available KV cache memory: 5.20 GiB
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [kv_cache_utils.py:1305] GPU KV cache size: 56,576 tokens
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [kv_cache_utils.py:1310] Maximum concurrency for 131,072 tokens per request: 1.71x
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [utils.py:465] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore_DP0 pid=4599) 2026-01-06 23:26:20,690 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=4599) 2026-01-06 23:26:20,805 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 12.32it/s]
Capturing CUDA graphs (decode, FULL):   0%|                                                                                                                                                                                                                                                                                                                                                                 | 0/35 [00:00<?, ?it/s](EngineCore_DP0 pid=4599) WARNING 01-06 23:26:25 [flashinfer.py:398] Using TRTLLM prefill attention (auto-detected).
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:04<00:00,  7.35it/s]
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:30 [gpu_model_runner.py:4806] Graph capturing finished in 10 secs, took 2.07 GiB
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:30 [core.py:273] init engine (profile, create kv cache, warmup model) took 25.38 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:31 [core.py:185] Batch queue is enabled with size 2
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:32 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=4548) INFO 01-06 23:26:32 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=4548) WARNING 01-06 23:26:32 [model.py:1329] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_responses.py:201] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_chat.py:180] Warming up chat template processing...
(APIServer pid=4548) INFO 01-06 23:26:34 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_chat.py:216] Chat template warmup completed in 1351.7ms
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_completion.py:78] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:34 [api_server.py:1352] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:38] Available routes are:
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=4548) INFO:     Started server process [4548]
(APIServer pid=4548) INFO:     Waiting for application startup.
(APIServer pid=4548) INFO:     Application startup complete.
(APIServer pid=4548) INFO:     192.168.24.115:41154 - "POST /v1/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=4599)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=4599)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/d7e56f8d20', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/d7e56f8d20/rank_0_0/backbone'},
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-8a1ba3eb074cb7e0-0-a09f774b,prompt_token_ids_len=12,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=119, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={cmpl-8a1ba3eb074cb7e0-0-a09f774b: 12}, total_num_scheduled_tokens=12, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.009615384615384581, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] Traceback (most recent call last):
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 881, in run_engine_core
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 908, in run_busy_loop
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     self._process_engine_step()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 941, in _process_engine_step
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 455, in step_with_batch_queue
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     exec_model_fut.result()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.__get_result()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     raise self._exception
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 369, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 622, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     model_output = self._model_forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                    ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2905, in _model_forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 220, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1232, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     hidden_states = self.model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                     ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 442, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 223, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_with_optional_nvtx_range(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 109, in _call_with_optional_nvtx_range
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return callable_fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 998, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     def forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 57, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     raise e
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "<eval_with_key>.98", line 429, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     submod_7 = self.submod_7(getitem_21, s72, getitem_22, getitem_23, getitem_24);  getitem_21 = getitem_22 = getitem_23 = submod_7 = None
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     raise e
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "<eval_with_key>.8", line 5, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_8, key_8, value_9, output_10, 'model.layers.3.self_attn.attn');  query_8 = key_8 = value_9 = output_10 = unified_attention_with_output = None
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/utils/kv_transfer_utils.py", line 39, in wrapper
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 807, in unified_attention_with_output
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     self.impl.forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 1430, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     trtllm_batch_context_with_kv_cache(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 3644, in trtllm_batch_context_with_kv_cache
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     run_func(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] RuntimeError: Error in function 'TllmGenFmhaRunner' at /workspace/include/flashinfer/trtllm/fmha/fmhaRunner.cuh:30: Unsupported architecture
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] AsyncLLM output_handler failed.
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] Traceback (most recent call last):
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]     outputs = await engine_core.get_output_async()
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]     raise self._format_exception(outputs) from None
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] Error in completion stream generator.
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] Traceback (most recent call last):
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 352, in completion_stream_generator
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     async for prompt_idx, res in result_generator:
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/async_utils.py", line 278, in merge_async_iterators
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     async for item in iterators[0]:
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 439, in generate
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     out = q.get_nowait() or await q.get()
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]                             ^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 73, in get
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     raise output
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     outputs = await engine_core.get_output_async()
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     raise self._format_exception(outputs) from None
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W106 23:27:37.808671670 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=4548) INFO:     Shutting down
(APIServer pid=4548) INFO:     Waiting for application shutdown.
(APIServer pid=4548) INFO:     Application shutdown complete.
(APIServer pid=4548) INFO:     Finished server process [4548]

@eugr

eugr commented Jan 6, 2026

Same with flashinfer built from the source.

@seli-equinix
Author

@eugr I think you would need the latest PyTorch 2.11; that is what I have running. I think the reason yours is failing is that I am still working on getting flashinfer-cubin to work, so my build still uses the TRITON_ATTN backend. It should have switched to that instead of trying FlashInfer. I am hoping to get FlashInfer working soon.
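
For anyone hitting the FlashInfer "Unsupported architecture" error in the log above: until FlashInfer works on SM121, the attention backend can be pinned explicitly. A minimal sketch using vLLM's `VLLM_ATTENTION_BACKEND` environment variable (backend name as used in this thread; the model is just the one from the log):

```python
import os

# Force the Triton attention backend before the engine starts, so vLLM
# never dispatches to the FlashInfer TRT-LLM prefill path that raises
# "Unsupported architecture" on SM121.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN"

from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8")
```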

@mergify

mergify bot commented Feb 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 7, 2026
@mergify

mergify bot commented Feb 16, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@Borg2025

Building this on a DGX Spark (GB10, CUDA 13.0, Ubuntu 24.04 aarch64) right now — documenting the full process here: https://gist.github.com/Borg2025/8254034a2dfabab380758a76db8e111b

Key finding: torch==2.9.1+cu129 from PyTorch's index works on GB10 (SM 12.1 warning is harmless). Build is compiling with SM120 kernels enabled. Will update the gist with results and benchmarks (vLLM vs Ollama) when complete.

Related: #31588 (ohsono's GB10 bug report)

Thanks @seli-equinix for the GB10 patches — exactly what we needed. 🙏

@mergify

mergify bot commented Feb 16, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

1 similar comment

@seli-equinix
Author

> Building this on a DGX Spark (GB10, CUDA 13.0, Ubuntu 24.04 aarch64) right now — documenting the full process here: https://gist.github.com/Borg2025/8254034a2dfabab380758a76db8e111b
>
> Key finding: torch==2.9.1+cu129 from PyTorch's index works on GB10 (SM 12.1 warning is harmless). Build is compiling with SM120 kernels enabled. Will update the gist with results and benchmarks (vLLM vs Ollama) when complete.
>
> Related: #31588 (ohsono's GB10 bug report)
>
> Thanks @seli-equinix for the GB10 patches — exactly what we needed. 🙏

Do you use the vLLM Slack, or are you on the NVIDIA community, so we can talk?

@mergify

mergify bot commented Feb 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify bot commented Mar 5, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify

mergify bot commented Mar 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

seli-equinix and others added 13 commits March 10, 2026 23:02
This PR adds support for NVIDIA GB10 GPUs found in DGX Spark devices.
The GB10 reports compute capability 12.1 (SM121), which is part of the
Blackwell architecture family but uses a different major version than
the B100/B200 data center GPUs (SM10x).

Changes:
- Added is_blackwell_class() method to Platform interface and CudaPlatformBase
- Updated _get_backend_priorities() to handle SM10x, SM11x, and SM12x
- Replaced all is_device_capability_family(100) checks with is_blackwell_class()
- Updated attention backend compute capability checks
- Added docstrings explaining SM121 cuDNN/FlashInfer compatibility

Blackwell architecture family now includes:
- SM100/SM101: B100, B200 data center GPUs (major=10)
- SM120/SM121: GB10 DGX Spark, Thor edge devices (major=12)
- SM11x: Reserved for future Blackwell variants

cuDNN prefill support for SM121:
The cuDNN SDPA cubins (named cudnn_sm100_*) are architecture-family
binaries that support all Blackwell variants. FlashInfer explicitly
supports SM121 (beta) and dispatches SM100, SM110, SM120, and SM121
to the same gen_fmha_cutlass_sm100a_module. The has_nvidia_artifactory()
check ensures cubins are available before enabling this feature.

Tested on: NVIDIA GB10 (DGX Spark 2) with CUDA 13.0 and PyTorch 2.11

Signed-off-by: seli-equinix <seli@equinix.com>
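
A minimal sketch of the family check this commit describes (the real method lives on `CudaPlatformBase`; the `torch` call here is only for illustration):

```python
import torch

def is_blackwell_class(major: int) -> bool:
    # Per the commit message: SM10x (B100/B200), SM11x (reserved),
    # and SM12x (GB10 / DGX Spark) are all Blackwell-family.
    return major in (10, 11, 12)

major, _minor = torch.cuda.get_device_capability()
if is_blackwell_class(major):
    print(f"SM{major}x: Blackwell-class GPU detected")
```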
Cherry-pick from fix/tool-call-empty-arguments branch.
Prevents JSONDecodeError with Continue VSCode extension.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
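
A sketch of the guard this fix amounts to, assuming tool-call arguments arrive as a raw string (function name hypothetical):

```python
import json

def parse_tool_arguments(arguments: str | None) -> dict:
    # Empty or whitespace-only arguments previously reached json.loads("")
    # and raised JSONDecodeError; treat them as an empty argument dict.
    if arguments is None or not arguments.strip():
        return {}
    return json.loads(arguments)

assert parse_tool_arguments("   ") == {}
assert parse_tool_arguments('{"path": "a.py"}') == {"path": "a.py"}
```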
Cherry-pick from PR vllm-project#32704 - auto-detects GPU arch >= 110 and
configures TRITON_PTXAS_PATH to use system CUDA toolkit's ptxas
instead of Triton's bundled version (CUDA 12.8) which doesn't
support sm_121a.

This ensures Triton kernels compile correctly on DGX Spark GB10.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
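
Roughly what the auto-configuration does, as a sketch (helper name hypothetical; the env var and ptxas path are the ones named above):

```python
import os
import torch

def maybe_set_triton_ptxas_path() -> None:
    # Triton's bundled ptxas (CUDA 12.8) cannot emit sm_121a, so point
    # Triton at the system toolkit's ptxas on arch >= 110 hardware.
    major, minor = torch.cuda.get_device_capability()
    ptxas = "/usr/local/cuda/bin/ptxas"
    if (major, minor) >= (11, 0) and os.path.isfile(ptxas):
        os.environ.setdefault("TRITON_PTXAS_PATH", ptxas)
```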
The CMakeLists.txt SM120 kernel checks only included 12.0a/12.0f but not
12.1a/12.1f. This caused builds targeting SM121 (DGX Spark GB10) to miss
the CUTLASS scaled_mm, FP4, MLA, and MOE kernels.

Updated checks:
- SCALED_MM_ARCHS: 12.0f;12.1f (CUDA 13+) and 12.0a;12.1a (CUDA <13)
- FP4_ARCHS: 12.0f;12.1f (CUDA 13+) and 12.0a;12.1a (CUDA <13)
- MLA_ARCHS: Added 12.1f (CUDA 13+)
- CUTLASS_MOE_DATA_ARCHS: Added 12.1f (CUDA 13+)

This fixes:
NotImplementedError: No compiled cutlass_scaled_mm for a compute
capability less than CUDA device capability: 121

Signed-off-by: seli-equinix <seli@equinix.com>
Add optimized fused MoE kernel configuration for NVIDIA GB10
(SM121/DGX Spark) with FP8 w8a8 quantization.

Config parameters adjusted for GB10's 48 SMs:
- Reduced GROUP_SIZE_M (16-32 vs 64) for better SM utilization
- Based on B200 config with SM-count-aware adjustments

This eliminates the "Using default MoE config" warning and
provides tuned block sizes for Qwen3-Next-80B-A3B-FP8 and
similar MoE models with E=512, N=512.

Signed-off-by: seli-equinix <seli@equinix.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
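
For context, these tuning files map batch sizes to Triton launch parameters. A sketch of the shape of one entry (filename convention assumed; the numbers are placeholders, not the tuned GB10 values):

```python
# e.g. vllm/model_executor/layers/fused_moe/configs/
#      E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8.json  (assumed name)
example_entry = {
    "1": {                    # batch size
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 16,   # kept small for GB10's 48 SMs
        "num_warps": 4,
        "num_stages": 3,
    },
}
```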
1. FlashMLA Sparse detection (flashmla.py):
   - Changed is_blackwell_class() to is_device_capability_family(100)
   - FlashMLA Sparse only supports SM90/SM100, NOT SM12x
   - Updated error message to be more specific

2. CMakeLists.txt MLA_ARCHS:
   - Removed SM12x (12.0f, 12.1f, 12.0a, 12.1a) from MLA_ARCHS
   - CUTLASS MLA only supports SM10x, SM12x uses TRITON_MLA
   - Added clarifying comment

3. cuda.py documentation:
   - Removed unverified "Thor" device reference
   - Only tested hardware (GB10 DGX Spark) now mentioned

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
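
In sketch form, the distinction this commit restores (exact families rather than the broad Blackwell-class test; function name hypothetical):

```python
def flashmla_sparse_supported(major: int) -> bool:
    # FlashMLA Sparse ships kernels for SM90 and SM100 only; SM12x
    # must fall back to TRITON_MLA instead.
    return major in (9, 10)
```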
- Collect tokens/sec, latency, TPS benchmarks
- Use plot plugin for trend graphs over builds
- CSV output for metrics tracking

Signed-off-by: seli-equinix <seli@equinix.com>
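
A minimal sketch of the kind of metric collection described (all names hypothetical):

```python
import csv
import time

def record_run(path: str, model: str, num_tokens: int, started: float) -> None:
    # Append one tokens/sec sample per benchmark run; a plot plugin can
    # then graph the trend across builds from this CSV.
    tps = num_tokens / (time.perf_counter() - started)
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([model, num_tokens, f"{tps:.1f}"])
```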
The _supports_quant_scheme() check restricted FP8 block-scale
(kFp8Static128BlockSym) to exact SM90 only. SM121 (GB10 DGX Spark)
has working FP8 block-scale support via FlashInfer v0.6.3 native
group_gemm_fp8_nt_groupwise path. Add is_device_capability_family(120)
to the FP8 block-scale check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
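
A sketch of the widened check (function name hypothetical; the FlashInfer path is the one named in the commit):

```python
def supports_fp8_block_scale(major: int, minor: int) -> bool:
    # Previously exact SM90 only; SM12x (GB10) also works via
    # FlashInfer's group_gemm_fp8_nt_groupwise path.
    return (major, minor) == (9, 0) or major == 12
```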
Separate FP4-specific utility checks (fp4_quantize,
nvfp4_block_scale_interleave) into has_flashinfer_nvfp4() so that
has_flashinfer_cutlass_fused_moe() only checks for the core CUTLASS
MoE entry point. This allows FP8 CUTLASS MoE to work on SM121 (GB10)
which has cutlass_fused_moe but lacks FP4 utilities.

Also gate the nvfp4 quant scheme on has_flashinfer_nvfp4() in
FlashInferExperts._supports_quant_scheme().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
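
A sketch of the split described above (module paths assumed; the attribute names are the ones in the commit message):

```python
import importlib

def _has(module: str, attr: str) -> bool:
    try:
        return hasattr(importlib.import_module(module), attr)
    except ImportError:
        return False

def has_flashinfer_cutlass_fused_moe() -> bool:
    # Core CUTLASS MoE entry point only; enough for FP8 MoE on SM121.
    return _has("flashinfer.fused_moe", "cutlass_fused_moe")

def has_flashinfer_nvfp4() -> bool:
    # FP4-specific utilities, gated separately since GB10 lacks them.
    return _has("flashinfer", "fp4_quantize") and _has(
        "flashinfer", "nvfp4_block_scale_interleave"
    )
```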
GraphPickler raises 'Unexpected raw Node during pickling' when node.meta
contains raw Node objects in keys beyond the default filter list
(source_fn_stack, nn_module_stack, fwd_source_fn_stack). This occurs on
PyTorch nightly 2.11.0+ where additional FX passes inject Node refs into
metadata fields like from_node.

Fix: Walk all node.meta values recursively and strip any key whose value
tree contains a torch.fx.Node reference before calling GraphPickler.dumps().
Also add a fallback Node handler in the custom reducer_override as a
safety net for any references that slip through complex nested structures.

This fixes the AOT cache being written as 0 bytes on every startup,
which caused 'Ran out of input' errors on subsequent loads and forced
a full torch.compile recompilation (~11s) on every restart.

Signed-off-by: seli-equinix <seli@equinix.com>
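
The fix in sketch form, assuming a torch.fx GraphModule (helper names hypothetical):

```python
import torch.fx

def _contains_fx_node(value) -> bool:
    # Recursively look for raw torch.fx.Node references in a meta value.
    if isinstance(value, torch.fx.Node):
        return True
    if isinstance(value, dict):
        return any(_contains_fx_node(v) for v in value.values())
    if isinstance(value, (list, tuple, set)):
        return any(_contains_fx_node(v) for v in value)
    return False

def strip_node_meta(gm: torch.fx.GraphModule) -> None:
    # Drop any meta key whose value tree holds a Node, so GraphPickler
    # never encounters a raw Node outside the graph itself.
    for node in gm.graph.nodes:
        for key in [k for k, v in node.meta.items() if _contains_fx_node(v)]:
            del node.meta[key]
```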
@mergify

mergify bot commented Mar 11, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify

mergify bot commented Mar 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

ci/build, documentation, frontend, needs-rebase, nvidia, qwen, v1
