
feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support #31740

Open
seli-equinix wants to merge 13 commits into vllm-project:main from seli-equinix:feature/sm121-gb10-support

Conversation

@seli-equinix

@seli-equinix seli-equinix commented Jan 5, 2026

Summary

With lots of help from others, this PR adds support for the NVIDIA GB10 GPU (SM121) found in DGX Spark devices. It extends Blackwell architecture support beyond SM100/SM103 to include the SM12x family.

Commits in This PR

| Commit | Description |
| --- | --- |
| e3554dab0 | feat: Add SM121/GB10 Blackwell-class GPU support - Core platform detection with is_blackwell_class() |
| 06a6e576e | fix: Handle empty/whitespace tool_call arguments - Chat preprocessing fix |
| 8086e764f | feat: Auto-configure TRITON_PTXAS_PATH for SM121/GB10 - Ensures Triton kernel compilation works |
| 64b34ee80 | fix: Add SM121 (12.1) arch support for CUTLASS SM120 kernels - CMakeLists.txt fixes for 12.1a/12.1f |
| 117a69efc | feat: Add NVIDIA GB10 MoE tuning config for FP8 - Optimized kernel configs for Qwen3 MoE |
| 397a80eff | chore: Add attention backend documentation generator - Pre-commit hook from upstream |

Key Changes

Platform Detection

  • Added is_blackwell_class() method to detect SM10x, SM11x, and SM12x GPUs
  • Updated backend selection to properly handle GB10/SM121
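
A minimal sketch of the idea, not the exact vLLM implementation: the detection boils down to comparing the CUDA compute-capability major version against the Blackwell-class family.

```python
# Hedged sketch: treat SM10x, SM11x, and SM12x as one Blackwell-class family.
# The function name matches the PR description; the signature is simplified
# (vLLM's platform method takes no capability arguments).
def is_blackwell_class(major: int) -> bool:
    return major in (10, 11, 12)

assert is_blackwell_class(10)       # SM100
assert is_blackwell_class(12)       # SM121 (GB10 / DGX Spark)
assert not is_blackwell_class(9)    # SM90 (Hopper)
```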

CUTLASS Kernel Support

  • Extended SCALED_MM_ARCHS to include 12.1f (CUDA 13+) and 12.1a (CUDA <13)
  • Extended FP4_ARCHS, MLA_ARCHS, CUTLASS_MOE_DATA_ARCHS for SM121
  • Fixes `NotImplementedError: No compiled cutlass_scaled_mm for compute capability: 121`
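
For context, that error comes from a runtime capability gate. The sketch below only illustrates the failure mode, with a hypothetical capability set rather than vLLM's actual dispatcher:

```python
# Illustrative only: a hypothetical pre-PR set of compiled capabilities,
# not vLLM's real dispatcher. Without 12.1 in the CMake arch lists, no
# SM121 kernel exists and the lookup raises.
COMPILED_SCALED_MM_CAPABILITIES = {90, 100, 120}

def check_scaled_mm_support(capability: int) -> None:
    if capability not in COMPILED_SCALED_MM_CAPABILITIES:
        raise NotImplementedError(
            f"No compiled cutlass_scaled_mm for compute capability: {capability}"
        )

check_scaled_mm_support(121)  # raises unless SM121 kernels were compiled in
```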

Triton Compilation

  • Auto-configures TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas when SM121 detected
  • Required because Triton doesn't recognize SM121 natively
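
A minimal sketch of this auto-configuration, assuming the SM121 check has already happened upstream; the exact hook point and helper name in vLLM may differ:

```python
import os

# Hedged sketch: point Triton at the CUDA toolkit's ptxas, which understands
# SM121, instead of the ptxas bundled with Triton. setdefault() keeps any
# value the user already exported.
def maybe_set_triton_ptxas(is_sm121: bool) -> None:
    if is_sm121:
        os.environ.setdefault("TRITON_PTXAS_PATH", "/usr/local/cuda/bin/ptxas")
```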

MoE Kernel Tuning

  • Added optimized FP8 w8a8 MoE configs for GB10 with Qwen3-Next-80B-A3B-FP8
  • 18 batch sizes tuned: 1, 2, 4, 8, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024, 1536, 2048, 3072, 4096
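
For illustration, vLLM's fused-MoE tuning files are JSON keyed by batch size; the sketch below shows the shape of such a config with placeholder values, not the actual numbers tuned in this PR:

```python
# Placeholder values only - the shape follows vLLM's fused-MoE config schema
# (one Triton kernel config per tuned batch size), but these are not the
# tuned numbers shipped in this PR.
gb10_fp8_moe_config = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "4096": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4},
    # ...entries for the remaining tuned batch sizes...
}
```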

Hardware Tested

| Property | Value |
| --- | --- |
| Device | NVIDIA DGX Spark with GB10 GPU |
| Compute Capability | SM121 (12.1) |
| CUDA Version | 13.1.0 |
| PyTorch Version | 2.11.0+cu130 |
| Architecture | ARM64 (aarch64) |
| Memory | 119.6 GB unified |
| SMs | 48 |

Performance Results

| Model | Tokens/sec | Context Length |
| --- | --- | --- |
| Qwen3-Coder-30B-A3B-FP8 | 44 tok/s | 128K |
| Qwen3-Next-80B-A3B-FP8 | 45 tok/s | 256K |
| qwen3-embedding | | |
| Qwen3-vision | 45 tok/s | 256K |

Backend Support Matrix for SM121

| Backend | Status | Notes |
| --- | --- | --- |
| TRITON_ATTN | ✅ Works | Primary attention backend |
| TRITON_MLA | ✅ Works | Primary MLA backend |
| FlashInfer attention | ✅ Works | Supports SM75-SM121 |
| FlashInfer MLA | ❌ SM100 only | Falls back to TRITON_MLA |
| CUTLASS MLA | ❌ SM100 only | Falls back to TRITON_MLA |
| FlashMLA Sparse | ❌ SM90/100 only | Not supported |
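
Put as pseudocode, the fallback in the matrix above amounts to something like the following sketch (function and constant names are assumptions, not vLLM's exact selection code in vllm/platforms/cuda.py):

```python
# Hedged sketch of the MLA fallback described in the matrix above; the real
# backend-priority logic is more involved.
def select_mla_backend(is_family_100: bool) -> str:
    if is_family_100:
        return "FLASHINFER_MLA"  # SM100-only (CUTLASS_MLA likewise)
    return "TRITON_MLA"          # primary MLA backend on SM121
```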

Files Changed

Platform/Detection

  • vllm/platforms/interface.py - Added is_blackwell_class() abstract method
  • vllm/platforms/cuda.py - Implemented Blackwell-class detection

Build System

  • CMakeLists.txt - Added SM121 (12.1a/12.1f) to CUTLASS kernel arch lists

Attention/MLA

  • vllm/attention/utils/fa_utils.py - Extended FA fallback to SM12x
  • vllm/v1/attention/backends/flashinfer.py - Added SM12x to HND layout support

MoE/Quantization

  • vllm/model_executor/layers/fused_moe/configs/ - Added GB10 tuning configs

Auto-generated

  • docs/design/attention_backends.md - Regenerated by pre-commit hook
  • tools/pre_commit/generate_attention_backend_docs.py - From upstream

Notes

  • SM121 uses TRITON backends as primary since FlashInfer MLA/CUTLASS MLA only support SM100
  • GB10 has unified memory architecture which affects memory utilization differently than discrete GPUs
  • The MoE tuning configs provide significant speedup for Mixture-of-Experts models

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for NVIDIA GB10 (SM121) GPUs by introducing an is_blackwell_class() method and replacing hardcoded checks for SM100. The changes are well-structured and cover multiple parts of the codebase, including attention backends and quantization layers.

I've identified a critical issue: an inconsistency between a comment and the code for FlashInfer autotuning, which could impact performance on new hardware. Additionally, there are a couple of instances of duplicated is_blackwell_class logic that should be addressed to improve maintainability.

Overall, this is a good contribution to extend hardware support. Addressing the identified issues will make the changes more robust and easier to maintain.

@github-actions

github-actions bot commented Jan 5, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Jan 5, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Contributor

Copilot AI left a comment


Pull request overview

This PR extends vLLM's Blackwell architecture support to include the SM121/GB10 GPU found in DGX Spark devices, moving from device-specific checks to a unified Blackwell-class detection approach.

Key changes:

  • Introduced is_blackwell_class() method to detect SM10x, SM11x, and SM12x GPUs as a unified Blackwell family
  • Replaced scattered is_device_capability_family(100) checks with the new class-based detection throughout the codebase
  • Added MoE configuration files optimized for GB10 hardware

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| vllm/platforms/interface.py | Added is_blackwell_class() method to platform interface with documentation for SM10x/11x/12x detection |
| vllm/platforms/cuda.py | Implemented Blackwell-class detection helper and integrated it into backend priority selection |
| vllm/attention/utils/fa_utils.py | Extended Flash Attention v3 fallback logic to include SM12x devices |
| vllm/v1/attention/backends/flashinfer.py | Updated HND layout detection and head dimension validation for Blackwell-class |
| vllm/v1/attention/backends/mla/*.py | Extended MLA backend compute capability checks to support SM12x |
| vllm/utils/*.py | Updated FlashInfer and DeepGemm utility functions to use Blackwell-class detection |
| vllm/model_executor/layers/quantization/*.py | Updated quantization backend selection logic for Blackwell-class |
| vllm/model_executor/layers/fused_moe/configs/*.json | Added GB10-specific MoE configuration files with Triton kernel parameters |
| vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py | Updated DeepGemm packed activation scale support check |
| vllm/model_executor/layers/batch_invariant.py | Extended batch-invariant mode enablement to Blackwell-class |
| vllm/model_executor/models/config.py | Updated kernel block alignment check for Blackwell-class |
| vllm/model_executor/warmup/kernel_warmup.py | Updated comment to reflect Blackwell-class architecture support |


@mergify

mergify bot commented Jan 5, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@eugr

eugr commented Jan 6, 2026

@seli-equinix - what were your "before" numbers? I'm seeing about the same (or even better) performance without this change on my Spark with FP8 models. Have you tested any FP4 (NVFP4/MXFP4) models? Also, what do you mean by "context" here - the allocated KV cache size, or inference numbers for a request context of that size? If the latter, that's great, but I'd like to see the measurement methodology. If the former, here is a run from a few weeks ago on main-branch vLLM and a single DGX Spark:

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888

Jan: 44 t/s

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --port 8000 \
  --num-prompts 1
============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.79
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.36
Output token throughput (tok/s):         42.61
Peak output token throughput (tok/s):    44.00
Peak concurrent requests:                1.00
Total Token throughput (tok/s):          46.90
---------------Time to First Token----------------
Mean TTFT (ms):                          131.13
Median TTFT (ms):                        131.13
P99 TTFT (ms):                           131.13
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.55
Median TPOT (ms):                        22.55
P99 TPOT (ms):                           22.55
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.55
Median ITL (ms):                         22.38
P99 ITL (ms):                            25.39
==================================================

@seli-equinix
Author

@copilot open a new pull request to apply changes based on the comments in this thread

@mergify

mergify bot commented Jan 6, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@seli-equinix force-pushed the feature/sm121-gb10-support branch from 2434173 to 642709a on January 6, 2026 at 19:33
@mergify

mergify bot commented Jan 6, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@seli-equinix force-pushed the feature/sm121-gb10-support branch from d8ae522 to 9fbcb4c on January 6, 2026 at 19:43
@seli-equinix
Author

@eugr please reach out to me at hellohal2064@gmail.com, or you can give me a call at 971-708-9761. Would be happy to collaborate on vLLM running on the DGX Spark :)

@seli-equinix
Author

@eugr thanks for the detailed questions! I ran benchmarks on my Spark using the same methodology.

My setup:
  • Model: Qwen3-Next-80B-A3B-FP8
  • vLLM: built from this branch with SM121 detection
  • Input: ~12 tokens/request, output: 120 tokens max

============ Serving Benchmark Result ============
Successful requests: 1
Failed requests: 0
Benchmark duration (s): 2.70
Total input tokens: 12
Total generated tokens: 120
Request throughput (req/s): 0.37
Output token throughput (tok/s): 44.43
Peak output token throughput (tok/s): 44.43
Peak concurrent requests: 1.00
Total Token throughput (tok/s): 48.88
---------------Time to First Token----------------
Mean TTFT (ms): 95.63
Median TTFT (ms): 95.63
P99 TTFT (ms): 95.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 21.89
Median TPOT (ms): 21.94
P99 TPOT (ms): 24.22
---------------Inter-token Latency----------------
Mean ITL (ms): 21.89
Median ITL (ms): 21.94
P99 ITL (ms): 24.22

What this PR actually fixes:
The PR ensures SM121 (GB10/Spark) is recognized as a Blackwell-class architecture for attention backend selection. Without it, the is_device_capability_family(100) checks fail for SM121, since its major version is 12, not 10.
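
To make that concrete, here is a two-line illustration of why a family-100 check misses GB10 (capability values taken from the PyTorch warning later in this thread; the broader check mirrors the PR's approach):

```python
# GB10 reports compute capability 12.1, so a family-100 check (major == 10)
# evaluates False and the SM100-only code paths are skipped on SM121.
major, minor = 12, 1
is_family_100 = major == 10               # False on GB10/SM121
is_blackwell_class = major in (10, 11, 12)  # True - the check this PR adds
```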

Your question about "before":
I should clarify - I don't have before/after performance numbers because my goal was compatibility, not optimization. What specific error (if any) were you seeing before this landed in main?

You're running on main branch successfully?
If you're running Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 on main without issues, that's interesting. Can you share:

  1. What vLLM version/commit?
  2. Any startup warnings about unsupported architecture?
  3. Which attention backend is being selected?

My "context" reference:
I meant the model's configured context window (--max-model-len), not actual inference benchmarks with long prompts.

FP4:
Haven't tested NVFP4/MXFP4; my needs are focused on the smart Qwen models. I am building a custom MCP server that runs on my second Spark - an LLM-backed learning and memory system.

Would love to compare notes - sounds like we're both working on Spark optimization!

@eugr

eugr commented Jan 6, 2026

@seli-equinix - it's never been an issue for me on the main branch (after the initial Spark woes were addressed).
There is a startup warning from PyTorch, but it can be ignored as it doesn't seem to affect performance (I tried a patched version of PyTorch too).

Here is the model running on vLLM compiled from the main branch yesterday using my Docker build:

root@spark:/workspace/vllm# vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888
(APIServer pid=799) INFO 01-06 21:04:06 [api_server.py:1278] vLLM API server version 0.14.0rc1.dev265+g951302989.d20260105
(APIServer pid=799) INFO 01-06 21:04:06 [utils.py:253] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'host': '0.0.0.0', 'port': 8888, 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'max_model_len': 131072, 'load_format': 'fastsafetensors', 'gpu_memory_utilization': 0.7}
(APIServer pid=799) INFO 01-06 21:04:11 [model.py:522] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=799) INFO 01-06 21:04:11 [model.py:1508] Using max model len 131072
(APIServer pid=799) WARNING 01-06 21:04:11 [vllm.py:1447] Current vLLM config is not set.
(APIServer pid=799) INFO 01-06 21:04:11 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=799) INFO 01-06 21:04:11 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=799) INFO 01-06 21:04:11 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=799) INFO 01-06 21:04:11 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=799) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(APIServer pid=799)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=799)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=799)     (8.0) - (12.0)
(APIServer pid=799)
(APIServer pid=799)   warnings.warn(
(APIServer pid=799) INFO 01-06 21:04:11 [config.py:469] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=799) INFO 01-06 21:04:11 [config.py:493] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=892) INFO 01-06 21:04:15 [core.py:96] Initializing a V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(EngineCore_DP0 pid=892)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=892)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=892)     (8.0) - (12.0)
(EngineCore_DP0 pid=892)
(EngineCore_DP0 pid=892)   warnings.warn(
(EngineCore_DP0 pid=892) INFO 01-06 21:04:15 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.24.104:59097 backend=nccl
(EngineCore_DP0 pid=892) INFO 01-06 21:04:15 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=892) INFO 01-06 21:04:16 [gpu_model_runner.py:3758] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(EngineCore_DP0 pid=892) INFO 01-06 21:04:16 [fp8.py:190] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=892) INFO 01-06 21:04:16 [fp8.py:209] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=892) INFO 01-06 21:04:17 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=892)   warnings.warn(
Loading safetensors using Fastsafetensor loader:  12% Completed | 1/8 [00:03<00:24,  3.50s/it]
Loading safetensors using Fastsafetensor loader:  25% Completed | 2/8 [00:05<00:14,  2.42s/it]
Loading safetensors using Fastsafetensor loader:  38% Completed | 3/8 [00:08<00:13,  2.65s/it]
Loading safetensors using Fastsafetensor loader:  50% Completed | 4/8 [00:10<00:10,  2.64s/it]
Loading safetensors using Fastsafetensor loader:  62% Completed | 5/8 [00:13<00:08,  2.83s/it]
Loading safetensors using Fastsafetensor loader:  75% Completed | 6/8 [00:16<00:05,  2.81s/it]
Loading safetensors using Fastsafetensor loader:  88% Completed | 7/8 [00:19<00:02,  2.90s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:22<00:00,  2.91s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:22<00:00,  2.84s/it]
(EngineCore_DP0 pid=892)
(EngineCore_DP0 pid=892) INFO 01-06 21:04:42 [default_loader.py:308] Loading weights took 22.69 seconds
(EngineCore_DP0 pid=892) INFO 01-06 21:04:43 [gpu_model_runner.py:3855] Model loading took 74.8851 GiB memory and 26.704041 seconds
(EngineCore_DP0 pid=892) INFO 01-06 21:04:48 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/412fb0c8f0/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=892) INFO 01-06 21:04:48 [backends.py:704] Dynamo bytecode transform time: 5.09 s
(EngineCore_DP0 pid=892) [rank0]:W0106 21:04:53.838000 892 torch/_inductor/utils.py:1613] Not enough SMs to use max_autotune_gemm mode
(EngineCore_DP0 pid=892) INFO 01-06 21:04:56 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=892) WARNING 01-06 21:04:59 [fused_moe.py:1054] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=892) INFO 01-06 21:06:22 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 88.76 s
(EngineCore_DP0 pid=892) INFO 01-06 21:06:22 [monitor.py:34] torch.compile takes 93.85 s in total
(EngineCore_DP0 pid=892) INFO 01-06 21:06:23 [gpu_worker.py:361] Available KV cache memory: 4.65 GiB
(EngineCore_DP0 pid=892) INFO 01-06 21:06:23 [kv_cache_utils.py:1305] GPU KV cache size: 50,592 tokens
(EngineCore_DP0 pid=892) INFO 01-06 21:06:23 [kv_cache_utils.py:1310] Maximum concurrency for 131,072 tokens per request: 1.53x
(EngineCore_DP0 pid=892) 2026-01-06 21:06:23,957 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=892) 2026-01-06 21:06:24,066 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 10.88it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:52<00:00,  1.50s/it]
(EngineCore_DP0 pid=892) INFO 01-06 21:07:22 [gpu_model_runner.py:4806] Graph capturing finished in 58 secs, took 2.52 GiB
(EngineCore_DP0 pid=892) INFO 01-06 21:07:22 [core.py:273] init engine (profile, create kv cache, warmup model) took 158.76 seconds
(EngineCore_DP0 pid=892) INFO 01-06 21:07:23 [core.py:185] Batch queue is enabled with size 2
(EngineCore_DP0 pid=892) INFO 01-06 21:07:23 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=799) INFO 01-06 21:07:23 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=799) WARNING 01-06 21:07:24 [model.py:1329] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=799) INFO 01-06 21:07:24 [serving_responses.py:201] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:24 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:24 [serving_chat.py:180] Warming up chat template processing...
(APIServer pid=799) INFO 01-06 21:07:25 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=799) INFO 01-06 21:07:25 [serving_chat.py:216] Chat template warmup completed in 1305.3ms
(APIServer pid=799) INFO 01-06 21:07:25 [serving_completion.py:78] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:25 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=799) INFO 01-06 21:07:25 [api_server.py:1352] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:38] Available routes are:
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=799) INFO 01-06 21:07:25 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=799) INFO:     Started server process [799]
(APIServer pid=799) INFO:     Waiting for application startup.
(APIServer pid=799) INFO:     Application startup complete.

There are a few warnings during inference that I haven't seen a month ago when I ran the model last time, but they don't seem to affect anything:

(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=892) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=892)   return fn(*contiguous_args, **contiguous_kwargs)

Benchmark on today run:

============ Serving Benchmark Result ============
Successful requests:                     1
Failed requests:                         0
Benchmark duration (s):                  2.67
Total input tokens:                      12
Total generated tokens:                  119
Request throughput (req/s):              0.37
Output token throughput (tok/s):         44.57
Peak output token throughput (tok/s):    46.00
Peak concurrent requests:                1.00
Total token throughput (tok/s):          49.06
---------------Time to First Token----------------
Mean TTFT (ms):                          93.94
Median TTFT (ms):                        93.94
P99 TTFT (ms):                           93.94
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.83
Median TPOT (ms):                        21.83
P99 TPOT (ms):                           21.83
---------------Inter-token Latency----------------
Mean ITL (ms):                           21.83
Median ITL (ms):                         21.77
P99 ITL (ms):                            23.79
==================================================

@seli-equinix
Author

@eugr That is really strange - I could not get the Docker (29.1.x) container to build without the changes in this PR. This is my setup from inside the container:

  • OS: Ubuntu 24.04.3 LTS (Noble Numbat)
  • Kernel: 6.14.0-1015-nvidia (aarch64)
  • PyTorch: 2.11.0.dev20260103+cu130 (nightly)
  • CUDA: 13.0
  • cuDNN: 9.15.01

@eugr

eugr commented Jan 6, 2026

It could be PyTorch. I had to revert to the release version of PyTorch, as pre-release ones were giving me weird errors.
What Dockerfile are you using - the one from the vLLM repo? It never really worked for me anyway. But cu130 wheels or just building from source works fine. You can try my build: https://github.com/eugr/spark-vllm-docker

CUDA 13.1 (but it worked with 13.0.2 just as well)
torch==2.9.1+cu130
cuDNN 9.17.0
flashinfer-cubin==0.6.0rc2
flashinfer-jit-cache==0.6.0rc2+cu130
flashinfer-python==0.6.0rc2
triton @ file:///workspace/wheels/triton-3.5.1-cp312-cp312-linux_aarch64.whl
triton-kernels @ file:///workspace/wheels/triton_kernels-1.0.0-py3-none-any.whl

Having said that, I'll try to apply your patch post-build and see if it improves anything.

@eugr

eugr commented Jan 6, 2026

@seli-equinix - Tried to run with your patches applied (since they are all Python, I just applied post-install), and it fails on inference with trtllm - architecture is not supported:

root@spark:/workspace/vllm# vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 --max-model-len 131072 --gpu-memory-utilization 0.7 --load-format fastsafetensors --host 0.0.0.0 --port 8888 --enable-prefix-caching
(APIServer pid=4548) INFO 01-06 23:25:32 [api_server.py:1278] vLLM API server version 0.14.0rc1.dev265+g951302989.d20260105
(APIServer pid=4548) INFO 01-06 23:25:32 [utils.py:253] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'host': '0.0.0.0', 'port': 8888, 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'max_model_len': 131072, 'load_format': 'fastsafetensors', 'gpu_memory_utilization': 0.7, 'enable_prefix_caching': True}
(APIServer pid=4548) INFO 01-06 23:25:33 [model.py:522] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=4548) INFO 01-06 23:25:33 [model.py:1508] Using max model len 131072
(APIServer pid=4548) WARNING 01-06 23:25:33 [vllm.py:1447] Current vLLM config is not set.
(APIServer pid=4548) INFO 01-06 23:25:33 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=4548) INFO 01-06 23:25:33 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=4548) INFO 01-06 23:25:33 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=4548) INFO 01-06 23:25:33 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=4548) INFO 01-06 23:25:33 [config.py:338] Hybrid or mamba-based model detected without support for prefix caching: disabling.
(APIServer pid=4548) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(APIServer pid=4548)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=4548)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=4548)     (8.0) - (12.0)
(APIServer pid=4548)
(APIServer pid=4548)   warnings.warn(
(APIServer pid=4548) INFO 01-06 23:25:34 [config.py:469] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=4548) INFO 01-06 23:25:34 [config.py:493] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [core.py:96] Initializing a V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:283: UserWarning:
(EngineCore_DP0 pid=4599)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=4599)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=4599)     (8.0) - (12.0)
(EngineCore_DP0 pid=4599)
(EngineCore_DP0 pid=4599)   warnings.warn(
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.24.104:46275 backend=nccl
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [gpu_model_runner.py:3758] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [fp8.py:190] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:38 [fp8.py:209] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:39 [cuda.py:381] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=4599) INFO 01-06 23:25:39 [selector.py:112] Using HND KV cache layout for FLASHINFER backend.
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/8 [00:00<?, ?it/s]
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=4599)   warnings.warn(
Loading safetensors using Fastsafetensor loader:  12% Completed | 1/8 [00:03<00:25,  3.60s/it]
Loading safetensors using Fastsafetensor loader:  25% Completed | 2/8 [00:05<00:15,  2.51s/it]
Loading safetensors using Fastsafetensor loader:  38% Completed | 3/8 [00:08<00:13,  2.61s/it]
Loading safetensors using Fastsafetensor loader:  50% Completed | 4/8 [00:10<00:10,  2.69s/it]
Loading safetensors using Fastsafetensor loader:  62% Completed | 5/8 [00:13<00:08,  2.69s/it]
Loading safetensors using Fastsafetensor loader:  75% Completed | 6/8 [00:16<00:05,  2.72s/it]
Loading safetensors using Fastsafetensor loader:  88% Completed | 7/8 [00:19<00:02,  2.76s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:21<00:00,  2.76s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 8/8 [00:21<00:00,  2.74s/it]
(EngineCore_DP0 pid=4599)
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:04 [default_loader.py:308] Loading weights took 21.96 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:05 [gpu_model_runner.py:3855] Model loading took 74.8851 GiB memory and 26.018481 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:10 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/d7e56f8d20/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:10 [backends.py:704] Dynamo bytecode transform time: 5.06 s
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:15 [backends.py:261] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=4599) WARNING 01-06 23:26:15 [fused_moe.py:1054] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8,block_shape=[128,128].json
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:19 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 3.95 s
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:19 [monitor.py:34] torch.compile takes 9.01 s in total
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [gpu_worker.py:361] Available KV cache memory: 5.20 GiB
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [kv_cache_utils.py:1305] GPU KV cache size: 56,576 tokens
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [kv_cache_utils.py:1310] Maximum concurrency for 131,072 tokens per request: 1.71x
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:20 [utils.py:465] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to HND.
(EngineCore_DP0 pid=4599) 2026-01-06 23:26:20,690 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=4599) 2026-01-06 23:26:20,805 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 12.32it/s]
Capturing CUDA graphs (decode, FULL):   0%|                                                                                                                                                                                                                                                                                                                                                                 | 0/35 [00:00<?, ?it/s](EngineCore_DP0 pid=4599) WARNING 01-06 23:26:25 [flashinfer.py:398] Using TRTLLM prefill attention (auto-detected).
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:04<00:00,  7.35it/s]
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:30 [gpu_model_runner.py:4806] Graph capturing finished in 10 secs, took 2.07 GiB
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:30 [core.py:273] init engine (profile, create kv cache, warmup model) took 25.38 seconds
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:31 [core.py:185] Batch queue is enabled with size 2
(EngineCore_DP0 pid=4599) INFO 01-06 23:26:32 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=4548) INFO 01-06 23:26:32 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=4548) WARNING 01-06 23:26:32 [model.py:1329] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_responses.py:201] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:32 [serving_chat.py:180] Warming up chat template processing...
(APIServer pid=4548) INFO 01-06 23:26:34 [chat_utils.py:599] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_chat.py:216] Chat template warmup completed in 1351.7ms
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_completion.py:78] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:34 [serving_chat.py:144] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=4548) INFO 01-06 23:26:34 [api_server.py:1352] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:38] Available routes are:
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=4548) INFO 01-06 23:26:34 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=4548) INFO:     Started server process [4548]
(APIServer pid=4548) INFO:     Waiting for application startup.
(APIServer pid=4548) INFO:     Application startup complete.
(APIServer pid=4548) INFO:     192.168.24.115:41154 - "POST /v1/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=4599)   return fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (12) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=4599)   return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.14.0rc1.dev265+g951302989.d20260105) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/d7e56f8d20', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/d7e56f8d20/rank_0_0/backbone'},
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-8a1ba3eb074cb7e0-0-a09f774b,prompt_token_ids_len=12,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=119, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={cmpl-8a1ba3eb074cb7e0-0-a09f774b: 12}, total_num_scheduled_tokens=12, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.009615384615384581, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] Traceback (most recent call last):
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 881, in run_engine_core
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 908, in run_busy_loop
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     self._process_engine_step()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 941, in _process_engine_step
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 455, in step_with_batch_queue
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     exec_model_fut.result()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.__get_result()
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     raise self._exception
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 79, in collective_rpc
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 369, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 622, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3276, in execute_model
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     model_output = self._model_forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                    ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2905, in _model_forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 220, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1232, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     hidden_states = self.model(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                     ^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 442, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 223, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_with_optional_nvtx_range(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 109, in _call_with_optional_nvtx_range
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return callable_fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 998, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     def forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 57, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     raise e
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "<eval_with_key>.98", line 429, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     submod_7 = self.submod_7(getitem_21, s72, getitem_22, getitem_23, getitem_24);  getitem_21 = getitem_22 = getitem_23 = submod_7 = None
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     raise e
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "<eval_with_key>.8", line 5, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_8, key_8, value_9, output_10, 'model.layers.3.self_attn.attn');  query_8 = key_8 = value_9 = output_10 = unified_attention_with_output = None
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/utils/kv_transfer_utils.py", line 39, in wrapper
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     return func(*args, **kwargs)
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 807, in unified_attention_with_output
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     self.impl.forward(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flashinfer.py", line 1430, in forward
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     trtllm_batch_context_with_kv_cache(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 3644, in trtllm_batch_context_with_kv_cache
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]     run_func(
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890]   File "python/tvm_ffi/cython/function.pxi", line 923, in tvm_ffi.core.Function.__call__
(EngineCore_DP0 pid=4599) ERROR 01-06 23:27:37 [core.py:890] RuntimeError: Error in function 'TllmGenFmhaRunner' at /workspace/include/flashinfer/trtllm/fmha/fmhaRunner.cuh:30: Unsupported architecture
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] AsyncLLM output_handler failed.
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] Traceback (most recent call last):
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]     outputs = await engine_core.get_output_async()
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543]     raise self._format_exception(outputs) from None
(APIServer pid=4548) ERROR 01-06 23:27:37 [async_llm.py:543] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] Error in completion stream generator.
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] Traceback (most recent call last):
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 352, in completion_stream_generator
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     async for prompt_idx, res in result_generator:
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/async_utils.py", line 278, in merge_async_iterators
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     async for item in iterators[0]:
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 439, in generate
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     out = q.get_nowait() or await q.get()
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]                             ^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 73, in get
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     raise output
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     outputs = await engine_core.get_output_async()
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512]     raise self._format_exception(outputs) from None
(APIServer pid=4548) ERROR 01-06 23:27:37 [serving_completion.py:512] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
[rank0]:[W106 23:27:37.808671670 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=4548) INFO:     Shutting down
(APIServer pid=4548) INFO:     Waiting for application shutdown.
(APIServer pid=4548) INFO:     Application shutdown complete.
(APIServer pid=4548) INFO:     Finished server process [4548]

@eugr

eugr commented Jan 6, 2026

Same with flashinfer built from the source.

@seli-equinix
Author

@eugr I think you would need the latest PyTorch 2.11; that is what I have running. I think the reason yours is failing is that I am still working on getting flashinfer-cubin to work, so my build still uses the TRITON_ATTN backend. It should have switched to that instead of trying FlashInfer. I am hoping to get FlashInfer working soon.
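
For anyone hitting the FlashInfer "Unsupported architecture" error in the log above: until FlashInfer works on SM121, the attention backend can be pinned explicitly. A minimal sketch using vLLM's `VLLM_ATTENTION_BACKEND` environment variable (backend name as used in this thread; the model is just the one from the log):

```python
import os

# Force the Triton attention backend before the engine starts, so vLLM
# never dispatches to the FlashInfer TRT-LLM prefill path that raises
# "Unsupported architecture" on SM121.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN"

from vllm import LLM

llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8")
```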

@mergify

mergify bot commented Feb 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 7, 2026
@mergify

mergify bot commented Feb 16, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@Borg2025

Building this on a DGX Spark (GB10, CUDA 13.0, Ubuntu 24.04 aarch64) right now — documenting the full process here: https://gist.github.com/Borg2025/8254034a2dfabab380758a76db8e111b

Key finding: torch==2.9.1+cu129 from PyTorch's index works on GB10 (SM 12.1 warning is harmless). Build is compiling with SM120 kernels enabled. Will update the gist with results and benchmarks (vLLM vs Ollama) when complete.

Related: #31588 (ohsono's GB10 bug report)

Thanks @seli-equinix for the GB10 patches — exactly what we needed. 🙏

@mergify

mergify bot commented Feb 16, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

1 similar comment

@seli-equinix
Author

> Building this on a DGX Spark (GB10, CUDA 13.0, Ubuntu 24.04 aarch64) right now — documenting the full process here: https://gist.github.com/Borg2025/8254034a2dfabab380758a76db8e111b
>
> Key finding: torch==2.9.1+cu129 from PyTorch's index works on GB10 (SM 12.1 warning is harmless). Build is compiling with SM120 kernels enabled. Will update the gist with results and benchmarks (vLLM vs Ollama) when complete.
>
> Related: #31588 (ohsono's GB10 bug report)
>
> Thanks @seli-equinix for the GB10 patches — exactly what we needed. 🙏

Do you use the vLLM Slack, or are you on the NVIDIA community, so we can talk?

@mergify

mergify bot commented Feb 18, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify bot commented Mar 5, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify

mergify bot commented Mar 8, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

seli-equinix and others added 13 commits March 10, 2026 23:02
This PR adds support for NVIDIA GB10 GPUs found in DGX Spark devices.
The GB10 reports compute capability 12.1 (SM121), which is part of the
Blackwell architecture family but uses a different major version than
the B100/B200 data center GPUs (SM10x).

Changes:
- Added is_blackwell_class() method to Platform interface and CudaPlatformBase
- Updated _get_backend_priorities() to handle SM10x, SM11x, and SM12x
- Replaced all is_device_capability_family(100) checks with is_blackwell_class()
- Updated attention backend compute capability checks
- Added docstrings explaining SM121 cuDNN/FlashInfer compatibility

Blackwell architecture family now includes:
- SM100/SM101: B100, B200 data center GPUs (major=10)
- SM120/SM121: GB10 DGX Spark, Thor edge devices (major=12)
- SM11x: Reserved for future Blackwell variants

cuDNN prefill support for SM121:
The cuDNN SDPA cubins (named cudnn_sm100_*) are architecture-family
binaries that support all Blackwell variants. FlashInfer explicitly
supports SM121 (beta) and dispatches SM100, SM110, SM120, and SM121
to the same gen_fmha_cutlass_sm100a_module. The has_nvidia_artifactory()
check ensures cubins are available before enabling this feature.

Tested on: NVIDIA GB10 (DGX Spark 2) with CUDA 13.0 and PyTorch 2.11

Signed-off-by: seli-equinix <seli@equinix.com>
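
A minimal sketch of the family check this commit describes (the real method lives on `CudaPlatformBase`; the `torch` call here is only for illustration):

```python
import torch

def is_blackwell_class(major: int) -> bool:
    # Per the commit message: SM10x (B100/B200), SM11x (reserved),
    # and SM12x (GB10 / DGX Spark) are all Blackwell-family.
    return major in (10, 11, 12)

major, _minor = torch.cuda.get_device_capability()
if is_blackwell_class(major):
    print(f"SM{major}x: Blackwell-class GPU detected")
```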
Cherry-pick from fix/tool-call-empty-arguments branch.
Prevents JSONDecodeError with Continue VSCode extension.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
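
A sketch of the guard this fix amounts to, assuming tool-call arguments arrive as a raw string (function name hypothetical):

```python
import json

def parse_tool_arguments(arguments: str | None) -> dict:
    # Empty or whitespace-only arguments previously reached json.loads("")
    # and raised JSONDecodeError; treat them as an empty argument dict.
    if arguments is None or not arguments.strip():
        return {}
    return json.loads(arguments)

assert parse_tool_arguments("   ") == {}
assert parse_tool_arguments('{"path": "a.py"}') == {"path": "a.py"}
```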
Cherry-pick from PR vllm-project#32704 - auto-detects GPU arch >= 110 and
configures TRITON_PTXAS_PATH to use system CUDA toolkit's ptxas
instead of Triton's bundled version (CUDA 12.8) which doesn't
support sm_121a.

This ensures Triton kernels compile correctly on DGX Spark GB10.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
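
Roughly what the auto-configuration does, as a sketch (helper name hypothetical; the env var and ptxas path are the ones named above):

```python
import os
import torch

def maybe_set_triton_ptxas_path() -> None:
    # Triton's bundled ptxas (CUDA 12.8) cannot emit sm_121a, so point
    # Triton at the system toolkit's ptxas on arch >= 110 hardware.
    major, minor = torch.cuda.get_device_capability()
    ptxas = "/usr/local/cuda/bin/ptxas"
    if (major, minor) >= (11, 0) and os.path.isfile(ptxas):
        os.environ.setdefault("TRITON_PTXAS_PATH", ptxas)
```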
The CMakeLists.txt SM120 kernel checks only included 12.0a/12.0f but not
12.1a/12.1f. This caused builds targeting SM121 (DGX Spark GB10) to miss
the CUTLASS scaled_mm, FP4, MLA, and MOE kernels.

Updated checks:
- SCALED_MM_ARCHS: 12.0f;12.1f (CUDA 13+) and 12.0a;12.1a (CUDA <13)
- FP4_ARCHS: 12.0f;12.1f (CUDA 13+) and 12.0a;12.1a (CUDA <13)
- MLA_ARCHS: Added 12.1f (CUDA 13+)
- CUTLASS_MOE_DATA_ARCHS: Added 12.1f (CUDA 13+)

This fixes:
NotImplementedError: No compiled cutlass_scaled_mm for a compute
capability less than CUDA device capability: 121

Signed-off-by: seli-equinix <seli@equinix.com>
Add optimized fused MoE kernel configuration for NVIDIA GB10
(SM121/DGX Spark) with FP8 w8a8 quantization.

Config parameters adjusted for GB10's 48 SMs:
- Reduced GROUP_SIZE_M (16-32 vs 64) for better SM utilization
- Based on B200 config with SM-count-aware adjustments

This eliminates the "Using default MoE config" warning and
provides tuned block sizes for Qwen3-Next-80B-A3B-FP8 and
similar MoE models with E=512, N=512.

Signed-off-by: seli-equinix <seli@equinix.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
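
For context, these tuning files map batch sizes to Triton launch parameters. A sketch of the shape of one entry (filename convention assumed; the numbers are placeholders, not the tuned GB10 values):

```python
# e.g. vllm/model_executor/layers/fused_moe/configs/
#      E=512,N=512,device_name=NVIDIA_GB10,dtype=fp8_w8a8.json  (assumed name)
example_entry = {
    "1": {                    # batch size
        "BLOCK_SIZE_M": 16,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 16,   # kept small for GB10's 48 SMs
        "num_warps": 4,
        "num_stages": 3,
    },
}
```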
1. FlashMLA Sparse detection (flashmla.py):
   - Changed is_blackwell_class() to is_device_capability_family(100)
   - FlashMLA Sparse only supports SM90/SM100, NOT SM12x
   - Updated error message to be more specific

2. CMakeLists.txt MLA_ARCHS:
   - Removed SM12x (12.0f, 12.1f, 12.0a, 12.1a) from MLA_ARCHS
   - CUTLASS MLA only supports SM10x, SM12x uses TRITON_MLA
   - Added clarifying comment

3. cuda.py documentation:
   - Removed unverified "Thor" device reference
   - Only tested hardware (GB10 DGX Spark) now mentioned

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
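
In sketch form, the distinction this commit restores (exact families rather than the broad Blackwell-class test; function name hypothetical):

```python
def flashmla_sparse_supported(major: int) -> bool:
    # FlashMLA Sparse ships kernels for SM90 and SM100 only; SM12x
    # must fall back to TRITON_MLA instead.
    return major in (9, 10)
```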
- Collect tokens/sec, latency, TPS benchmarks
- Use plot plugin for trend graphs over builds
- CSV output for metrics tracking

Signed-off-by: seli-equinix <seli@equinix.com>
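
A minimal sketch of the kind of metric collection described (all names hypothetical):

```python
import csv
import time

def record_run(path: str, model: str, num_tokens: int, started: float) -> None:
    # Append one tokens/sec sample per benchmark run; a plot plugin can
    # then graph the trend across builds from this CSV.
    tps = num_tokens / (time.perf_counter() - started)
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([model, num_tokens, f"{tps:.1f}"])
```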
The _supports_quant_scheme() check restricted FP8 block-scale
(kFp8Static128BlockSym) to exact SM90 only. SM121 (GB10 DGX Spark)
has working FP8 block-scale support via FlashInfer v0.6.3 native
group_gemm_fp8_nt_groupwise path. Add is_device_capability_family(120)
to the FP8 block-scale check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
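
A sketch of the widened check (function name hypothetical; the FlashInfer path is the one named in the commit):

```python
def supports_fp8_block_scale(major: int, minor: int) -> bool:
    # Previously exact SM90 only; SM12x (GB10) also works via
    # FlashInfer's group_gemm_fp8_nt_groupwise path.
    return (major, minor) == (9, 0) or major == 12
```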
Separate FP4-specific utility checks (fp4_quantize,
nvfp4_block_scale_interleave) into has_flashinfer_nvfp4() so that
has_flashinfer_cutlass_fused_moe() only checks for the core CUTLASS
MoE entry point. This allows FP8 CUTLASS MoE to work on SM121 (GB10)
which has cutlass_fused_moe but lacks FP4 utilities.

Also gate the nvfp4 quant scheme on has_flashinfer_nvfp4() in
FlashInferExperts._supports_quant_scheme().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: seli-equinix <seli@equinix.com>
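
A sketch of the split described above (module paths assumed; the attribute names are the ones in the commit message):

```python
import importlib

def _has(module: str, attr: str) -> bool:
    try:
        return hasattr(importlib.import_module(module), attr)
    except ImportError:
        return False

def has_flashinfer_cutlass_fused_moe() -> bool:
    # Core CUTLASS MoE entry point only; enough for FP8 MoE on SM121.
    return _has("flashinfer.fused_moe", "cutlass_fused_moe")

def has_flashinfer_nvfp4() -> bool:
    # FP4-specific utilities, gated separately since GB10 lacks them.
    return _has("flashinfer", "fp4_quantize") and _has(
        "flashinfer", "nvfp4_block_scale_interleave"
    )
```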
GraphPickler raises 'Unexpected raw Node during pickling' when node.meta
contains raw Node objects in keys beyond the default filter list
(source_fn_stack, nn_module_stack, fwd_source_fn_stack). This occurs on
PyTorch nightly 2.11.0+ where additional FX passes inject Node refs into
metadata fields like from_node.

Fix: Walk all node.meta values recursively and strip any key whose value
tree contains a torch.fx.Node reference before calling GraphPickler.dumps().
Also add a fallback Node handler in the custom reducer_override as a
safety net for any references that slip through complex nested structures.

This fixes the AOT cache being written as 0 bytes on every startup,
which caused 'Ran out of input' errors on subsequent loads and forced
a full torch.compile recompilation (~11s) on every restart.

Signed-off-by: seli-equinix <seli@equinix.com>
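
The fix in sketch form, assuming a torch.fx GraphModule (helper names hypothetical):

```python
import torch.fx

def _contains_fx_node(value) -> bool:
    # Recursively look for raw torch.fx.Node references in a meta value.
    if isinstance(value, torch.fx.Node):
        return True
    if isinstance(value, dict):
        return any(_contains_fx_node(v) for v in value.values())
    if isinstance(value, (list, tuple, set)):
        return any(_contains_fx_node(v) for v in value)
    return False

def strip_node_meta(gm: torch.fx.GraphModule) -> None:
    # Drop any meta key whose value tree holds a Node, so GraphPickler
    # never encounters a raw Node outside the graph itself.
    for node in gm.graph.nodes:
        for key in [k for k, v in node.meta.items() if _contains_fx_node(v)]:
            del node.meta[key]
```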
@mergify

mergify bot commented Mar 11, 2026

Hi @seli-equinix, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@mergify

mergify bot commented Mar 12, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @seli-equinix.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


Labels

ci/build, documentation, frontend, needs-rebase, nvidia, qwen, v1
