
[ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup)#34302

Merged
robertgshaw2-redhat merged 25 commits into main from
use-sgl-gate-for-fp32-router-logits
Feb 23, 2026

Conversation

@robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat commented Feb 11, 2026

Port the optimized router GEMM kernel from sglang's sgl-kernel for DeepSeek V3 MoE models. This kernel is specifically optimized for the small batch sizes (1-16 tokens) common in the decode phase.

Key features:

  • Computes output = mat_a @ mat_b.T for MoE routing
  • Supports bfloat16 input with float32 or bfloat16 output (router logits use fp32)
  • Optimized for DSV3 dimensions: hidden_dim=7168, num_experts={256,384}
  • Requires SM90+ (Hopper) GPUs and CUDA 12.0+
  • Supports Programmatic Dependent Launch (PDL) via TRTLLM_ENABLE_PDL=1

Original kernel adapted from TensorRT-LLM's dsv3RouterGemm implementation.
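Functionally, the kernel is a plain dense GEMM against the transposed router weight matrix. A minimal NumPy sketch of the semantics (shapes follow the DSV3 dimensions above; the small batch is illustrative):

```python
import numpy as np

# Small-batch decode shapes for DSV3 routing (num_tokens is illustrative)
num_tokens, hidden_dim, num_experts = 4, 7168, 256

rng = np.random.default_rng(0)
mat_a = rng.standard_normal((num_tokens, hidden_dim), dtype=np.float32)   # token hidden states
mat_b = rng.standard_normal((num_experts, hidden_dim), dtype=np.float32)  # router weight, one row per expert

# The kernel computes router logits as output = mat_a @ mat_b.T,
# accumulating in fp32 even when inputs are bfloat16.
logits = mat_a @ mat_b.T
```

The real kernel reads bfloat16 inputs and assigns one thread block per expert column, but the result it produces is exactly this matrix product.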

5.5% E2E Speedup for Batch 1 Decode.

Purpose

Test Plan

eval:
	lm_eval \
		--model local-completions \
		--tasks gsm8k \
		--model_args "model={{MODEL}},base_url=http://localhost:{{PORT}}/v1/completions,num_concurrent=10,tokenized_requests=False" --limit 100

^ run with concurrency 10 to hit the low-batch-size regime

benchmark:
	vllm bench serve \
		--port {{PORT}} \
		--model {{MODEL}} \
		--dataset-name random \
		--input-len 2 \
		--output-len 100 \
		--max-concurrency 1 \
		--num-prompts 10 \
		--seed $(date +%s) \
		--temperature 0.0

Test Result

  • pr accuracy
local-completions ({'model': 'nvidia/DeepSeek-V3.1-NVFP4', 'base_url': 'http://localhost:8001/v1/completions', 'num_concurrent': 10, 'tokenized_requests': False}), gen_kwargs: ({}), limit: 100.0, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|| 0.97|±  |0.0171|
|     |       |strict-match    |     5|exact_match|| 0.97|±  |0.0171|
  • main
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  9.13      
Total input tokens:                      10        
Total generated tokens:                  1000      
Request throughput (req/s):              1.10      
Output token throughput (tok/s):         109.58    
Peak output token throughput (tok/s):    111.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          110.67    
---------------Time to First Token----------------
Mean TTFT (ms):                          21.32     
Median TTFT (ms):                        18.55     
P99 TTFT (ms):                           43.17     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.00      
Median TPOT (ms):                        9.00      
P99 TPOT (ms):                           9.02      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.00      
Median ITL (ms):                         8.99      
P99 ITL (ms):                            9.34      
==================================================
  • pr
============ Serving Benchmark Result ============
Successful requests:                     10        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  8.65      
Total input tokens:                      10        
Total generated tokens:                  1000      
Request throughput (req/s):              1.16      
Output token throughput (tok/s):         115.65    
Peak output token throughput (tok/s):    117.00    
Peak concurrent requests:                3.00      
Total token throughput (tok/s):          116.80    
---------------Time to First Token----------------
Mean TTFT (ms):                          22.68     
Median TTFT (ms):                        18.25     
P99 TTFT (ms):                           58.48     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.50      
Median TPOT (ms):                        8.50      
P99 TPOT (ms):                           8.53      
---------------Inter-token Latency----------------
Mean ITL (ms):                           8.50      
Median ITL (ms):                         8.50      
P99 ITL (ms):                            8.88      
==================================================
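The headline number follows directly from the two runs above; a quick check of the arithmetic:

```python
# Numbers taken from the serving benchmark results above (batch-1 decode)
tok_per_s_main, tok_per_s_pr = 109.58, 115.65   # output token throughput
tpot_main, tpot_pr = 9.00, 8.50                 # mean ms per output token

throughput_gain = tok_per_s_pr / tok_per_s_main - 1.0   # ~5.5%
tpot_gain = (tpot_main - tpot_pr) / tpot_main           # ~5.6%
```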
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Port the optimized router GEMM kernel from sglang's sgl-kernel for
DeepSeek V3 MoE models. This kernel is specifically optimized for
small batch sizes (1-16 tokens) common in decode phase.

Key features:
- Computes output = mat_a @ mat_b.T for MoE routing
- Supports bfloat16 input with float32 or bfloat16 output
- Optimized for DSV3 dimensions: hidden_dim=7168, num_experts={256,384}
- Requires SM90+ (Hopper) GPUs and CUDA 12.0+
- Supports Programmatic Dependent Launch (PDL) via TRTLLM_ENABLE_PDL=1

Original kernel adapted from TensorRT-LLM's dsv3RouterGemm implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Signed-off-by: Robert Shaw <robshaw@redhat.com>
@mergify mergify bot added ci/build deepseek Related to DeepSeek models labels Feb 11, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request ports an optimized router GEMM kernel for DeepSeek V3 MoE models from sglang. The changes include the CUDA kernel implementation, build system integration, and PyTorch bindings. The kernel is highly specialized for specific model configurations and hardware (SM90+). My review focuses on the new CUDA kernel implementation. I've identified significant code duplication between the float32 and bfloat16 output kernels, which should be refactored for better maintainability. Additionally, there are missing error checks for CUDA API calls, which could lead to unhandled runtime errors.

Comment on lines +42 to +50
inline int getSMVersion() {
  int device{-1};
  cudaGetDevice(&device);
  int sm_major = 0;
  int sm_minor = 0;
  cudaDeviceGetAttribute(&sm_major, cudaDevAttrComputeCapabilityMajor, device);
  cudaDeviceGetAttribute(&sm_minor, cudaDevAttrComputeCapabilityMinor, device);
  return sm_major * 10 + sm_minor;
}
Contributor


high

The CUDA API calls cudaGetDevice and cudaDeviceGetAttribute can return errors, but their return values are not being checked. This could lead to silent failures or undefined behavior if an error occurs (e.g., no CUDA device is available). It's important to handle these potential errors by checking the cudaError_t return value.

For example:

cudaError_t err = cudaGetDevice(&device);
if (err != cudaSuccess) {
    // Handle error
}

Alternatively, use an error-checking macro, a common practice in CUDA projects to reduce boilerplate. Other parts of the vLLM codebase use error-checking macros for CUDA calls; that practice should be followed here for consistency and robustness.

Comment on lines +79 to +270
template <typename T, int kBlockSize, int VPT, int kNumTokens, int kNumExperts,
          int kHiddenDim>
__global__ __launch_bounds__(128, 1) void router_gemm_kernel_float_output(
    float* out, T const* mat_a, T const* mat_b) {
  // Each block handles one expert column
  int const n_idx = blockIdx.x;
  int const tid = threadIdx.x;
  constexpr int kWarpSize = 32;
  constexpr int kNumWarps = kBlockSize / kWarpSize;
  constexpr int k_elems_per_k_iteration = VPT * kBlockSize;
  constexpr int k_iterations = kHiddenDim / k_elems_per_k_iteration;

  // Initialize accumulators for all M rows
  float acc[kNumTokens] = {};

  // Shared memory for warp-level reduction
  __shared__ float sm_reduction[kNumTokens][kNumWarps];

  // B matrix is in column-major order
  T const* b_col = mat_b + n_idx * kHiddenDim;

  // Pre-compute k_base values
  int k_bases[k_iterations];
#pragma unroll
  for (int ki = 0; ki < k_iterations; ki++) {
    k_bases[ki] = ki * k_elems_per_k_iteration + tid * VPT;
  }

#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  asm volatile("griddepcontrol.wait;");
#endif

  // Process the GEMM in chunks
  for (int ki = 0; ki < k_iterations; ki++) {
    int const k_base = k_bases[ki];

    // Load B matrix values using vector load
    uint4 b_vec = *reinterpret_cast<uint4 const*>(b_col + k_base);

    // Convert B values to float
    float b_float[VPT];
    bf16_uint4_to_float8<VPT>(b_vec, b_float);

#pragma unroll
    for (int m_idx = 0; m_idx < kNumTokens; m_idx++) {
      uint4 a_vec = *reinterpret_cast<uint4 const*>(
          mat_a + (m_idx * kHiddenDim) + k_base);

      float a_float[VPT];
      bf16_uint4_to_float8<VPT>(a_vec, a_float);

#pragma unroll
      for (int k = 0; k < VPT; k++) {
        acc[m_idx] += a_float[k] * b_float[k];
      }
    }
  }

  // Warp-level reduction
  int const warpId = tid / 32;
  int const laneId = tid % 32;

  float warp_result[kNumTokens];
#pragma unroll
  for (int m_idx = 0; m_idx < kNumTokens; m_idx++) {
    warp_result[m_idx] = acc[m_idx];
  }

#pragma unroll
  for (int m = 0; m < kNumTokens; m++) {
    float sum = warp_result[m];
    sum += __shfl_xor_sync(0xffffffff, sum, 16);
    sum += __shfl_xor_sync(0xffffffff, sum, 8);
    sum += __shfl_xor_sync(0xffffffff, sum, 4);
    sum += __shfl_xor_sync(0xffffffff, sum, 2);
    sum += __shfl_xor_sync(0xffffffff, sum, 1);

    if (laneId == 0) {
      sm_reduction[m][warpId] = sum;
    }
  }

  __syncthreads();

  // Final reduction across warps
  if (tid == 0) {
#pragma unroll
    for (int m = 0; m < kNumTokens; m++) {
      float final_sum = 0.0f;
#pragma unroll
      for (int w = 0; w < kNumWarps; w++) {
        final_sum += sm_reduction[m][w];
      }
      out[m * kNumExperts + n_idx] = final_sum;
    }
  }

#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  asm volatile("griddepcontrol.launch_dependents;");
#endif
}

// Router GEMM kernel with bfloat16 output
template <typename T, int kBlockSize, int VPT, int kNumTokens, int kNumExperts,
          int kHiddenDim>
__global__ __launch_bounds__(128, 1) void router_gemm_kernel_bf16_output(
    __nv_bfloat16* out, T const* mat_a, T const* mat_b) {
  int const n_idx = blockIdx.x;
  int const tid = threadIdx.x;
  constexpr int kWarpSize = 32;
  constexpr int kNumWarps = kBlockSize / kWarpSize;
  constexpr int k_elems_per_k_iteration = VPT * kBlockSize;
  constexpr int k_iterations = kHiddenDim / k_elems_per_k_iteration;

  float acc[kNumTokens] = {};
  __shared__ float sm_reduction[kNumTokens][kNumWarps];

  T const* b_col = mat_b + n_idx * kHiddenDim;

  int k_bases[k_iterations];
#pragma unroll
  for (int ki = 0; ki < k_iterations; ki++) {
    k_bases[ki] = ki * k_elems_per_k_iteration + tid * VPT;
  }

#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  asm volatile("griddepcontrol.wait;");
#endif

  for (int ki = 0; ki < k_iterations; ki++) {
    int const k_base = k_bases[ki];
    uint4 b_vec = *reinterpret_cast<uint4 const*>(b_col + k_base);

    float b_float[VPT];
    bf16_uint4_to_float8<VPT>(b_vec, b_float);

#pragma unroll
    for (int m_idx = 0; m_idx < kNumTokens; m_idx++) {
      uint4 a_vec = *reinterpret_cast<uint4 const*>(
          mat_a + (m_idx * kHiddenDim) + k_base);

      float a_float[VPT];
      bf16_uint4_to_float8<VPT>(a_vec, a_float);

#pragma unroll
      for (int k = 0; k < VPT; k++) {
        acc[m_idx] += a_float[k] * b_float[k];
      }
    }
  }

  int const warpId = tid / 32;
  int const laneId = tid % 32;

  float warp_result[kNumTokens];
#pragma unroll
  for (int m_idx = 0; m_idx < kNumTokens; m_idx++) {
    warp_result[m_idx] = acc[m_idx];
  }

#pragma unroll
  for (int m = 0; m < kNumTokens; m++) {
    float sum = warp_result[m];
    sum += __shfl_xor_sync(0xffffffff, sum, 16);
    sum += __shfl_xor_sync(0xffffffff, sum, 8);
    sum += __shfl_xor_sync(0xffffffff, sum, 4);
    sum += __shfl_xor_sync(0xffffffff, sum, 2);
    sum += __shfl_xor_sync(0xffffffff, sum, 1);

    if (laneId == 0) {
      sm_reduction[m][warpId] = sum;
    }
  }

  __syncthreads();

  if (tid == 0) {
#pragma unroll
    for (int m = 0; m < kNumTokens; m++) {
      float final_sum = 0.0f;
#pragma unroll
      for (int w = 0; w < kNumWarps; w++) {
        final_sum += sm_reduction[m][w];
      }
      out[m * kNumExperts + n_idx] = __float2bfloat16(final_sum);
    }
  }

#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900))
  asm volatile("griddepcontrol.launch_dependents;");
#endif
}
Contributor


high

The kernels router_gemm_kernel_float_output and router_gemm_kernel_bf16_output are nearly identical, with the only difference being the output data type and the final store operation. This large amount of duplicated code increases maintenance overhead and the risk of introducing inconsistencies.

To improve maintainability, these two kernels should be refactored into a single templated kernel. You can introduce a helper struct OutputWriter templated on the output type to handle the final store operation.

Here's a sketch of the proposed refactoring:

template <typename T_out>
struct OutputWriter;

template <>
struct OutputWriter<float> {
  __device__ __forceinline__ static void write(float* out, int index,
                                               float value) {
    out[index] = value;
  }
};

template <>
struct OutputWriter<__nv_bfloat16> {
  __device__ __forceinline__ static void write(__nv_bfloat16* out, int index,
                                               float value) {
    out[index] = __float2bfloat16(value);
  }
};

template <typename T, typename T_out, int kBlockSize, int VPT, int kNumTokens,
          int kNumExperts, int kHiddenDim>
__global__ __launch_bounds__(128, 1) void router_gemm_kernel(
    T_out* out, T const* mat_a, T const* mat_b) {
  // ... common kernel logic ...

  // In the final reduction section
  if (tid == 0) {
#pragma unroll
    for (int m = 0; m < kNumTokens; m++) {
      float final_sum = 0.0f;
#pragma unroll
      for (int w = 0; w < kNumWarps; w++) {
        final_sum += sm_reduction[m][w];
      }
      OutputWriter<T_out>::write(out, m * kNumExperts + n_idx, final_sum);
    }
  }

  // ... rest of common kernel logic ...
}

Then, invokeRouterGemmFloatOutput and invokeRouterGemmBf16Output can call this unified router_gemm_kernel with the appropriate output type (float or __nv_bfloat16). This will eliminate about 100 lines of redundant code.
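Separately from the refactor, the `__shfl_xor_sync` sequence shared by both kernels is a standard butterfly reduction across the 32 lanes of a warp. A minimal Python model of the same access pattern (list entries stand in for per-lane partial sums):

```python
def butterfly_reduce(lanes):
    # Model of the warp-level __shfl_xor_sync reduction: at each step,
    # every lane adds the value held by the lane whose index differs in
    # exactly one bit. After log2(32) steps all lanes hold the full sum.
    lanes = list(lanes)
    width = len(lanes)
    for offset in (16, 8, 4, 2, 1):
        lanes = [lanes[i] + lanes[i ^ offset] for i in range(width)]
    return lanes
```

In the kernel only lane 0 writes the result to shared memory, but the butterfly leaves the sum in every lane.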

@mgoin
Member

mgoin commented Feb 11, 2026

Can we use the router gemm interface already present in flashinfer?

Robert Shaw added 8 commits February 10, 2026 21:42
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@robertgshaw2-redhat robertgshaw2-redhat changed the title [MoE] Add DeepSeek V3 router GEMM kernel from sglang [MoE] Add TRTLLM DeepSeek V3 router GEMM kernel Feb 11, 2026
@robertgshaw2-redhat robertgshaw2-redhat changed the title [MoE] Add TRTLLM DeepSeek V3 router GEMM kernel [MoE] Add TRTLLM DeepSeek V3 router GEMM kernel (5.5% B1 Speedup) Feb 11, 2026
@robertgshaw2-redhat robertgshaw2-redhat changed the title [MoE] Add TRTLLM DeepSeek V3 router GEMM kernel (5.5% B1 Speedup) [MoE] Add TRTLLM DSV3 Router GEMM kernel (5.5% B1 Speedup) Feb 11, 2026
@robertgshaw2-redhat robertgshaw2-redhat changed the title [MoE] Add TRTLLM DSV3 Router GEMM kernel (5.5% B1 Speedup) [ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (5.5% B1 Speedup) Feb 11, 2026
Collaborator

@LucasWilkinson LucasWilkinson left a comment


Nice! LGTM, thanks!


@robertgshaw2-redhat
Collaborator Author

Can we use the router gemm interface already present in flashinfer?

cc @pavanimajety - looks like these don't support SM90. Any idea of the plan here?

@robertgshaw2-redhat robertgshaw2-redhat marked this pull request as draft February 11, 2026 04:25
@robertgshaw2-redhat
Collaborator Author

TODO:

  • conditions for when to deploy [expert size, etc]
  • see if we should use fp32 or bf16 for non-trtllm

CMakeLists.txt Outdated
endif()

# DeepSeek V3 router GEMM kernel - requires SM90+
cuda_archs_loose_intersection(DSV3_ROUTER_GEMM_ARCHS "9.0a;10.0a" "${CUDA_ARCHS}")
Member


This isn't compatible with CUDA 13 and is missing Blackwell Ultra; it should be something like:

  if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 13.0)
    cuda_archs_loose_intersection(DSV3_ROUTER_GEMM_ARCHS "9.0a;10.0f;11.0f" "${CUDA_ARCHS}")
  else()
    cuda_archs_loose_intersection(DSV3_ROUTER_GEMM_ARCHS "9.0a;10.0a;10.1a;10.3a" "${CUDA_ARCHS}")
  endif()

def _set_allow_dsv3_router_gemm(self) -> None:
    self.allow_dsv3_router_gemm = (
        current_platform.is_cuda()
        and current_platform.has_device_capability((9, 0))
Member


This should be current_platform.is_device_capability(90) and current_platform.is_device_capability_family(100) since we aren't supporting sm120

Member


It also looks like you need to check against supported n_experts since I only see instantiations for 256 or 384 experts
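A minimal sketch of the kind of guard being asked for here. This is illustrative only, not the actual vLLM code; the function and constant names are hypothetical, while the supported shapes (hidden_dim=7168, 256 or 384 experts) come from the PR description:

```python
# Only these configurations have kernel instantiations in this PR.
SUPPORTED_NUM_EXPERTS = {256, 384}
SUPPORTED_HIDDEN_DIM = 7168

def can_use_dsv3_router_gemm(num_experts: int, hidden_dim: int) -> bool:
    # Fall back to the generic routing GEMM path unless the
    # specialized kernel was actually compiled for these dims.
    return (num_experts in SUPPORTED_NUM_EXPERTS
            and hidden_dim == SUPPORTED_HIDDEN_DIM)
```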

"output must be float32 or bf16");

auto const sm = getSMVersion();
TORCH_CHECK(sm >= 90, "required CUDA ARCH >= SM_90");
Member


Do you know if this would work on SM120? Better to be explicit if we don't know

@robertgshaw2-redhat robertgshaw2-redhat changed the title [ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (5.5% B1 Speedup) [ModelBash][DSV3] Add TRTLLM DSV3 Router GEMM kernel (6% B1 Speedup) Feb 11, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@pavanimajety
Collaborator

Hey all, FYI - there's a flashinfer PR ready that removes the restriction for non SM100 in case we want to switch to the flashinfer implementation - flashinfer-ai/flashinfer#2576

@xinli-sw
Contributor

I think we still expect these kernels to improve and evolve (new HW archs) in FI, so it would be great to consider invoking them directly with FlashInfer (perhaps with the 0.6.5 update). Not blocking for this PR though; I'll keep track.

@robertgshaw2-redhat
Collaborator Author

@xinli-sw - sounds good. We can add it once flashinfer hits 0.6.5.

@robertgshaw2-redhat robertgshaw2-redhat merged commit 8435b2e into main Feb 23, 2026
116 checks passed
@robertgshaw2-redhat robertgshaw2-redhat deleted the use-sgl-gate-for-fp32-router-logits branch February 23, 2026 14:02
@stavinsky

This commit somehow breaks model loading on Spark.
I'm not sure if it matters, since NVFP4 is broken on Spark in any case, but I thought I should share this.

@robertgshaw2-redhat

thanks

the log
(vllm_source) dev@spark-476c:~/dev/vllm_source$ VLLM_USE_FLASHINFER_MOE_FP4=1 VLLM_FLASHINFER_MOE_BACKEND=throughput  vllm serve  --host 0.0.0.0 --gpu-memory-utilization 0.4 --load-format fastsafetensors --max-num-seqs 1 --kv-cache-dtype fp8 nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4
(APIServer pid=23205) INFO 02-23 19:09:18 [utils.py:293]
(APIServer pid=23205) INFO 02-23 19:09:18 [utils.py:293]        █     █     █▄   ▄█
(APIServer pid=23205) INFO 02-23 19:09:18 [utils.py:293]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.16.0rc2.dev397+g8435b2e04.d20260223
(APIServer pid=23205) INFO 02-23 19:09:18 [utils.py:293]   █▄█▀ █     █     █     █  model   nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4
(APIServer pid=23205) INFO 02-23 19:09:18 [utils.py:293]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=23205) INFO 02-23 19:09:18 [utils.py:293]
(APIServer pid=23205) INFO 02-23 19:09:18 [utils.py:229] non-default args: {'model_tag': 'nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4', 'host': '0.0.0.0', 'model': 'nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4', 'load_format': 'fastsafetensors', 'gpu_memory_utilization': 0.4, 'kv_cache_dtype': 'fp8', 'max_num_seqs': 1}
(APIServer pid=23205) INFO 02-23 19:09:20 [model.py:532] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=23205) INFO 02-23 19:09:20 [model.py:1556] Using max model len 262144
(APIServer pid=23205) INFO 02-23 19:09:20 [cache.py:225] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=23205) INFO 02-23 19:09:20 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=23205) /home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:435: UserWarning:
(APIServer pid=23205)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(APIServer pid=23205)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(APIServer pid=23205)     (8.0) - (12.0)
(APIServer pid=23205)
(APIServer pid=23205)   queued_call()
(APIServer pid=23205) INFO 02-23 19:09:20 [config.py:500] Setting attention block size to 1072 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=23205) WARNING 02-23 19:09:20 [modelopt.py:1011] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=23205) INFO 02-23 19:09:20 [vllm.py:697] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:25 [core.py:98] Initializing a V1 LLM engine (v0.16.0rc2.dev397+g8435b2e04.d20260223) with config: model='nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4', speculative_config=None, tokenizer='nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=fastsafetensors, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 2, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=23247) /home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:435: UserWarning:
(EngineCore_DP0 pid=23247)     Found GPU0 NVIDIA GB10 which is of cuda capability 12.1.
(EngineCore_DP0 pid=23247)     Minimum and Maximum cuda capability supported by this version of PyTorch is
(EngineCore_DP0 pid=23247)     (8.0) - (12.0)
(EngineCore_DP0 pid=23247)
(EngineCore_DP0 pid=23247)   queued_call()
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:25 [parallel_state.py:1307] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.88.12:38557 backend=nccl
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:25 [parallel_state.py:1535] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:25 [gpu_model_runner.py:4139] Starting to load model nvidia/Qwen3-Next-80B-A3B-Instruct-NVFP4...
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:26 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:26 [nvfp4.py:169] Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN'].
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:26 [cuda.py:402] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
Loading safetensors using Fastsafetensor loader:   0% Completed | 0/11 [00:00<?, ?it/s]
(EngineCore_DP0 pid=23247) /home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/fastsafetensors/copier/gds.py:185: UserWarning: GDS is not supported in this platform but nogds is False. use nogds=True
(EngineCore_DP0 pid=23247)   warnings.warn(
Loading safetensors using Fastsafetensor loader:   9% Completed | 1/11 [00:02<00:27,  2.79s/it]
Loading safetensors using Fastsafetensor loader:  18% Completed | 2/11 [00:05<00:24,  2.69s/it]
Loading safetensors using Fastsafetensor loader:  27% Completed | 3/11 [00:08<00:21,  2.69s/it]
Loading safetensors using Fastsafetensor loader:  36% Completed | 4/11 [00:10<00:18,  2.62s/it]
Loading safetensors using Fastsafetensor loader:  45% Completed | 5/11 [00:13<00:15,  2.59s/it]
Loading safetensors using Fastsafetensor loader:  55% Completed | 6/11 [00:15<00:12,  2.59s/it]
Loading safetensors using Fastsafetensor loader:  64% Completed | 7/11 [00:18<00:10,  2.59s/it]
Loading safetensors using Fastsafetensor loader:  73% Completed | 8/11 [00:20<00:07,  2.60s/it]
Loading safetensors using Fastsafetensor loader:  82% Completed | 9/11 [00:23<00:05,  2.60s/it]
Loading safetensors using Fastsafetensor loader:  91% Completed | 10/11 [00:24<00:02,  2.13s/it]
Loading safetensors using Fastsafetensor loader: 100% Completed | 11/11 [00:24<00:00,  2.24s/it]
(EngineCore_DP0 pid=23247)
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:54 [default_loader.py:293] Loading weights took 24.69 seconds
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:54 [nvfp4.py:410] Using MoEPrepareAndFinalizeNoEP
(EngineCore_DP0 pid=23247) WARNING 02-23 19:09:54 [kv_cache.py:94] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(EngineCore_DP0 pid=23247) WARNING 02-23 19:09:54 [kv_cache.py:108] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(EngineCore_DP0 pid=23247) WARNING 02-23 19:09:54 [kv_cache.py:147] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
(EngineCore_DP0 pid=23247) INFO 02-23 19:09:55 [gpu_model_runner.py:4236] Model loading took 44.2 GiB memory and 28.832492 seconds
(EngineCore_DP0 pid=23247) INFO 02-23 19:10:00 [backends.py:916] Using cache directory: /home/dev/.cache/vllm/torch_compile_cache/f66b42c46f/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=23247) INFO 02-23 19:10:00 [backends.py:976] Dynamo bytecode transform time: 4.26 s
(EngineCore_DP0 pid=23247) INFO 02-23 19:10:00 [backends.py:350] Cache the graph of compile range (1, 2048) for later use
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029] EngineCore failed to start.
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029] Traceback (most recent call last):
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     super().__init__(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 248, in _initialize_kv_caches
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/executor/abstract.py", line 128, in determine_available_memory
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/worker/gpu_worker.py", line 371, in determine_available_memory
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     self.model_runner.profile_run()
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 5189, in profile_run
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 4887, in _dummy_run
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     outputs = self.model(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]               ^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/compilation/cuda_graph.py", line 222, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/models/qwen3_next.py", line 1376, in forward
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     hidden_states = self.model(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                     ^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/compilation/decorators.py", line 558, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     output = self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self.fn(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/models/qwen3_next.py", line 1133, in forward
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     def forward(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/compilation/caching.py", line 185, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     raise e
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "<eval_with_key>.103", line 451, in forward
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     submod_2 = self.submod_2(getitem_4, s72, getitem_3, l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_input_global_scale_inv_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_scale_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_alpha_, getitem_5, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, getitem_6, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_);  getitem_4 = getitem_3 = l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_input_global_scale_inv_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_scale_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_alpha_ = getitem_5 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = getitem_6 = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/compilation/cuda_graph.py", line 222, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/compilation/piecewise_backend.py", line 343, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return range_entry.runnable(*args)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self._compiled_fn(*args)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1148, in forward
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return compiled_fn(full_args)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1962, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self.compiled_fn(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357, in runtime_wrapper
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     out = normalize_as_list(f(args))
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                             ^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531, in wrapper
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return compiled_fn(runtime_args)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 638, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self.current_callable(inputs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3220, in run
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     out = model(new_inputs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]           ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/tmp/torchinductor_dev/qj/cqjmvk2t5jqhc5pzphte336xjj3n3giws6lsk3q7grp4bfmcatc2.py", line 1436, in call
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     buf15 = torch.ops.vllm.moe_forward_shared.default(buf12, buf13, buf14, 'from_forward_context')
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 819, in __call__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py", line 91, in _moe_forward_shared
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     return layer.runner.forward_impl(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py", line 705, in forward_impl
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     topk_weights, topk_ids = self.router.select_experts(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/base_router.py", line 235, in select_experts
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     topk_weights, topk_ids = self._compute_routing(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py", line 156, in _compute_routing
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     topk_weights, topk_ids, token_expert_indices = fused_topk(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                                                    ^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py", line 98, in fused_topk
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     topk_weights, topk_ids = topk_func(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]                              ^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py", line 24, in vllm_topk_softmax
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     ops.topk_softmax(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/vllm/_custom_ops.py", line 2216, in topk_softmax
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     torch.ops._moe_C.topk_softmax(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1319, in __getattr__
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029]     raise AttributeError(
(EngineCore_DP0 pid=23247) ERROR 02-23 19:10:01 [core.py:1029] AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
(EngineCore_DP0 pid=23247) Process EngineCore_DP0:
(EngineCore_DP0 pid=23247) Traceback (most recent call last):
(EngineCore_DP0 pid=23247)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=23247)     self.run()
(EngineCore_DP0 pid=23247)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=23247)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 1033, in run_engine_core
(EngineCore_DP0 pid=23247)     raise e
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 1019, in run_engine_core
(EngineCore_DP0 pid=23247)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=23247)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=23247)     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 763, in __init__
(EngineCore_DP0 pid=23247)     super().__init__(
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 114, in __init__
(EngineCore_DP0 pid=23247)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=23247)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=23247)     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core.py", line 248, in _initialize_kv_caches
(EngineCore_DP0 pid=23247)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=23247)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/executor/abstract.py", line 128, in determine_available_memory
(EngineCore_DP0 pid=23247)     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
(EngineCore_DP0 pid=23247)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=23247)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=23247)     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=23247)     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/worker/gpu_worker.py", line 371, in determine_available_memory
(EngineCore_DP0 pid=23247)     self.model_runner.profile_run()
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 5189, in profile_run
(EngineCore_DP0 pid=23247)     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=23247)                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=23247)     return func(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/v1/worker/gpu_model_runner.py", line 4887, in _dummy_run
(EngineCore_DP0 pid=23247)     outputs = self.model(
(EngineCore_DP0 pid=23247)               ^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/compilation/cuda_graph.py", line 222, in __call__
(EngineCore_DP0 pid=23247)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=23247)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=23247)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/models/qwen3_next.py", line 1376, in forward
(EngineCore_DP0 pid=23247)     hidden_states = self.model(
(EngineCore_DP0 pid=23247)                     ^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/compilation/decorators.py", line 558, in __call__
(EngineCore_DP0 pid=23247)     output = self.aot_compiled_fn(self, *args, **kwargs)
(EngineCore_DP0 pid=23247)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(EngineCore_DP0 pid=23247)     return self.fn(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/models/qwen3_next.py", line 1133, in forward
(EngineCore_DP0 pid=23247)     def forward(
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/compilation/caching.py", line 185, in __call__
(EngineCore_DP0 pid=23247)     return self.optimized_call(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(EngineCore_DP0 pid=23247)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(EngineCore_DP0 pid=23247)     raise e
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(EngineCore_DP0 pid=23247)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=23247)     return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=23247)     return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "<eval_with_key>.103", line 451, in forward
(EngineCore_DP0 pid=23247)     submod_2 = self.submod_2(getitem_4, s72, getitem_3, l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_input_global_scale_inv_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_scale_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_alpha_, getitem_5, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, getitem_6, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_);  getitem_4 = getitem_3 = l_self_modules_layers_modules_0_modules_linear_attn_modules_norm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_input_global_scale_inv_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_scale_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_out_proj_parameters_alpha_ = getitem_5 = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = getitem_6 = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_1_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None
(EngineCore_DP0 pid=23247)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/compilation/cuda_graph.py", line 222, in __call__
(EngineCore_DP0 pid=23247)     return self.runnable(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/compilation/piecewise_backend.py", line 343, in __call__
(EngineCore_DP0 pid=23247)     return range_entry.runnable(*args)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_inductor/standalone_compile.py", line 122, in __call__
(EngineCore_DP0 pid=23247)     return self._compiled_fn(*args)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore_DP0 pid=23247)     return fn(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1148, in forward
(EngineCore_DP0 pid=23247)     return compiled_fn(full_args)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1962, in __call__
(EngineCore_DP0 pid=23247)     return self.compiled_fn(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 357, in runtime_wrapper
(EngineCore_DP0 pid=23247)     all_outs = call_func_at_runtime_with_args(
(EngineCore_DP0 pid=23247)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 134, in call_func_at_runtime_with_args
(EngineCore_DP0 pid=23247)     out = normalize_as_list(f(args))
(EngineCore_DP0 pid=23247)                             ^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 531, in wrapper
(EngineCore_DP0 pid=23247)     return compiled_fn(runtime_args)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 638, in __call__
(EngineCore_DP0 pid=23247)     return self.current_callable(inputs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_inductor/utils.py", line 3220, in run
(EngineCore_DP0 pid=23247)     out = model(new_inputs)
(EngineCore_DP0 pid=23247)           ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/tmp/torchinductor_dev/qj/cqjmvk2t5jqhc5pzphte336xjj3n3giws6lsk3q7grp4bfmcatc2.py", line 1436, in call
(EngineCore_DP0 pid=23247)     buf15 = torch.ops.vllm.moe_forward_shared.default(buf12, buf13, buf14, 'from_forward_context')
(EngineCore_DP0 pid=23247)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 819, in __call__
(EngineCore_DP0 pid=23247)     return self._op(*args, **kwargs)
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py", line 91, in _moe_forward_shared
(EngineCore_DP0 pid=23247)     return layer.runner.forward_impl(
(EngineCore_DP0 pid=23247)            ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py", line 705, in forward_impl
(EngineCore_DP0 pid=23247)     topk_weights, topk_ids = self.router.select_experts(
(EngineCore_DP0 pid=23247)                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/base_router.py", line 235, in select_experts
(EngineCore_DP0 pid=23247)     topk_weights, topk_ids = self._compute_routing(
(EngineCore_DP0 pid=23247)                              ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py", line 156, in _compute_routing
(EngineCore_DP0 pid=23247)     topk_weights, topk_ids, token_expert_indices = fused_topk(
(EngineCore_DP0 pid=23247)                                                    ^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py", line 98, in fused_topk
(EngineCore_DP0 pid=23247)     topk_weights, topk_ids = topk_func(
(EngineCore_DP0 pid=23247)                              ^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/model_executor/layers/fused_moe/router/fused_topk_router.py", line 24, in vllm_topk_softmax
(EngineCore_DP0 pid=23247)     ops.topk_softmax(
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/vllm/_custom_ops.py", line 2216, in topk_softmax
(EngineCore_DP0 pid=23247)     torch.ops._moe_C.topk_softmax(
(EngineCore_DP0 pid=23247)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=23247)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/torch/_ops.py", line 1319, in __getattr__
(EngineCore_DP0 pid=23247)     raise AttributeError(
(EngineCore_DP0 pid=23247) AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
[rank0]:[W223 19:10:02.247586813 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=23205) Traceback (most recent call last):
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/.venv/bin/vllm", line 6, in <module>
(APIServer pid=23205)     sys.exit(main())
(APIServer pid=23205)              ^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=23205)     args.dispatch_function(args)
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=23205)     uvloop.run(run_server(args))
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=23205)     return __asyncio.run(
(APIServer pid=23205)            ^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=23205)     return runner.run(main)
(APIServer pid=23205)            ^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=23205)     return self._loop.run_until_complete(task)
(APIServer pid=23205)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=23205)     return await main
(APIServer pid=23205)            ^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=23205)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=23205)     async with build_async_engine_client(
(APIServer pid=23205)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=23205)     return await anext(self.gen)
(APIServer pid=23205)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=23205)     async with build_async_engine_client_from_engine_args(
(APIServer pid=23205)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=23205)     return await anext(self.gen)
(APIServer pid=23205)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=23205)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=23205)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=23205)     return cls(
(APIServer pid=23205)            ^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/v1/engine/async_llm.py", line 152, in __init__
(APIServer pid=23205)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=23205)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=23205)     return func(*args, **kwargs)
(APIServer pid=23205)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core_client.py", line 125, in make_async_mp_client
(APIServer pid=23205)     return AsyncMPClient(*client_args)
(APIServer pid=23205)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=23205)     return func(*args, **kwargs)
(APIServer pid=23205)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core_client.py", line 839, in __init__
(APIServer pid=23205)     super().__init__(
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/v1/engine/core_client.py", line 493, in __init__
(APIServer pid=23205)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=23205)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=23205)     next(self.gen)
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/v1/engine/utils.py", line 925, in launch_core_engines
(APIServer pid=23205)     wait_for_engine_startup(
(APIServer pid=23205)   File "/home/dev/dev/vllm_source/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup
(APIServer pid=23205)     raise RuntimeError(
(APIServer pid=23205) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

@eugr

eugr commented Feb 23, 2026

@robertgshaw2-redhat - this is a second DSV3-related PR that breaks vLLM on DGX Spark (and other sm12x). I believe you need to guard it properly.

@mgoin, @johnnynunez - FYI.

EDIT: the first PR was this one: #34758
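The kind of guard being requested might look like the following sketch. Everything here is an illustrative assumption, not vLLM's actual code: the helper name, the op name `dsv3_router_gemm`, and the fallback behavior are invented for the example.

```python
# Hedged sketch of an arch/availability guard for the SM90-only kernel.
# The helper and the op name "dsv3_router_gemm" are illustrative assumptions.

def can_use_dsv3_router_gemm() -> bool:
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    # The kernel requires Hopper (SM90+), and a single-arch build
    # (e.g. sm12x on DGX Spark) may not have compiled it at all,
    # so also verify the custom op actually exists in the extension.
    return major >= 9 and hasattr(torch.ops._moe_C, "dsv3_router_gemm")
```

Callers would fall back to the plain PyTorch router path whenever this returns False, so an sm12x build never reaches the missing op.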

@mgoin
Member

mgoin commented Feb 23, 2026

Thank you for reporting @eugr @stavinsky and sorry for the disruption. I should have a fix here #35123

@robertgshaw2-redhat
Collaborator Author

thanks, sorry for the issues

@stavinsky

always happy to help, guys

@eugr

eugr commented Feb 23, 2026

no problem, it's a big project with a very wide hardware support. Stuff happens.

@robertgshaw2-redhat
Collaborator Author

> no problem, it's a big project with a very wide hardware support. Stuff happens.

I have to say, I don't quite get why this did not break on SM89, where we run a lot of tests.

@mgoin
Member

mgoin commented Feb 24, 2026

@robertgshaw2-redhat It is because the CI image is built with a wide-ranging TORCH_CUDA_ARCH_LIST, basically including all source files and cases across CUDA arches. You would only run into this issue if you build for just your arch, e.g. TORCH_CUDA_ARCH_LIST=12.0, since you wouldn't build those source files.
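A toy model of that build behavior, purely for illustration (these are not vLLM's real build rules, and the grouping of ops into one source file is an assumption):

```python
# Toy model of arch-gated compilation: sources guarded for SM90 only get
# built when 9.0 appears in TORCH_CUDA_ARCH_LIST. Op names are illustrative.

def ops_available(torch_cuda_arch_list: str) -> set[str]:
    arches = {a.strip() for a in torch_cuda_arch_list.split(";")}
    ops = set()
    if "9.0" in arches:
        # Suppose the change that added the router GEMM also ended up gating
        # topk_softmax, so skipping the file drops both ops from _moe_C.
        ops |= {"dsv3_router_gemm", "topk_softmax"}
    return ops

# CI image: wide arch list, so every arch-gated source is compiled in
# and the missing-op bug never surfaces.
assert "topk_softmax" in ops_available("8.0;8.6;8.9;9.0;12.0")

# Single-arch local build (e.g. DGX Spark): the op is missing, which shows
# up at runtime as the AttributeError in the traceback above.
assert "topk_softmax" not in ops_available("12.0")
```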

llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
…llm-project#34302)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…llm-project#34302)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
…llm-project#34302)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…llm-project#34302)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…llm-project#34302)

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
Labels

- ci/build
- deepseek — Related to DeepSeek models
- performance — Performance-related issues
- ready — ONLY add when PR is ready to merge/full CI is needed

7 participants