fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels#33417
Conversation
Documentation preview: https://vllm--33417.org.readthedocs.build/en/33417/
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 7a32add to 9b71b86
@mgoin hopefully ok now.
YES! Confirmed working on the RTX 5090.
Hello all, I've just tested this on both the RTX 5090 and the RTX 6000 Pro Blackwell, but I am still facing an issue when running. The error I'm getting is:
Steps to reproduce:
Extend device capability checks to include the SM110 and SM120 GPU families, matching the approach used in flashinfer_cutlass_moe.py and cutlass_moe.py after PR vllm-project#33417. These files were not updated in vllm-project#33417 and still only checked for SM100:
- flashinfer_fp4_moe.py
- flashinfer_trtllm_moe.py
- flashinfer_cutedsl_moe.py
- flashinfer_utils.py
The fix adds explicit family checks for SM100/110/120 using any() for cleaner, more maintainable code, enabling support for:
- SM100-109: Blackwell data center (B100, B200)
- SM110-119: Future Blackwell variants
- SM120-129: Blackwell consumer/workstation (RTX 5090, DGX Spark GB10)
Tested on RTX 5090 (SM120) and DGX Spark GB10 (SM121) with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4
Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
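The any()-based family check described in this commit message could look roughly like the sketch below. The helper name `is_blackwell_family` and its standalone form are assumptions for illustration, not the actual vLLM code:

```python
# Illustrative sketch only: is_blackwell_family is a hypothetical helper
# mirroring the any()-based SM100/110/120 check described above.

def is_blackwell_family(capability: int) -> bool:
    """Return True if `capability` (e.g. 120 for SM12.0) falls in any
    Blackwell family: SM100-109 (data center), SM110-119 (future
    variants), or SM120-129 (consumer/workstation)."""
    blackwell_families = (100, 110, 120)
    # A "family" groups minor revisions: SM121 (DGX Spark GB10) belongs
    # to the 120 family because 121 // 10 * 10 == 120.
    return any(capability // 10 * 10 == family for family in blackwell_families)

print(is_blackwell_family(120))  # RTX 5090 (SM120) -> True
print(is_blackwell_family(121))  # DGX Spark GB10 (SM121) -> True
print(is_blackwell_family(90))   # Hopper (SM90) -> False
```

Grouping by family rather than enumerating exact capabilities is what lets SM121 devices pass without another code change.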
…oE kernels (vllm-project#33417)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Pai <416932041@qq.com>
gpt-oss-20b getting OOM on v0.15.1
Hi @renehonig @shahizat @mgoin, does your PR also fix the following when running gpt-oss-20b? And what about PR #31089?
If it helps anyone, I run the command below, with an added hack of setting float32 matmul precision to "high" in torch. I'm getting 12,000-15,000 tokens per second read (prefill) and 250-280 tokens per second generation on Nemotron 3 Nano 30B-A3B, with a max 131,000-token context at around 88% VRAM allocation on the RTX 5090. The swap space is just in case, but it does a fairly good job as long as I don't overdo it with the same repetitive inputs.
This is with the git package release of 0.16, which includes the NVFP4 stuff, so this works. Also, if anyone knows about the scaling factor for the KV cache: there seems to be not much I came across as to what it even does. It could be something or not; if anyone has advice there, this works as-is, but if I can improve it I'm all ears.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
FLASHINFER_DISABLE_VERSION_CHECK=1 \
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
--served-model-name nemotron \
--max-num-seqs 6 \
--tensor-parallel-size 1 \
--max-model-len 130000 \
--port 3337 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser nano_v3 \
--reasoning-parser-plugin nano_v3_reasoning_parser.py \
--mamba-ssm-cache-dtype float16 \
--kv-cache-dtype fp8 \
--quantization modelopt_fp4 \
--gpu-memory-utilization 0.875 \
--max-num-batched-tokens 8192 \
--swap-space 4
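The "matmul precision" hack mentioned above is, as far as I can tell, PyTorch's float32 matmul precision setting; whether this exact call is what the commenter used is an assumption:

```python
import torch

# Assumed form of the "matmul precision to high" tweak mentioned above:
# "high" lets float32 matmuls use TF32 tensor cores on Ampere-class and
# newer GPUs, trading a small amount of precision for throughput.
torch.set_float32_matmul_precision("high")

print(torch.get_float32_matmul_precision())  # "high"
```

This would need to run in the serving process (e.g. via a plugin or patched entry point) before the model executes, since it is a process-wide torch setting.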
Summary
This PR adds SM120 (RTX Blackwell) device capability family support to the NVFP4 MoE kernel backend selection code. The NVFP4 quantization kernels check for specific GPU architecture families, but currently only recognize SM9.0 (Hopper) and SM10.x (B100/B200 data center Blackwell), missing SM12.0 (RTX Blackwell workstation GPUs).
Problem
On RTX Blackwell GPUs (e.g., RTX PRO 6000 Blackwell Workstation Edition with compute capability 12.0), vLLM v0.15.0 crashes when loading MiniMax-M2.1-NVFP4 or other NVFP4 MoE models with:
Root Cause
The `is_device_capability_family(100)` check returns `False` for SM12.0 devices. This is a regression introduced in commit 42135d6 ([MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority #32414).
Solution
Add `or current_platform.is_device_capability_family(120)` checks alongside the existing SM100 family checks in all NVFP4 MoE kernel selection code.
Files Changed
- vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
- vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py
- vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
- vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
Testing
Tested on RTX PRO 6000 Blackwell Workstation Edition with MiniMax-M2.1-NVFP4 model - inference working successfully after fix.
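A minimal sketch of the gating change described under Solution. `current_platform.is_device_capability_family` is the real vLLM API named in this PR, but the stub class below is an assumption so the example runs without a GPU:

```python
# Stand-in for vllm.platforms.current_platform so the sketch runs
# anywhere; the real object queries the CUDA device. Names hypothetical.

class FakePlatform:
    def __init__(self, capability: int) -> None:
        self.capability = capability  # e.g. 120 for SM12.0

    def is_device_capability_family(self, family: int) -> bool:
        # SM120 and SM121 both belong to the 120 family.
        return self.capability // 10 * 10 == family


def nvfp4_moe_supported(platform: FakePlatform) -> bool:
    # Before this PR: only the SM100 family passed the check.
    # After this PR: the SM120 (RTX Blackwell) family passes as well.
    return platform.is_device_capability_family(
        100
    ) or platform.is_device_capability_family(120)


print(nvfp4_moe_supported(FakePlatform(100)))  # B200 -> True
print(nvfp4_moe_supported(FakePlatform(120)))  # RTX 5090 -> True
print(nvfp4_moe_supported(FakePlatform(90)))   # Hopper -> False
```

The same `or`-extension pattern applies in each of the files listed above.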
Related Issue
Fixes #33416
For Maintainers
This is a regression bugfix affecting NVFP4 MoE models on RTX Blackwell GPUs (SM12.0).
Please consider cherry-picking this to releases/v0.15.0 for inclusion in v0.15.1.