fix: Add SM120 (RTX Blackwell) support for FlashInfer CUTLASS NVFP4 MoE kernels#33417

Merged
vllm-bot merged 1 commit into vllm-project:main from renehonig:fix/sm120-rtx-blackwell-support
Jan 31, 2026
Conversation

@renehonig
Contributor

@renehonig renehonig commented Jan 30, 2026

Summary

This PR adds SM120 (RTX Blackwell) device-capability-family support to the NVFP4 MoE kernel backend selection code. The NVFP4 quantization kernels gate on specific GPU architecture families, but currently recognize only SM9.0 (Hopper) and SM10.x (B100/B200 data-center Blackwell), missing SM12.0 (RTX Blackwell workstation GPUs).

Problem

On RTX Blackwell GPUs (e.g., RTX PRO 6000 Blackwell Workstation Edition with compute capability 12.0), vLLM v0.15.0 crashes when loading MiniMax-M2.1-NVFP4 or other NVFP4 MoE models with:

RuntimeError: FlashInfer-CUTLASS MoE kernel does not support current device sm_120

Root Cause

The is_device_capability_family(100) check returns False for SM12.0 devices because:

  • SM12.0 → family = 120 // 10 = 12
  • SM10.x → family = 100 // 10 = 10
  • 12 != 10, so the check fails
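The failing comparison can be reproduced in a few lines. This is a minimal sketch; `capability_code` and `is_capability_family` are hypothetical stand-ins for vLLM's `current_platform` API, assuming the capability is encoded as `major * 10 + minor` as the bullets above describe:

```python
def capability_code(major: int, minor: int) -> int:
    # SM 12.0 -> 120, SM 10.x -> 10x, SM 9.0 -> 90
    return major * 10 + minor

def is_capability_family(code: int, family: int) -> bool:
    # Two capabilities are in the same family when their codes agree
    # after integer division by 10 (i.e., same major version).
    return code // 10 == family // 10

sm120 = capability_code(12, 0)           # 120
print(is_capability_family(sm120, 100))  # False: 12 != 10, so SM12.0 is rejected
print(is_capability_family(sm120, 120))  # True
```

Under this encoding, any SM10.x code (100-109) matches family 100, but SM12.0 (120) does not, which is exactly the rejection seen on RTX Blackwell.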

This is a regression introduced in commit 42135d6 ([MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority #32414).

Solution

Add `current_platform.is_device_capability_family(120)` checks, OR'd with the existing SM100 family checks, in all NVFP4 MoE kernel selection code.
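As a sketch of the pattern, the fixed selection predicate looks like the following (`FakePlatform` is a hypothetical stand-in for `current_platform`; the real checks live across several files):

```python
class FakePlatform:
    """Stand-in for vLLM's current_platform, holding a capability code (90, 100, 120, ...)."""

    def __init__(self, capability: int) -> None:
        self.capability = capability

    def is_device_capability_family(self, family: int) -> bool:
        return self.capability // 10 == family // 10

def nvfp4_moe_supported(platform: FakePlatform) -> bool:
    # Before the fix, only the SM100 family was accepted here;
    # the fix OR's in the SM120 (RTX Blackwell) family.
    return (platform.is_device_capability_family(100)
            or platform.is_device_capability_family(120))

print(nvfp4_moe_supported(FakePlatform(120)))  # True after the fix
print(nvfp4_moe_supported(FakePlatform(100)))  # True (unchanged)
```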

Files Changed

  • vllm/model_executor/layers/fused_moe/flashinfer_cutlass_moe.py
  • vllm/model_executor/layers/fused_moe/flashinfer_cutedsl_moe.py
  • vllm/model_executor/layers/fused_moe/flashinfer_trtllm_moe.py
  • vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py

Testing

Tested on an RTX PRO 6000 Blackwell Workstation Edition with the MiniMax-M2.1-NVFP4 model; inference works after the fix.

Related Issue

Fixes #33416


For Maintainers

This is a regression bugfix affecting NVFP4 MoE models on RTX Blackwell GPUs (SM12.0).
Please consider cherry-picking this to releases/v0.15.0 for inclusion in v0.15.1.

@mergify

mergify bot commented Jan 30, 2026

Documentation preview: https://vllm--33417.org.readthedocs.build/en/33417/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build frontend multi-modality Related to multi-modality (#4194) new-model Requests to new models nvidia rocm Related to AMD ROCm labels Jan 30, 2026
@mergify mergify bot added the v1 label Jan 30, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Jan 30, 2026
@mergify

mergify bot commented Jan 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @renehonig.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@renehonig renehonig force-pushed the fix/sm120-rtx-blackwell-support branch from 7a32add to 9b71b86 on January 31, 2026 07:45
@mergify mergify bot removed the needs-rebase label Jan 31, 2026
@renehonig
Contributor Author

> Ah sorry @renehonig can you fix the merge conflict?

@mgoin hopefully ok now.

@vllm-bot vllm-bot merged commit 0797811 into vllm-project:main Jan 31, 2026
44 of 54 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 31, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Jan 31, 2026
@kaigouthro

YES! Confirmed working on an RTX 5090.

@shahizat

shahizat commented Feb 1, 2026

Hello all,

I've just tested this on both the RTX 5090 and the RTX 6000 Pro Blackwell, but I am still facing an issue when running https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4.

The error I’m getting is:

(EngineCore_DP0 pid=189129) ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device.

Steps to reproduce:

uv venv .vllm --python 3.12
source .vllm/bin/activate

uv pip install --force-reinstall torch torchvision torchaudio triton --index-url https://download.pytorch.org/whl/cu130

export TORCH_CUDA_ARCH_LIST="12.0"
export CUDA_HOME=/usr/local/cuda-13
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
export PATH="${CUDA_HOME}/bin:$PATH"

git clone https://github.com/vllm-project/vllm.git
cd vllm 
python3 use_existing_torch.py 
uv pip install -r requirements/build.txt
MAX_JOBS=$(nproc) python3 setup.py bdist_wheel

uv pip install --no-deps dist/vllm*.whl
uv pip install -r requirements/common.txt

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4/resolve/main/nano_v3_reasoning_parser.py

VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=latency \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 \
  --served-model-name model \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8

cc @johnnynunez @mgoin

Code4me2 added a commit to Code4me2/vllm that referenced this pull request Feb 1, 2026
Extend device capability checks to include SM110 and SM120 GPU families,
matching the approach used in flashinfer_cutlass_moe.py and cutlass_moe.py
after PR vllm-project#33417.

These files were not updated in vllm-project#33417 and still only checked for SM100:
- flashinfer_fp4_moe.py
- flashinfer_trtllm_moe.py
- flashinfer_cutedsl_moe.py
- flashinfer_utils.py

The fix adds explicit family checks for SM100/110/120, enabling support for:
- SM100-109: Blackwell data center (B100, B200)
- SM110-119: Future Blackwell variants
- SM120-129: Blackwell consumer/workstation (RTX 5090, DGX Spark GB10)

Tested on RTX 5090 (SM120) and DGX Spark GB10 (SM121) with
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
Code4me2 added a commit to Code4me2/vllm that referenced this pull request Feb 1, 2026
Extend device capability checks to include SM110 and SM120 GPU families,
matching the approach used in flashinfer_cutlass_moe.py and cutlass_moe.py
after PR vllm-project#33417.

These files were not updated in vllm-project#33417 and still only checked for SM100:
- flashinfer_fp4_moe.py
- flashinfer_trtllm_moe.py
- flashinfer_cutedsl_moe.py
- flashinfer_utils.py

The fix adds explicit family checks for SM100/110/120 using any() for
cleaner, more maintainable code, enabling support for:
- SM100-109: Blackwell data center (B100, B200)
- SM110-119: Future Blackwell variants
- SM120-129: Blackwell consumer/workstation (RTX 5090, DGX Spark GB10)

Tested on RTX 5090 (SM120) and DGX Spark GB10 (SM121) with
nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4

Signed-off-by: code4me2 <velvetmoon222999@gmail.com>
khluu pushed a commit that referenced this pull request Feb 2, 2026
…oE kernels (#33417)

Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
(cherry picked from commit 0797811)
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…oE kernels (vllm-project#33417)

Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Pai <416932041@qq.com>
@rnik12

rnik12 commented Feb 7, 2026

gpt-oss-20b getting OOM on v0.15.1

(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 02-06 11:48:39 [loggers.py:257] Engine 000: Avg prompt throughput: 1856.6 tokens/s, Avg generation throughput: 44.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 02-06 11:48:49 [loggers.py:257] Engine 000: Avg prompt throughput: 1608.6 tokens/s, Avg generation throughput: 122.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.0%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 02-06 11:48:59 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 116.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(Worker_TP0 pid=53) Exception in thread WorkerAsyncOutputCopy:
(Worker_TP0 pid=53) Traceback (most recent call last):
(Worker_TP0 pid=53)   File "/usr/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
(Worker_TP0 pid=53)     self.run()
(Worker_TP0 pid=53)   File "/usr/lib/python3.12/threading.py", line 1012, in run
(Worker_TP0 pid=53)     self._target(*self._args, **self._kwargs)
(Worker_TP0 pid=53)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 832, in async_output_busy_loop
(Worker_TP0 pid=53)     self.enqueue_output(output)
(Worker_TP0 pid=53)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 809, in enqueue_output
(Worker_TP0 pid=53)     output = output.get_output()
(Worker_TP0 pid=53)              ^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=53)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 239, in get_output
(Worker_TP0 pid=53)     self.async_copy_ready_event.synchronize()
(Worker_TP0 pid=53) torch.AcceleratorError: CUDA error: unspecified launch failure
(Worker_TP0 pid=53) Search for `cudaErrorLaunchFailure' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker_TP0 pid=53) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker_TP0 pid=53) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker_TP0 pid=53) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(Worker_TP0 pid=53) 
[rank0]:[W206 11:49:01.060117336 CUDAGuardImpl.h:122] Warning: CUDA warning: unspecified launch failure (function destroyEvent)
terminate called after throwing an instance of 'c10::AcceleratorError'
  what():  CUDA error: unspecified launch failure
Search for `cudaErrorLaunchFailure' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7fdb5d165b80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11fb7 (0x7fdbd7766fb7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0xcacea8 (0x7fdb5de95ea8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xca7ed3 (0x7fdb5de90ed3 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xcafa05 (0x7fdb5de98a05 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x4827af (0x7fdbc92df7af in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x9 (0x7fdb5d142d69 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #7: <unknown function> + 0x7cb658 (0x7fdbc9628658 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x7cb9c5 (0x7fdbc96289c5 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #9: VLLM::Worker_TP0() [0x59e650]
frame #10: VLLM::Worker_TP0() [0x57f256]
frame #11: VLLM::Worker_TP0() [0x57e276]
frame #12: VLLM::Worker_TP0() [0x57e26f]
frame #13: VLLM::Worker_TP0() [0x57e26f]
frame #14: VLLM::Worker_TP0() [0x597457]
frame #15: VLLM::Worker_TP0() [0x59e4a6]
frame #16: _PyEval_EvalFrameDefault + 0x5102 (0x54ece2 in VLLM::Worker_TP0)
frame #17: VLLM::Worker_TP0() [0x599e7d]
frame #18: VLLM::Worker_TP0() [0x599a46]
frame #19: VLLM::Worker_TP0() [0x6a87f9]
frame #20: VLLM::Worker_TP0() [0x6a87a8]
frame #21: <unknown function> + 0x94ac3 (0x7fdbd836aac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: clone + 0x44 (0x7fdbd83fba84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

(EngineCore_DP0 pid=31) ERROR 02-06 11:49:05 [multiproc_executor.py:246] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.
(Worker_TP1 pid=54) INFO 02-06 11:49:05 [multiproc_executor.py:730] Parent process exited, terminating worker
(Worker_TP1 pid=54) INFO 02-06 11:49:05 [multiproc_executor.py:774] WorkerProc shutting down.
[rank1]:[W206 11:49:05.331421908 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=59, addr=[localhost]:38368, remote=[localhost]:35681): Connection reset by peer
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:679 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f6b75d7cb80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ffc5d1 (0x7f6b58b035d1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5ffda33 (0x7f6b58b04a33 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5ffe57a (0x7f6b58b0557a in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x31e (0x7f6b58b0029e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7f6afc673c88 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f6b6e6b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f6b76ae1ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7f6b76b72a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W206 11:49:05.344733168 ProcessGroupNCCL.cpp:1771] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Connection reset by peer
[rank1]:[W206 11:49:06.345031987 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=59, addr=[localhost]:38368, remote=[localhost]:35681): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f6b75d7cb80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ffc5d1 (0x7f6b58b035d1 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5ffce62 (0x7f6b58b03e62 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5ffe96e (0x7f6b58b0596e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x30e (0x7f6b58b0028e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x3c8 (0x7f6afc673c88 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7f6b6e6b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7f6b76ae1ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: clone + 0x44 (0x7f6b76b72a84 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W206 11:49:06.353301187 ProcessGroupNCCL.cpp:1771] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
(APIServer pid=1) INFO 02-06 11:49:09 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.2%, Prefix cache hit rate: 0.0%
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.15.1) with config: model='/root/.cache/huggingface/hub/models--openai--gpt-oss-20b/snapshots/6cee5e81ee83917806bbde320786a8fb61efebee', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--openai--gpt-oss-20b/snapshots/6cee5e81ee83917806bbde320786a8fb61efebee', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/root/.cache/huggingface/hub/models--openai--gpt-oss-20b/snapshots/6cee5e81ee83917806bbde320786a8fb61efebee, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 
'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024, 1040, 1056, 1072, 1088, 1104, 1120, 1136, 1152, 1168, 1184, 1200, 1216, 1232, 1248, 1264, 1280, 1296, 1312, 1328, 1344, 1360, 1376, 1392, 1408, 1424, 1440, 1456, 1472, 1488, 1504, 1520, 1536, 1552, 1568, 1584, 1600, 1616, 1632, 1648, 1664, 1680, 1696, 1712, 1728, 1744, 1760, 1776, 1792, 1808, 1824, 1840, 1856, 1872, 1888, 1904, 1920, 1936, 1952, 1968, 1984, 2000, 2016, 2032, 2048], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 2048, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None, 'static_all_moe_layers': []}, 
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-8f1b51351cffef8e-b5ae6e2d'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[([32281, 32282, 32283, 32284, 32285, 32286, 32287, 32288, 32289, 32290, 32291, 32292, 32293, 32294, 32295, 32296, 32297, 32298, 32299, 32300, 32301, 32302, 32303, 32304, 32305, 32306, 32307, 32308, 32309, 32310, 32311, 32312, 32313, 32314, 32315, 32316, 32317, 32318, 32319, 32320, 32321, 32322, 32323, 32324, 32325, 32326, 32327, 32328, 32329, 32330, 32331, 32332, 32333, 32334, 32335, 32336, 32337, 32338, 32339, 32340, 32341, 32342, 32343, 32344, 32345, 32346, 32347, 32348, 32349, 32350, 32351, 32352, 32353, 32354, 32355, 32356, 32357, 32358, 32359, 32360, 32361, 32362, 32363, 32364, 32365, 32366, 32367, 32368, 32369, 32370, 32371, 32372, 32373, 32374, 32375, 32376, 32377, 32378, 32379, 32380, 32381, 32382, 32383, 32384, 32385, 32386, 32387, 32388, 32389, 32390, 32391, 32392, 32393, 32394, 32395, 32396, 32397, 32398, 32399, 32400, 32401, 32402, 32403, 32404, 32405, 32406, 32407, 32408, 32409, 32410, 32411, 32412, 32413, 32414, 32415, 32416, 32417, 32418, 32419, 32420, 32421, 32422, 32423, 32424, 32425, 32426, 32427, 32428, 32429, 32430, 32431, 32432, 32433, 32434, 32435, 32436, 32437, 32438, 32439, 32440, 32441, 32442, 32443, 32444, 32445, 32446, 32447, 32448, 32449, 32450, 32451, 32452, 32453, 32454, 32455, 32456, 32457, 32458, 32459, 32460, 32461, 32462, 32463, 32464, 32465, 32466, 32467, 32468, 32469, 32470, 32471, 32472, 32473, 32474, 32475, 32476, 32477, 32478, 32479, 32480, 32481, 32482, 32483, 32484, 32485, 32486, 32487, 32488, 32489, 32490, 32491, 32492, 32493, 32494, 32495, 32496, 32497, 32498, 32499, 32500, 32501, 32502, 32503, 32504, 32505, 32506, 32507, 32508, 32509, 32510, 32511, 32512, 32513, 32514, 32515, 32516, 32517, 32518, 32519, 32520, 
32521, 32522, 32523, 32524, 32525, 32526, 32527, 32528, 32529, 32530, 32531, 32532, 32533, 32534, 32535, 32536, 32537, 32538, 32539, 32540, 32541, 32542, 32543, 32544, 32545, 32546, 32547, 32548, 32549, 32550, 32551, 32552, 32553, 32554, 32555, 32556, 32557, 32558, 32559, 32560, 32561, 32562, 32563, 32564, 32565, 32566, 32567, 32568, 32569, 32570, 32571, 32572, 32573, 32574, 32575, 32576, 32577, 32578, 32579, 32580, 32581, 32582, 32583, 32584, 32585, 32586, 32587, 32588, 32589, 32590, 32591, 32592, 32593, 32594, 32595, 32596, 32597, 32598, 32599, 32600, 32601, 32602, 32603, 32604, 32605, 32606, 32607, 32608, 32609, 32610, 32611, 32612, 32613, 32614, 32615, 32616, 32617, 32618, 32619, 32620, 32621, 32622, 32623, 32624, 32625, 32626, 32627, 32628, 32629, 32630, 32631, 32632, 32633, 32634, 32635, 32636, 32637, 32638, 32639, 32640, 32641, 32642, 32643, 32644, 32645, 32646, 32647, 32648, 32649, 32650, 32651, 32652, 32653, 32654, 32655, 32656, 32657, 32658, 32659, 32660, 32661, 32662, 32663, 32664, 32665, 32666, 32667, 32668, 32669, 32670, 32671, 32672, 32673, 32674, 32675, 32676, 32677, 32678, 32679, 32680, 32681, 32682, 32683, 32684, 32685, 32686, 32687, 32688, 32689, 32690, 32691, 32692, 32693, 32694, 32695, 32696, 32697, 32698, 32699, 32700, 32701, 32702, 32703, 32704, 32705, 32706, 32707, 32708, 32709, 32710, 32711, 32712, 32713, 32714, 32715, 32716, 32717, 32718, 32719, 32720, 32721, 32722, 32723, 32724, 32725, 32726, 32727, 32728, 32729, 32730, 32731, 32732, 32733, 32734, 32735, 32736, 32737, 32738, 32739, 32740, 32741, 32742, 32743, 32744, 32745, 32746, 32747, 32748, 32749, 32750, 32751, 32752, 32753, 32754, 32755, 32756, 32757, 32758, 32759, 32760, 32761, 32762, 32763, 32764, 32765, 32766, 32767, 32768, 32769, 32770, 32771, 32772, 32773, 32774, 32775, 32776, 32777, 32778, 32779, 32780, 32781, 32782, 32783, 32784, 32785, 32786, 32787, 32788, 32789, 32790, 32791, 32792], [32793, 32794, 32795, 32796, 32797, 32798, 32799, 32800, 32801, 32802, 32803, 32804, 32805, 
32806, 32807, 32808, 32809, 32810, 32811, 32812, 32813, 32814, 32815, 32816, 32817, 32818, 32819, 32820, 32821, 32822, 32823, 32824, 32825, 32826, 32827, 32828, 32829, 32830, 32831, 32832, 32833, 32834, 32835, 32836, 32837, 32838, 32839, 32840, 32841, 32842, 32843, 32844, 32845, 32846, 32847, 32848, 32849, 32850, 32851, 32852, 32853, 32854, 32855, 32856, 32857, 32858, 32859, 32860, 32861, 32862, 32863, 32864, 32865, 32866, 32867, 32868, 32869, 32870, 32871, 32872, 32873, 32874, 32875, 32876, 32877, 32878, 32879, 32880, 32881, 32882, 32883, 32884, 32885, 32886, 32887, 32888, 32889, 32890, 32891, 32892, 32893, 32894, 32895, 32896, 32897, 32898, 32899, 32900, 32901, 32902, 32903, 32904, 32905, 32906, 32907, 32908, 32909, 32910, 32911, 32912, 32913, 32914, 32915, 32916, 32917, 32918, 32919, 32920, 32921, 32922, 32923, 32924, 32925, 32926, 32927, 32928, 32929, 32930, 32931, 32932, 32933, 32934, 32935, 32936, 32937, 32938, 32939, 32940, 32941, 32942, 32943, 32944, 32945, 32946, 32947, 32948, 32949, 32950, 32951, 32952, 32953, 32954, 32955, 32956, 32957, 32958, 32959, 32960, 32961, 32962, 32963, 32964, 32965, 32966, 32967, 32968, 32969, 32970, 32971, 32972, 32973, 32974, 32975, 32976, 32977, 32978, 32979, 32980, 32981, 32982, 32983, 32984, 32985, 32986, 32987, 32988, 32989, 32990, 32991, 32992, 32993, 32994, 32995, 32996, 32997, 32998, 32999, 33000, 33001, 33002, 33003, 33004, 33005, 33006, 33007, 33008, 33009, 33010, 33011, 33012, 33013, 33014, 33015, 33016, 33017, 33018, 33019, 33020, 33021, 33022, 33023, 33024, 33025, 33026, 33027, 33028, 33029, 33030, 33031, 33032, 33033, 33034, 33035, 33036, 33037, 33038, 33039, 33040, 33041, 33042, 33043, 33044, 33045, 33046, 33047, 33048, 33049, 33050, 33051, 33052, 33053, 33054, 33055, 33056, 33057, 33058, 33059, 33060, 33061, 33062, 33063, 33064, 33065, 33066, 33067, 33068, 33069, 33070, 33071, 33072, 33073, 33074, 33075, 33076, 33077, 33078, 33079, 33080, 33081, 33082, 33083, 33084, 33085, 33086, 33087, 33088, 33089, 33090, 
33091, 33092, 33093, 33094, 33095, 33096, 33097, 33098, 33099, 33100, 33101, 33102, 33103, 33104, 33105, 33106, 33107, 33108, 33109, 33110, 33111, 33112, 33113, 33114, 33115, 33116, 33117, 33118, 33119, 33120, 33121, 33122, 33123, 33124, 33125, 33126, 33127, 33128, 33129, 33130, 33131, 33132, 33133, 33134, 33135, 33136, 33137, 33138, 33139, 33140, 33141, 33142, 33143, 33144, 33145, 33146, 33147, 33148, 33149, 33150, 33151, 33152, 33153, 33154, 33155, 33156, 33157, 33158, 33159, 33160, 33161, 33162, 33163, 33164, 33165, 33166, 33167, 33168, 33169, 33170, 33171, 33172, 33173, 33174, 33175, 33176, 33177, 33178, 33179, 33180, 33181, 33182, 33183, 33184, 33185, 33186, 33187, 33188, 33189, 33190, 33191, 33192, 33193, 33194, 33195, 33196, 33197, 33198, 33199, 33200, 33201, 33202, 33203, 33204, 33205, 33206, 33207, 33208, 33209, 33210, 33211, 33212, 33213, 33214, 33215, 33216, 33217, 33218, 33219, 33220, 33221, 33222, 33223, 33224, 33225, 33226, 33227, 33228, 33229, 33230, 33231, 33232, 33233, 33234, 33235, 33236, 33237, 33238, 33239, 33240, 33241, 33242, 33243, 33244, 33245, 33246, 33247, 33248, 33249, 33250, 33251, 33252, 33253, 33254, 33255, 33256, 33257, 33258, 33259, 33260, 33261, 33262, 33263, 33264, 33265, 33266, 33267, 33268, 33269, 33270, 33271, 33272, 33273, 33274, 33275, 33276, 33277, 33278, 33279, 33280, 33281, 33282, 33283, 33284, 33285, 33286, 33287, 33288, 33289, 33290, 33291, 33292, 33293, 33294, 33295, 33296, 33297, 33298, 33299, 33300, 33301, 33302, 33303, 33304])],num_computed_tokens=[8192],num_output_tokens=[0]), num_scheduled_tokens={chatcmpl-8f1b51351cffef8e-b5ae6e2d: 8192}, total_num_scheduled_tokens=8192, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.04009833378858241, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948] Traceback (most recent call last):
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 939, in run_engine_core
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 966, in run_busy_loop
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     self._process_engine_step()
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 999, in _process_engine_step
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 486, in step_with_batch_queue
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     model_output = future.result()
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 80, in result
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     return super().result()
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     return self.__get_result()
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     raise self._exception
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 84, in wait_for_response
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 351, in get_response
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     status, result = mq.dequeue(
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]                      ^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 616, in dequeue
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     return next(self.gen)
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 531, in acquire_read
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948]     raise RuntimeError("cancelled")
(EngineCore_DP0 pid=31) ERROR 02-06 11:49:13 [core.py:948] RuntimeError: cancelled
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693] Traceback (most recent call last):
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 649, in output_handler
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 894, in get_output_async
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 02-06 11:49:13 [async_llm.py:693] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=31) Process EngineCore_DP0:
(EngineCore_DP0 pid=31) Traceback (most recent call last):
(EngineCore_DP0 pid=31)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=31)     self.run()
(EngineCore_DP0 pid=31)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=31)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 950, in run_engine_core
(EngineCore_DP0 pid=31)     raise e
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 939, in run_engine_core
(EngineCore_DP0 pid=31)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 966, in run_busy_loop
(EngineCore_DP0 pid=31)     self._process_engine_step()
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 999, in _process_engine_step
(EngineCore_DP0 pid=31)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=31)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 486, in step_with_batch_queue
(EngineCore_DP0 pid=31)     model_output = future.result()
(EngineCore_DP0 pid=31)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 80, in result
(EngineCore_DP0 pid=31)     return super().result()
(EngineCore_DP0 pid=31)            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=31)     return self.__get_result()
(EngineCore_DP0 pid=31)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=31)     raise self._exception
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 84, in wait_for_response
(EngineCore_DP0 pid=31)     response = self.aggregate(get_response())
(EngineCore_DP0 pid=31)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 351, in get_response
(EngineCore_DP0 pid=31)     status, result = mq.dequeue(
(EngineCore_DP0 pid=31)                      ^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 616, in dequeue
(EngineCore_DP0 pid=31)     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=31)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=31)     return next(self.gen)
(EngineCore_DP0 pid=31)            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=31)   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 531, in acquire_read
(EngineCore_DP0 pid=31)     raise RuntimeError("cancelled")
(EngineCore_DP0 pid=31) RuntimeError: cancelled
(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     172.20.30.250:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.
(APIServer pid=1) INFO:     Finished server process [1]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@geraldstanje

geraldstanje commented Mar 8, 2026

Hi @renehonig @shahizat @mgoin, does your PR also fix the following when running gpt-oss-20b?

(EngineCore_DP0 pid=281) INFO 03-05 17:53:04 [cuda.py:367] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN'].
(EngineCore_DP0 pid=281) INFO 03-05 17:53:04 [mxfp4.py:157] Using Marlin backend

EngineCore_DP0 pid=280) WARNING 03-05 17:16:17 [marlin_utils_fp4.py:338] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.

And what about the following PR? #31089
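For context, the family-check arithmetic that this PR patches (as described in the PR summary) can be sketched as follows. The function names below are illustrative only, not the actual `current_platform` API in vLLM:

```python
# Illustrative sketch of the device-capability-family logic behind this fix.
# Names are hypothetical; vLLM's real check is current_platform.is_device_capability_family().

def capability_family(major: int, minor: int) -> int:
    """Reduce a compute capability like 12.0 -> 120 -> family 12."""
    return (major * 10 + minor) // 10

def supports_nvfp4_moe(major: int, minor: int) -> bool:
    # Before this PR only family 10 (B100/B200) was accepted, so SM12.0
    # (RTX Blackwell, family 120 // 10 = 12) failed the check and vLLM
    # raised "does not support current device sm_120".
    return capability_family(major, minor) in (10, 12)

assert capability_family(10, 0) == 10  # B200: family 10
assert capability_family(12, 0) == 12  # RTX PRO 6000 Blackwell: family 12
assert supports_nvfp4_moe(12, 0)       # passes only with the added SM120 check
```

This mirrors the root cause from the PR description: `120 // 10 = 12 != 10`, so a family-100 check alone rejects RTX Blackwell.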

@kaigouthro

kaigouthro commented Mar 17, 2026 via email


Labels

ci/build nvidia quantization ready ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug] NVFP4 MoE kernels fail on RTX Blackwell (SM12.0) - device capability family check missing SM120

8 participants