[Bugfix][ROCm] Include float8_e4m3fnuz in NCCL Dtype Dispatching#33713

Merged
vllm-bot merged 3 commits into vllm-project:main from ROCm:micah/nccl-dtype
Feb 4, 2026

Conversation

@micah-wil (Contributor) commented Feb 3, 2026

#33030 fixed dtypes in the Pynccl wrapper, but omitted the case for float8_e4m3fnuz, which is used on MI300/MI325.

Repro command:

vllm serve QWen/Qwen3-30B-A3B-FP8 --enforce-eager --enable-eplb --all2all-backend allgather_reducescatter --eplb-config '{"window_size":10, "step_interval":100, "num_redundant_experts":0, "log_balancedness":true}' --tensor-parallel-size 2 --data-parallel-size 2  --enable-expert-parallel

When running this on main, I am seeing the following:

RuntimeError: Worker failed with error 'Unsupported dtype torch.float8_e4m3fnuz: should be one of int8, uint8, int32, int64, float16, float32, float64, bfloat16, float8e4m3.', please check the stack trace above for the root cause

With this PR, the server starts as expected. This fixes the Qwen3-30B-A3B-FP8-block Accuracy failure on AMD CI. When running bash .buildkite/scripts/scheduled_integration_test/qwen30b_a3b_fp8_block_ep_eplb.sh 0.8 200 8020 on MI300X, I am now seeing:

Evaluating: 100%|████████| 200/200 [00:58<00:00,  3.44it/s]

Results:
Accuracy: 0.915
Invalid responses: 0.000
Total latency: 58.083 s
Questions per second: 3.443
Total output tokens: 21947
Output tokens per second: 377.858

RCCL only defines ncclFloat8e4m3 without distinguishing between the fn and fnuz variants, which is why both torch dtype variants map to the same NCCL dtype.
https://github.com/ROCm/rocm-systems/blob/0334750b74d14d92102196d9bd435d3ca4fc67ed/projects/rccl/src/nccl.h.in#L468

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@mergify mergify bot added rocm Related to AMD ROCm bug Something isn't working labels Feb 3, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 3, 2026
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses a bug on ROCm platforms where torch.float8_e4m3fnuz was not supported for NCCL dtype dispatching, causing a RuntimeError on MI300/325 hardware. The change correctly adds this dtype to the mapping in vllm/distributed/device_communicators/pynccl_wrapper.py. This ensures both float8_e4m3fn and float8_e4m3fnuz variants map to ncclFloat8e4m3, which is the correct behavior as explained in the pull request description. The fix is targeted, correct, and resolves the issue.

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@gshtras gshtras enabled auto-merge (squash) February 3, 2026 22:24
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 3, 2026
@vllm-bot vllm-bot merged commit 1d367a7 into vllm-project:main Feb 4, 2026
48 of 49 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 4, 2026
gameofdimension pushed a commit to gameofdimension/vllm that referenced this pull request Feb 5, 2026
…m-project#33713)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: felix01.yu <felix01.yu@vipshop.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…m-project#33713)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
…m-project#33713)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done


3 participants