
[Bugfix] Fix Dtypes for Pynccl Wrapper#33030

Merged
robertgshaw2-redhat merged 5 commits into main from fix-fp8-dtype-sending
Jan 26, 2026

Conversation

@robertgshaw2-redhat (Collaborator) commented Jan 25, 2026

Purpose

Test Plan

  • ci

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Robert Shaw <robshaw@redhat.com>
@mergify mergify bot added the "bug" (Something isn't working) label Jan 25, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request adds support for fp8 data types in the pynccl wrapper, which is necessary for distributed communication with fp8 tensors. The changes correctly add ncclFloat8e4m3 to the data type enum and handle torch.float8_e4m3fn in the type conversion logic.

My review includes a suggestion to also handle the torch.float8_e4m3fnuz variant, as it appears to be used in other parts of the codebase, to prevent potential runtime errors. This will make the fp8 support more robust.
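The enum-plus-conversion structure described above can be sketched as follows. This is a minimal illustration, not the actual vLLM code: dtype names are used as string keys so the example runs without torch (the real wrapper keys on `torch.dtype` objects), and the numeric values follow recent `nccl.h` conventions but should be treated as assumptions here.

```python
# Illustrative sketch of a torch-dtype -> NCCL-dtype mapping, as described
# in the review summary above. Not the actual vLLM implementation.
class ncclDataTypeEnum:
    # Values mirror NCCL's ncclDataType_t ordering (assumed from nccl.h).
    ncclInt8 = 0
    ncclUint8 = 1
    ncclInt32 = 2
    ncclUint32 = 3
    ncclInt64 = 4
    ncclUint64 = 5
    ncclFloat16 = 6
    ncclFloat32 = 7
    ncclFloat64 = 8
    ncclBfloat16 = 9
    ncclFloat8e4m3 = 10  # the entry this PR adds

    # String keys stand in for torch.dtype objects in this sketch.
    _FROM_NAME = {
        "int8": ncclInt8,
        "uint8": ncclUint8,
        "int32": ncclInt32,
        "int64": ncclInt64,
        "float16": ncclFloat16,
        "float32": ncclFloat32,
        "float64": ncclFloat64,
        "bfloat16": ncclBfloat16,
        "float8_e4m3fn": ncclFloat8e4m3,  # the case this PR handles
    }

    @classmethod
    def from_name(cls, name: str) -> int:
        try:
            return cls._FROM_NAME[name]
        except KeyError:
            raise ValueError(
                f"Unsupported dtype {name}: should be one of "
                "int8, uint8, int32, int64, float16, float32, float64, "
                "bfloat16, float8_e4m3fn."
            ) from None
```

A lookup table with an explicit error branch keeps unsupported dtypes failing loudly at the collective-op call site rather than deep inside NCCL.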

Comment on lines +96 to 102:

```diff
         if dtype == torch.float8_e4m3fn:
             return cls.ncclFloat8e4m3
         raise ValueError(
             f"Unsupported dtype {dtype}: should be one of "
-            f"int8, uint8, int32, int64, float16, float32, float64, bfloat16."
+            f"int8, uint8, int32, int64, float16, float32, float64, bfloat16,"
+            " float8e4m3."
         )
```

Severity: high

The codebase, for example in vllm/model_executor/layers/quantization/utils/fp8_utils.py, seems to use both torch.float8_e4m3fn and torch.float8_e4m3fnuz. This function should handle both types to avoid ValueError during collective communication operations with torch.float8_e4m3fnuz tensors. The error message is also updated for clarity.

For more complete FP8 support, you might also consider adding torch.float8_e5m2. This would involve adding ncclFloat8e5m2 to ncclDataTypeEnum and handling torch.float8_e5m2 in this method.

Suggested change:

```diff
-        if dtype == torch.float8_e4m3fn:
-            return cls.ncclFloat8e4m3
-        raise ValueError(
-            f"Unsupported dtype {dtype}: should be one of "
-            f"int8, uint8, int32, int64, float16, float32, float64, bfloat16,"
-            " float8e4m3."
-        )
+        if dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz):
+            return cls.ncclFloat8e4m3
+        raise ValueError(
+            f"Unsupported dtype {dtype}: should be one of "
+            "int8, uint8, int32, int64, float16, float32, float64, bfloat16, "
+            "float8_e4m3fn, float8_e4m3fnuz."
+        )
```
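The reviewer's suggestion above amounts to mapping both torch e4m3 variants to the same NCCL wire type, with e5m2 getting its own entry. A dependency-free sketch of that idea (dtype names as string keys, enum values assumed from recent `nccl.h`; illustrative only, not the vLLM code):

```python
# Sketch of the reviewer's suggestion: both torch e4m3 variants share one
# NCCL data type, and e5m2 would need its own enum entry (ncclFloat8e5m2).
# Numeric values are assumptions based on recent nccl.h.
NCCL_FLOAT8_E4M3 = 10
NCCL_FLOAT8_E5M2 = 11

FP8_NAME_TO_NCCL = {
    "float8_e4m3fn": NCCL_FLOAT8_E4M3,    # OCP e4m3 as used on NVIDIA GPUs
    "float8_e4m3fnuz": NCCL_FLOAT8_E4M3,  # ROCm e4m3 variant, same byte width
    "float8_e5m2": NCCL_FLOAT8_E5M2,      # wider exponent, lower precision
}

def fp8_nccl_type(dtype_name: str) -> int:
    """Resolve an fp8 dtype name to its (assumed) NCCL data type value."""
    try:
        return FP8_NAME_TO_NCCL[dtype_name]
    except KeyError:
        raise ValueError(f"Unsupported fp8 dtype: {dtype_name}") from None
```

Since both e4m3 variants are one byte wide, reusing the same wire type for collective communication is plausible, though the two formats do differ in bit semantics (fnuz has no negative zero or infinities).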

@robertgshaw2-redhat robertgshaw2-redhat changed the title from "[Bugfix] Add Support for Fp8 In Pynccl Wrapper" to "[Bugfix] Fix Dtypes for Pynccl Wrapper" Jan 25, 2026
@robertgshaw2-redhat robertgshaw2-redhat added the "ready" (ONLY add when PR is ready to merge/full CI is needed) label Jan 25, 2026
@LucasWilkinson LucasWilkinson (Collaborator) left a comment

LGTM thanks for the quick fix!

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@mergify mergify bot added the nvidia label Jan 25, 2026
@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Jan 25, 2026
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) January 25, 2026 15:12
@LopezCastroRoberto (Contributor) commented

LGTM too. Thanks for the fix, Rob

@robertgshaw2-redhat (Collaborator, Author) commented

updating main for CI

@robertgshaw2-redhat robertgshaw2-redhat merged commit 43a013c into main Jan 26, 2026
62 checks passed
@robertgshaw2-redhat robertgshaw2-redhat deleted the fix-fp8-dtype-sending branch January 26, 2026 20:09
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Jan 26, 2026
khluu pushed a commit that referenced this pull request Jan 26, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
(cherry picked from commit 43a013c)
apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>

Labels

  • bug (Something isn't working)
  • nvidia
  • ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[CI Failure]: MoE Integration Tests

4 participants