[Bugfix] Fix Dtypes for Pynccl Wrapper#33030
Conversation
Code Review
This pull request adds support for fp8 data types in the pynccl wrapper, which is necessary for distributed communication with fp8 tensors. The changes correctly add ncclFloat8e4m3 to the data type enum and handle torch.float8_e4m3fn in the type conversion logic.
My review includes a suggestion to also handle the torch.float8_e4m3fnuz variant, as it appears to be used in other parts of the codebase, to prevent potential runtime errors. This will make the fp8 support more robust.
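For context, the change extends the wrapper's dtype enum. Below is a minimal, self-contained sketch of what such an enum looks like; the numeric values mirror NCCL's `ncclDataType_t` header as I understand it, but treat them as assumptions if your NCCL version differs.

```python
# Sketch of a pynccl-style dtype enum extended with fp8 (e4m3).
# Values are assumed to match NCCL's ncclDataType_t; verify against
# your installed NCCL headers before relying on them.
class ncclDataTypeEnum:
    ncclInt8 = 0
    ncclUint8 = 1
    ncclInt32 = 2
    ncclUint32 = 3
    ncclInt64 = 4
    ncclUint64 = 5
    ncclFloat16 = 6
    ncclFloat32 = 7
    ncclFloat64 = 8
    ncclBfloat16 = 9
    ncclFloat8e4m3 = 10  # added by this PR
    ncclFloat8e5m2 = 11  # possible follow-up, per the review below
```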
```diff
 if dtype == torch.float8_e4m3fn:
     return cls.ncclFloat8e4m3
 raise ValueError(
     f"Unsupported dtype {dtype}: should be one of "
-    f"int8, uint8, int32, int64, float16, float32, float64, bfloat16."
+    f"int8, uint8, int32, int64, float16, float32, float64, bfloat16,"
+    " float8e4m3."
 )
```
The codebase, for example in vllm/model_executor/layers/quantization/utils/fp8_utils.py, uses both torch.float8_e4m3fn and torch.float8_e4m3fnuz. This function should handle both types to avoid a ValueError during collective communication operations with torch.float8_e4m3fnuz tensors. The error message is also updated for clarity.
For more complete FP8 support, you might also consider adding torch.float8_e5m2. This would involve adding ncclFloat8e5m2 to ncclDataTypeEnum and handling torch.float8_e5m2 in this method.
Suggested change:
```diff
-if dtype == torch.float8_e4m3fn:
-    return cls.ncclFloat8e4m3
-raise ValueError(
-    f"Unsupported dtype {dtype}: should be one of "
-    f"int8, uint8, int32, int64, float16, float32, float64, bfloat16,"
-    " float8e4m3."
-)
+if dtype in (torch.float8_e4m3fn, torch.float8_e4m3fnuz):
+    return cls.ncclFloat8e4m3
+raise ValueError(
+    f"Unsupported dtype {dtype}: should be one of "
+    "int8, uint8, int32, int64, float16, float32, float64, bfloat16, "
+    "float8_e4m3fn, "
+    "float8_e4m3fnuz."
+)
```
LucasWilkinson left a comment
LGTM thanks for the quick fix!
Signed-off-by: Robert Shaw <robshaw@redhat.com>
LGTM too. Thanks for the fix, Rob
updating main for CI
Signed-off-by: Robert Shaw <robshaw@redhat.com> Co-authored-by: Robert Shaw <robshaw@redhat.com> (cherry picked from commit 43a013c)
Purpose
Test Plan
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.