Conversation
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Code Review
This pull request correctly addresses an issue with mxfp4 padding alignment on ROCm devices in the fused MoE layer. The original implementation incorrectly applied a hardcoded padding of 256 bytes for all ROCm devices, whereas this is only required for gfx950. The fix introduces a new utility function, get_padding_alignment, which dynamically determines the correct padding (128 or 256 bytes) based on the specific ROCm GPU architecture. This change is well-implemented and aligns with the intended behavior. The author's note about deduplicating this padding logic in the future is a good next step for improving code maintainability.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
    hidden_size = round_up(hidden_size, 256)
elif current_platform.is_rocm():
    pad_align = get_padding_alignment()
    hidden_size = round_up(hidden_size, pad_align)
```
Missing triton availability check in ROCm padding
Low Severity
The new ROCm branch calls get_padding_alignment() which accesses triton.runtime.driver.active.get_current_target().arch without verifying triton is properly available. If has_triton_kernels() is False on ROCm, get_mxfp4_backend() returns Mxfp4Backend.NONE, but the code still enters the elif current_platform.is_rocm(): branch. When triton isn't properly initialized (no active drivers), the triton object is a placeholder that lacks a runtime attribute, causing an AttributeError. The original code used a hardcoded value of 256 and didn't have this dependency.
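A defensive variant along the lines of this report would wrap the triton access and fall back to the original hardcoded value. This is only a sketch; the function name, the broad exception guard, and the 256 fallback are assumptions for illustration, not vLLM's actual code:

```python
def get_padding_alignment_safe() -> int:
    """Sketch: arch-aware padding with a fallback when triton is unusable."""
    try:
        import triton
        # On a placeholder triton object (no active driver) this raises
        # AttributeError, which we treat the same as triton being absent.
        arch = str(triton.runtime.driver.active.get_current_target().arch)
    except Exception:
        return 256  # original hardcoded value as a safe fallback
    # gfx950 needs 256-byte padding for preshuffle; 128 suffices elsewhere
    return 256 if "gfx950" in arch else 128
```

On a machine without a working triton installation this simply returns 256, avoiding the `AttributeError` described above.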
Closing in favor of #34285
This line in vllm/model_executor/layers/fused_moe/layer.py, originally from #22421, pads the mxfp4 hidden_size on ROCm to a multiple of 256. However, we only need to pad to 256 for preshuffle reasons on gfx950; for all other cases, e.g. gfx942, padding to 128 is sufficient to avoid masked loads. See #28024, which is doing this correctly.
Ideally we should deduplicate the two padding functions across https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py and https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/mxfp4.py, since these are more or less identical. This will also be deduplicated/fixed in #30647; this PR fixes the individual issue until the other PR is merged.
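For illustration, the arch-dependent choice described above can be reduced to a pure helper. The names here are hypothetical stand-ins for the PR's `get_padding_alignment` and vLLM's `round_up` utility, not the actual implementation:

```python
def round_up(x: int, multiple: int) -> int:
    # Round x up to the nearest multiple (standard integer rounding utility)
    return ((x + multiple - 1) // multiple) * multiple

def padding_alignment_for_arch(arch: str) -> int:
    # gfx950 requires 256-byte padding for preshuffle reasons;
    # 128 avoids masked loads on other ROCm archs such as gfx942
    return 256 if "gfx950" in arch else 128

hidden_size = 2880
pad = padding_alignment_for_arch("gfx942")   # -> 128
hidden_size = round_up(hidden_size, pad)     # 2880 -> 2944
```

On gfx942 this pads 2880 up to 2944 (a multiple of 128) instead of 3072 (the next multiple of 256), which is the waste the fix avoids.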
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.