[Bugfix] Fix import gemm_afp4wfp4 failure on AMD#26068
[Bugfix] Fix import gemm_afp4wfp4 failure on AMD#26068yeqcharlotte merged 2 commits intovllm-project:mainfrom
gemm_afp4wfp4 failure on AMD#26068Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses a crash on AMD platforms caused by an attempt to JIT-compile a CUDA-only Triton kernel. The fix, which involves catching the resulting AttributeError, is correct and effectively resolves the immediate issue. In my review, I've also identified a latent bug within the same try-except block that could lead to a NameError under different import failure scenarios. I've provided a detailed explanation of this issue and recommended a refactoring approach to improve the robustness of this platform-specific import handling.
| except (ImportError, AttributeError): | ||
| dynamic_mxfp4_quant = gemm_afp4wfp4 = None |
There was a problem hiding this comment.
While adding AttributeError correctly fixes the crash on AMD, this try-except block has a latent bug. It is too broad, covering lines 26 to 98, but the except block only handles dynamic_mxfp4_quant and gemm_afp4wfp4.
If an ImportError or AttributeError occurs during the import of other symbols like shuffle_weight, or the conditional imports of gemm_a4w4 and per_1x32_f4_quant_hip, they will be undefined. This can lead to a NameError later when they are used, for example in process_weights_after_loading or gemm_with_dynamic_quant.
A more robust approach would be to refactor this. A good approach would be to have smaller, more focused try-except blocks for different sets of imports (e.g., common, ROCm-specific, CUDA-specific) and ensure all imported symbols are properly handled in their respective except blocks. This would also require adding checks in QuarkW4A4MXFP4.__init__ to ensure the required kernels are available for the selected execution path.
yewentao256
left a comment
There was a problem hiding this comment.
LGTM, thanks for the fix!
a168f7d to
9700c13
Compare
Signed-off-by: zhewenli <zhewenli@meta.com>
9700c13 to
1fbd9ae
Compare
Signed-off-by: zhewenli <zhewenli@meta.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: zhewenli <zhewenli@meta.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: zhewenli <zhewenli@meta.com> Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Purpose
We are seeing these error after #25135 on AMD MI300X:
gemm_afp4wfp4is a CUDA only triton kernel, so it will crash under JIT-compile with ROCMTest Plan