[Bugfix] Fix fp8 DeepGemm compilation issues#30336
Conversation
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Code Review
This pull request aims to fix compilation issues for fp8 DeepGemm by refactoring how device capabilities are checked. While the changes in vllm/utils/deep_gemm.py are correct, a critical bug has been introduced in vllm/model_executor/layers/quantization/utils/fp8_utils.py. A line of code was moved out of an else block, which will cause incorrect quantization behavior. This needs to be addressed.
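The regression described above can be illustrated with a minimal, hypothetical sketch (the function and variable names below are illustrative, not the actual vLLM code): a statement that was meant to run only on the fallback path now runs unconditionally because it was moved out of the `else` block.

```python
# Hypothetical illustration of the reported bug pattern: a line that
# belonged inside an `else` block was moved out of it, so it now
# executes on every path and clobbers the quantized scale.

def quantize_correct(use_ue8m0: bool, scale: float) -> float:
    if use_ue8m0:
        scale = round(scale)   # UE8M0 path: snap to an integer scale
    else:
        scale = scale * 2.0    # fallback path only
    return scale


def quantize_buggy(use_ue8m0: bool, scale: float) -> float:
    if use_ue8m0:
        scale = round(scale)
    scale = scale * 2.0        # moved out of `else`: now always runs
    return scale
```

On the UE8M0 path the buggy version applies the fallback transformation on top of the correct one, which is exactly the "incorrect quantization behavior" the review flags.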
yewentao256 left a comment
Why can't `is_deepgemm_e8m0` be used? If it's because of the `@cache`, I'm thinking we should have something like #29038 instead of hardcoding `is_blackwell`.
@yewentao256 It's because of
Got it, please refactor the class
@yewentao256 Isn't Alternatively, before cleaning up this PR, I had implemented this kind of change: aea97d1, but it felt a bit superfluous to me
I think we don't actually need the static method; we can just refactor it into a normal function. CC @varun-sundar-rabindranath as well.
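The refactor being discussed can be sketched as follows (a hedged sketch, not the actual vLLM implementation; the class and attribute names are assumptions): instead of calling an `@cache`-decorated helper inside the compiled forward path, the capability checks are evaluated once at construction time and stored as plain booleans, which Dynamo can read without tracing through the cached call.

```python
from functools import cache


@cache
def is_deep_gemm_e8m0_used() -> bool:
    # Stand-in for the real helper; the @cache wrapper is the part that
    # torch.compile/Dynamo has trouble tracing through at runtime.
    return True


class Fp8LinearOp:  # hypothetical class name for illustration
    def __init__(self) -> None:
        # Evaluate the checks eagerly, once. Plain bool attributes are
        # safe to read inside a compiled region.
        self.use_deep_gemm_e8m0 = is_deep_gemm_e8m0_used()
        self.is_blackwell = False  # e.g. a one-time device-capability check

    def scale_fmt(self) -> str:
        # Hot path: only attribute reads, no calls to cached helpers.
        if self.use_deep_gemm_e8m0 and self.is_blackwell:
            return "ue8m0-packed"
        return "plain"
```

The design trade-off raised in the thread is that the hardcoded attributes duplicate logic the helper already owns, which is why a follow-up refactor is suggested.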
```diff
      weight_scale: torch.Tensor,
  ) -> torch.Tensor:
-     if DeepGemmQuantScaleFMT.from_oracle() == DeepGemmQuantScaleFMT.UE8M0:
+     if self.use_deep_gemm_e8m0 and self.is_blackwell:
```
Shouldn't e8m0 also be compatible with Hopper?
IIUC this is specifically for the case where e8m0 scales need to be packed, which is a Blackwell-only case
OK, let's land this first since it is a blocker for CI; I can do a follow-up PR to refactor the class later.
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
`is_deep_gemm_e8m0_used()` and `current_platform.is_device_capability()` are not compatible with Dynamo, causing failed compilations. This PR intends to fix this problem.

Testing
Run inference on `Qwen/Qwen3-30B-A3B-FP8` (one of the models affected) with `VLLM_USE_DEEP_GEMM=1`.