
[Bugfix] Fix DeepGEMM after #29546 #30267

Merged
yeqcharlotte merged 4 commits into vllm-project:main from zhewenl:fix-ue8m0-hopper
Dec 9, 2025

Conversation

@zhewenl
Collaborator

@zhewenl zhewenl commented Dec 8, 2025

Purpose

Fixes a Hopper incompatibility introduced in #29546: the _run_deepgemm method was changed to always use
per_token_group_quant_fp8_packed_for_deepgemm, which produces packed int32 scales, a format that is only supported on Blackwell:


(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]     return self.w8a8_block_fp8_linear.apply(
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 255, in apply
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]     output = self._run_deepgemm(input_2d, weight, weight_scale)
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 282, in _run_deepgemm
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]     torch.ops.vllm.fp8_gemm_nt_op(
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]     return self._op(*args, **kwargs)
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 170, in _fp8_gemm_nt_op
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]     fp8_gemm_nt(
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py", line 186, in fp8_gemm_nt
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]     return _fp8_gemm_nt_impl(*args, disable_ue8m0_cast=not use_ue8m0, **kwargs)
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP1_TP0_EP2 pid=918) ERROR 12-07 23:48:29 [multiproc_executor.py:822] RuntimeError: Assertion error (csrc/apis/../jit_kernels/impls/../heuristics/../../utils/layout.hpp:49): sfa_dtype == torch::kFloat and sfb_dtype == torch::kFloat
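The assertion fires because DeepGEMM on Hopper expects plain float32 scale factors, while the packed quant op emits int32 UE8M0 scales. A minimal sketch of the kind of capability gate this implies (function and return names here are illustrative, not vLLM's actual API):

```python
def select_scale_format(sm_major: int, use_ue8m0: bool) -> str:
    """Pick the scale-factor format for the DeepGEMM input quant op.

    Hypothetical helper: packed UE8M0 (int32) scales are only supported on
    Blackwell (SM100+); on Hopper (SM90) DeepGEMM asserts that both scale
    tensors are float32, so the packed path must not be selected there.
    """
    if use_ue8m0 and sm_major >= 10:
        return "int32_packed_ue8m0"
    return "float32"


# Hopper must fall back to float32 scales even if UE8M0 was requested.
assert select_scale_format(9, use_ue8m0=True) == "float32"
assert select_scale_format(10, use_ue8m0=True) == "int32_packed_ue8m0"
```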

Test Plan

Passed CI: https://buildkite.com/vllm/ci/builds/42493#019aff0f-8d6e-4499-907f-010fc5594fc7

Signed-off-by: zhewenli <zhewenli@meta.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request aims to fix an issue with DeepGEMM by using a configured flag self.use_deep_gemm_e8m0 instead of a hardcoded True for the use_ue8m0 parameter. While this change is a step in the right direction, it exposes a critical logic flaw where the DeepGEMM path can be taken even when E8M0 is not enabled, which will lead to an assertion failure. I've added a comment with details on how to address this.
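The flaw described above boils down to two flags that must agree: whether the kernel is invoked with use_ue8m0, and whether the input quant op emits packed scales. A sketch of such a consistency check (names are hypothetical, not the actual vLLM code):

```python
def should_take_deepgemm_path(use_deep_gemm_e8m0: bool,
                              quant_op_emits_packed_scales: bool) -> bool:
    """Hypothetical guard: only dispatch to DeepGEMM when the kernel's
    UE8M0 flag matches the scale format the quant op will produce.

    If the quant op emits packed int32 UE8M0 scales but the kernel is
    called with use_ue8m0=False (disable_ue8m0_cast=True), DeepGEMM
    asserts that scales are float32 and crashes.
    """
    return use_deep_gemm_e8m0 == quant_op_emits_packed_scales


assert should_take_deepgemm_path(True, True)       # packed scales + UE8M0: ok
assert not should_take_deepgemm_path(False, True)  # the crashing combination
assert should_take_deepgemm_path(False, False)     # float scales, no UE8M0: ok
```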

Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: Zhewen Li <zhewenli@meta.com>
@zhewenl zhewenl changed the title [WIP] Fix DeepGEMM after #29646 [Bugfix] Fix DeepGEMM after #29546 Dec 8, 2025
@zhewenl zhewenl marked this pull request as ready for review December 8, 2025 19:13
Member

@yewentao256 yewentao256 left a comment


LGTM, thanks for the work!
One small update before landing:

@@ -269,11 +270,14 @@ def _run_deepgemm(
        weight_scale: torch.Tensor,
    ) -> torch.Tensor:
        assert self.deepgemm_input_quant_op is not None

Suggested change (delete this line):
    assert self.deepgemm_input_quant_op is not None
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 8, 2025
Signed-off-by: Zhewen Li <zhewenli@meta.com>
@yeqcharlotte yeqcharlotte enabled auto-merge (squash) December 9, 2025 00:57
@yeqcharlotte yeqcharlotte disabled auto-merge December 9, 2025 00:57
@yeqcharlotte yeqcharlotte enabled auto-merge (squash) December 9, 2025 00:57
@yeqcharlotte yeqcharlotte merged commit ae339b1 into vllm-project:main Dec 9, 2025
55 checks passed
@zhewenl zhewenl deleted the fix-ue8m0-hopper branch December 9, 2025 01:28
@zhewenl
Collaborator Author

zhewenl commented Dec 9, 2025

This PR is causing Dynamo compilation issues, fixing in #30336

dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: Zhewen Li <zhewenli@meta.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed

3 participants