
[Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement #26197

Merged: youkaichao merged 6 commits into main from wentao-enable-e8m0-by-default on Oct 8, 2025.
Conversation

@yewentao256 (Member) commented Oct 3, 2025:

Purpose

  1. Reduce complexity by unifying the E8M0 environment variables for Hopper and Blackwell.
  2. Speed up end-to-end throughput.

Test

Unit Test

~/vllm-source/tests/kernels/moe$ pytest test_deepgemm.py

collected 12 items

test_deepgemm.py ............                                                                                   [100%]

============================================ 12 passed in 96.91s (0:01:36) ============================================

~/vllm-source/tests/kernels/moe$ pytest test_batched_deepgemm.py

collected 32 items

test_batched_deepgemm.py ................................                                                       [100%]

============================================ 32 passed in 79.44s (0:01:19) ============================================

Accuracy

lm_eval --model vllm --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

# now
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8560|±  |0.0097|
|     |       |strict-match    |     5|exact_match||0.8939|±  |0.0085|

# main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8097|±  |0.0108|
|     |       |strict-match    |     5|exact_match||0.8870|±  |0.0087|

Performance

vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enable-expert-parallel

# now
Throughput: 35.84 requests/s, 39427.04 total tokens/s, 3584.28 output tokens/s
# main
Throughput: 33.89 requests/s, 37275.64 total tokens/s, 3388.69 output tokens/s
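As a sanity check on the figure in the title, the request-throughput numbers above imply roughly a 5.8% gain:

```python
# Headline numbers from the `vllm bench throughput` runs above.
now_rps = 35.84   # requests/s on this branch
main_rps = 33.89  # requests/s on main

gain = (now_rps - main_rps) / main_rps * 100
print(f"{gain:.1f}% throughput improvement")  # prints: 5.8% throughput improvement
```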

@gemini-code-assist bot (Contributor) left a comment:


Code Review

This pull request enables E8M0 by default on Hopper for DeepGEMM by unifying the environment variables for Hopper and Blackwell GPUs. The changes correctly remove the Hopper-specific environment variable VLLM_USE_DEEP_GEMM_E8M0_HOPPER and update the logic to use the generic VLLM_USE_DEEP_GEMM_E8M0 for both. My review includes suggestions to improve code maintainability by updating an outdated comment and removing a redundant conditional check.
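The unification the review describes can be sketched as follows. This is a minimal sketch assuming a simple truthy-env-var convention, not the actual `vllm.envs` plumbing; only the variable names `VLLM_USE_DEEP_GEMM_E8M0` and `VLLM_USE_DEEP_GEMM_E8M0_HOPPER` come from the PR itself:

```python
import os

def use_deep_gemm_e8m0(default: bool = True) -> bool:
    """Sketch of the unified gate: a single VLLM_USE_DEEP_GEMM_E8M0 variable
    covers both Hopper and Blackwell, replacing the Hopper-specific
    VLLM_USE_DEEP_GEMM_E8M0_HOPPER variable this PR removes."""
    raw = os.environ.get("VLLM_USE_DEEP_GEMM_E8M0")
    if raw is None:
        return default  # enabled by default after this PR
    return raw.strip().lower() in ("1", "true", "yes")
```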

yewentao256 and others added 3 commits October 3, 2025 21:39
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 force-pushed the wentao-enable-e8m0-by-default branch from 12ab8a8 to 2c5dcb7 on October 3, 2025.
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 3, 2025
@djmmoss (Contributor) commented Oct 3, 2025:

How does this work on Hopper?
@benchislett (Collaborator) commented:
What does this flag actually do on hopper? Looking through the DeepGEMM code at a glance, it seems like E8M0/disabled doesn't change any behaviour. Could you help me understand what this flag controls and how it leads to the speedup you measured?

@yewentao256 (Member, Author) commented:

> How does this work on Hopper?

https://github.com/deepseek-ai/DeepGEMM/blob/239112cb4cd4e52587c662624aee6beda8bd9518/csrc/jit_kernels/impls/smxx_layout.hpp#L113

I think `return get_mn_major_tma_aligned_tensor(sf);` doesn't change the e8m0 values; it just produces the TMA-aligned, MN-major tensor.

@yewentao256 (Member, Author) commented:

> What does this flag actually do on hopper? Looking through the DeepGEMM code at a glance, it seems like E8M0/disabled doesn't change any behaviour. Could you help me understand what this flag controls and how it leads to the speedup you measured?

E8M0 was previously disabled by default, which means that if the model doesn't have `scale_fmt: ue8m0` in its config, we would not convert scales to the e8m0 format, which is the slower path.

The conversion can be found in `requant_weight_ue8m0_inplace`.
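As a rough illustration only (not the actual `requant_weight_ue8m0_inplace` implementation), converting block scales to UE8M0 amounts to rounding each scale up to a power of two, since UE8M0 stores only an unsigned 8-bit exponent, and compensating the quantized values so dequantization is preserved:

```python
import math

def round_scale_to_ue8m0(scale: float) -> float:
    # UE8M0 keeps only an exponent, so representable scales are powers of two.
    # Rounding up keeps the compensation ratio <= 1, avoiding overflow.
    return 2.0 ** math.ceil(math.log2(scale))

def requant_block(weights: list[float], scale: float) -> tuple[list[float], float]:
    """Hypothetical per-block requantization: rescale quantized weights so
    dequantizing with the new power-of-two scale matches the old result."""
    new_scale = round_scale_to_ue8m0(scale)
    ratio = scale / new_scale  # <= 1.0 by construction
    return [w * ratio for w in weights], new_scale
```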

@benchislett (Collaborator) commented:

makes sense, thanks for the insight!

@yewentao256 yewentao256 changed the title [Feature] Enable E8M0 by Default on Hopper for DeepGEMM [Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement Oct 6, 2025
@yewentao256 (Member, Author) commented:
@youkaichao CC

@bnellnm (Collaborator) left a comment:

LGTM

@youkaichao youkaichao merged commit f860786 into main Oct 8, 2025
50 checks passed
@youkaichao youkaichao deleted the wentao-enable-e8m0-by-default branch October 8, 2025 07:33
@youkaichao (Member) commented:
To clarify, there are two separate things: activation e8m0 and weight e8m0.

On Hopper, we should only use activation e8m0 when the model config says `scale_fmt: ue8m0`, which applies only to DeepSeek V3.1 or later.
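The rule stated above can be sketched as a simple config check. This is a hypothetical helper, not the real vLLM code path; only the `scale_fmt: ue8m0` key/value comes from the discussion:

```python
def wants_activation_ue8m0(quant_config: dict) -> bool:
    """On Hopper, use activation e8m0 only when the checkpoint's quantization
    config declares scale_fmt: ue8m0 (DeepSeek V3.1 or later)."""
    return quant_config.get("scale_fmt") == "ue8m0"
```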

@mgoin (Member) commented Oct 8, 2025:
@youkaichao why did you merge this then? @yewentao256 showed improvements for Qwen/Qwen3-30B-A3B-FP8, which doesn't have `scale_fmt: ue8m0`, so my understanding is that this PR applies activation e8m0 for any model using DeepGEMM on Hopper.

Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request (Oct 14, 2025).
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request (Oct 20, 2025).
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request (Oct 24, 2025).
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request (Nov 10, 2025).
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request (Nov 29, 2025).