
[Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement #26197

Merged: youkaichao merged 6 commits into main from wentao-enable-e8m0-by-default on Oct 8, 2025.
Conversation

@yewentao256 (Member) commented Oct 3, 2025:

Purpose

  1. Reduce complexity by unifying the E8M0 environment variables for Hopper and Blackwell.
  2. Speed up end-to-end throughput.

Test

Unit Test

~/vllm-source/tests/kernels/moe$ pytest test_deepgemm.py

collected 12 items

test_deepgemm.py ............                                                                                   [100%]

============================================ 12 passed in 96.91s (0:01:36) ============================================

~/vllm-source/tests/kernels/moe$ pytest test_batched_deepgemm.py

collected 32 items

test_batched_deepgemm.py ................................                                                       [100%]

============================================ 32 passed in 79.44s (0:01:19) ============================================

Accuracy

lm_eval --model vllm --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto

# now
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8560|±  |0.0097|
|     |       |strict-match    |     5|exact_match||0.8939|±  |0.0085|

# main
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.8097|±  |0.0108|
|     |       |strict-match    |     5|exact_match||0.8870|±  |0.0087|

Performance

vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enable-expert-parallel

# now
Throughput: 35.84 requests/s, 39427.04 total tokens/s, 3584.28 output tokens/s
# main
Throughput: 33.89 requests/s, 37275.64 total tokens/s, 3388.69 output tokens/s
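As a sanity check on the figure in the title, the request-throughput numbers above imply roughly a 5.8% gain:

```python
# Headline numbers from the `vllm bench throughput` runs above.
now_rps = 35.84   # requests/s on this branch
main_rps = 33.89  # requests/s on main

gain = (now_rps - main_rps) / main_rps * 100
print(f"{gain:.1f}% throughput improvement")  # prints: 5.8% throughput improvement
```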

@gemini-code-assist bot (Contributor) left a comment:


Code Review

This pull request enables E8M0 by default on Hopper for DeepGEMM by unifying the environment variables for Hopper and Blackwell GPUs. The changes correctly remove the Hopper-specific environment variable VLLM_USE_DEEP_GEMM_E8M0_HOPPER and update the logic to use the generic VLLM_USE_DEEP_GEMM_E8M0 for both. My review includes suggestions to improve code maintainability by updating an outdated comment and removing a redundant conditional check.
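The unification the review describes can be sketched as follows. This is a minimal sketch assuming a simple truthy-env-var convention, not the actual `vllm.envs` plumbing; only the variable names `VLLM_USE_DEEP_GEMM_E8M0` and `VLLM_USE_DEEP_GEMM_E8M0_HOPPER` come from the PR itself:

```python
import os

def use_deep_gemm_e8m0(default: bool = True) -> bool:
    """Sketch of the unified gate: a single VLLM_USE_DEEP_GEMM_E8M0 variable
    covers both Hopper and Blackwell, replacing the Hopper-specific
    VLLM_USE_DEEP_GEMM_E8M0_HOPPER variable this PR removes."""
    raw = os.environ.get("VLLM_USE_DEEP_GEMM_E8M0")
    if raw is None:
        return default  # enabled by default after this PR
    return raw.strip().lower() in ("1", "true", "yes")
```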

yewentao256 and others added 3 commits October 3, 2025 21:39
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 force-pushed the wentao-enable-e8m0-by-default branch from 12ab8a8 to 2c5dcb7 on October 3, 2025.
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 3, 2025
@djmmoss (Contributor) commented Oct 3, 2025:

How does this work on Hopper?
@benchislett (Collaborator) commented:
What does this flag actually do on hopper? Looking through the DeepGEMM code at a glance, it seems like E8M0/disabled doesn't change any behaviour. Could you help me understand what this flag controls and how it leads to the speedup you measured?

@yewentao256 (Member, Author) commented:

> How does this work on Hopper?

https://github.com/deepseek-ai/DeepGEMM/blob/239112cb4cd4e52587c662624aee6beda8bd9518/csrc/jit_kernels/impls/smxx_layout.hpp#L113

I think `return get_mn_major_tma_aligned_tensor(sf);` doesn't change the e8m0 values; it just produces the TMA-aligned, MN-major tensor.

@yewentao256 (Member, Author) commented:

> What does this flag actually do on hopper? Looking through the DeepGEMM code at a glance, it seems like E8M0/disabled doesn't change any behaviour. Could you help me understand what this flag controls and how it leads to the speedup you measured?

E8M0 was previously disabled by default, which means that if the model doesn't have `scale_fmt: ue8m0` in its config, we would not convert scales to the e8m0 format, which is the slower path.

The conversion can be found in `requant_weight_ue8m0_inplace`.
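As a rough illustration only (not the actual `requant_weight_ue8m0_inplace` implementation), converting block scales to UE8M0 amounts to rounding each scale up to a power of two, since UE8M0 stores only an unsigned 8-bit exponent, and compensating the quantized values so dequantization is preserved:

```python
import math

def round_scale_to_ue8m0(scale: float) -> float:
    # UE8M0 keeps only an exponent, so representable scales are powers of two.
    # Rounding up keeps the compensation ratio <= 1, avoiding overflow.
    return 2.0 ** math.ceil(math.log2(scale))

def requant_block(weights: list[float], scale: float) -> tuple[list[float], float]:
    """Hypothetical per-block requantization: rescale quantized weights so
    dequantizing with the new power-of-two scale matches the old result."""
    new_scale = round_scale_to_ue8m0(scale)
    ratio = scale / new_scale  # <= 1.0 by construction
    return [w * ratio for w in weights], new_scale
```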

@benchislett (Collaborator) commented:

makes sense, thanks for the insight!

@yewentao256 yewentao256 changed the title [Feature] Enable E8M0 by Default on Hopper for DeepGEMM [Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement Oct 6, 2025
@yewentao256 (Member, Author) commented:
@youkaichao CC

@bnellnm (Collaborator) left a comment:

LGTM

@youkaichao youkaichao merged commit f860786 into main Oct 8, 2025
50 checks passed
@youkaichao youkaichao deleted the wentao-enable-e8m0-by-default branch October 8, 2025 07:33
@youkaichao (Member) commented:
To clarify, there are two separate things: activation e8m0 and weight e8m0.

On Hopper, we should only use activation e8m0 when the model config says `scale_fmt: ue8m0`, which applies only to DeepSeek V3.1 or later.
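The rule stated above can be sketched as a simple config check. This is a hypothetical helper, not the real vLLM code path; only the `scale_fmt: ue8m0` key/value comes from the discussion:

```python
def wants_activation_ue8m0(quant_config: dict) -> bool:
    """On Hopper, use activation e8m0 only when the checkpoint's quantization
    config declares scale_fmt: ue8m0 (DeepSeek V3.1 or later)."""
    return quant_config.get("scale_fmt") == "ue8m0"
```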

@mgoin (Member) commented Oct 8, 2025:
@youkaichao why did you merge this then? @yewentao256 showed improvements for Qwen/Qwen3-30B-A3B-FP8, which doesn't have `scale_fmt: ue8m0`, so my understanding is that this PR applies activation e8m0 for any model using DeepGEMM on Hopper.

Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request (Oct 14, 2025).
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request (Oct 20, 2025).
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request (Oct 24, 2025).
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request (Nov 10, 2025).
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request (Nov 29, 2025).