[Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement#26197
youkaichao merged 6 commits into main
Conversation
Code Review
This pull request enables E8M0 by default on Hopper for DeepGEMM by unifying the environment variables for Hopper and Blackwell GPUs. The changes correctly remove the Hopper-specific environment variable VLLM_USE_DEEP_GEMM_E8M0_HOPPER and update the logic to use the generic VLLM_USE_DEEP_GEMM_E8M0 for both. My review includes suggestions to improve code maintainability by updating an outdated comment and removing a redundant conditional check.
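As a rough sketch of what the unified gating described above could look like (the helper name, signature, and default value here are illustrative, not vLLM's actual code; only the `VLLM_USE_DEEP_GEMM_E8M0` variable name comes from the PR):

```python
import os

def should_use_e8m0(is_hopper: bool, is_blackwell: bool) -> bool:
    """Illustrative sketch: one env var gates E8M0 on both architectures.

    Before this PR, Hopper was gated by a separate
    VLLM_USE_DEEP_GEMM_E8M0_HOPPER variable; after it, the generic
    VLLM_USE_DEEP_GEMM_E8M0 flag covers both Hopper and Blackwell.
    """
    # Default to enabled, matching the PR's new behavior on Hopper.
    flag = os.environ.get("VLLM_USE_DEEP_GEMM_E8M0", "1") == "1"
    return flag and (is_hopper or is_blackwell)

print(should_use_e8m0(is_hopper=True, is_blackwell=False))   # True (when unset)
print(should_use_e8m0(is_hopper=False, is_blackwell=False))  # False
```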
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
How does this work on Hopper? If I'm not mistaken, https://github.com/deepseek-ai/DeepGEMM/blob/239112cb4cd4e52587c662624aee6beda8bd9518/csrc/apis/layout.hpp#L22 and https://github.com/deepseek-ai/DeepGEMM/blob/239112cb4cd4e52587c662624aee6beda8bd9518/csrc/apis/layout.hpp#L32 disable the ue8m0 layout on Hopper regardless of the flag.
What does this flag actually do on Hopper? Looking through the DeepGEMM code at a glance, it seems like E8M0 enabled/disabled doesn't change any behaviour. Could you help me understand what this flag controls and how it leads to the speedup you measured?
I think …
E8M0 is false by default, which means if the model doesn't have … The conversion can be found …
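For intuition on the conversion mentioned above: E8M0 (UE8M0) is an exponent-only 8-bit format, so the only representable scale values are powers of two. A minimal sketch of rounding an fp32 quantization scale up to that grid (illustrative only; the function name is invented here, and the real conversion happens inside the DeepGEMM/vLLM kernels):

```python
import math

def round_scale_to_e8m0(scale: float) -> float:
    """Round a positive fp32 scale up to the nearest power of two,
    i.e. a value representable in the exponent-only E8M0 format.
    Rounding up keeps the quantized values from overflowing."""
    assert scale > 0
    return 2.0 ** math.ceil(math.log2(scale))

print(round_scale_to_e8m0(0.3))  # 0.5
print(round_scale_to_e8m0(1.0))  # 1.0
print(round_scale_to_e8m0(3.7))  # 4.0
```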
Makes sense, thanks for the insight!
@youkaichao CC
To clarify, there are two separate things: activation e8m0 and weight e8m0. On Hopper, we should only use activation e8m0 when the model config says …
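The distinction above could be sketched as follows. Note this is purely illustrative: the `weight_scale_fmt` key is a hypothetical stand-in for whatever checkpoint field the comment refers to (the actual key name is truncated in the conversation), and the helper is not vLLM's API:

```python
def select_e8m0_modes(model_config: dict, is_hopper: bool) -> tuple[bool, bool]:
    """Hypothetical sketch: weight e8m0 and activation e8m0 are decided
    separately. `weight_scale_fmt` is an invented placeholder key."""
    # Weight e8m0 is a property of the checkpoint, not the GPU.
    weight_e8m0 = model_config.get("weight_scale_fmt") == "ue8m0"
    # On Hopper, only use activation e8m0 when the checkpoint asks for it.
    act_e8m0 = weight_e8m0 if is_hopper else True
    return weight_e8m0, act_e8m0

print(select_e8m0_modes({}, is_hopper=True))                             # (False, False)
print(select_e8m0_modes({"weight_scale_fmt": "ue8m0"}, is_hopper=True))  # (True, True)
```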
@youkaichao why did you merge this then? @yewentao256 showed improvements for …
…ghput improvement (vllm-project#26197) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
Purpose
Test
Unit Test
```
~/vllm-source/tests/kernels/moe$ pytest test_deepgemm.py
collected 12 items
test_deepgemm.py ............                                            [100%]
====================== 12 passed in 96.91s (0:01:36) ======================

~/vllm-source/tests/kernels/moe$ pytest test_batched_deepgemm.py
collected 32 items
test_batched_deepgemm.py ................................                [100%]
====================== 32 passed in 79.44s (0:01:19) ======================
```

Acc
```
lm_eval --model vllm --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True" --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
```

Perf
```
vllm bench throughput --model Qwen/Qwen3-30B-A3B-FP8 --load-format dummy --input-len 1000 --output-len 100 --trust_remote_code --enable-expert-parallel
```