[Perf] DeepGEMM fused layout kernel for activations, 4.3% throughput improvement, 10.7% TTFT improvement. #29546
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request introduces a fused CUDA kernel for activation quantization and scale packing, targeting performance improvements with DeepGEMM. The changes are well-motivated and backed by performance data showing significant gains. My review focuses on the correctness and maintainability of the new CUDA kernel and its Python integration. I've identified two high-severity issues: one related to obscure and fragile bit-packing logic in the CUDA kernel that should be refactored for clarity and robustness, and another in the Python wrapper which fails to use a pre-allocated output buffer, leading to unnecessary memory allocations.
💡 Codex Review
vllm/vllm/model_executor/layers/quantization/utils/fp8_utils.py
Lines 282 to 283 in deecd2a
The DeepGEMM linear path now quantizes activations with use_ue8m0=True, producing UE8M0-packed int32 scales, but the subsequent fp8_gemm_nt_op call still forwards self.use_deep_gemm_e8m0. When VLLM_USE_DEEP_GEMM_E8M0 is false (the default on supported GPUs), this flag is false, so DeepGEMM will interpret the scale buffer as its non-E8M0 float format while it actually contains packed exponents, leading to incorrect matmul results whenever DeepGEMM is used without E8M0 enabled.
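The failure mode Codex describes can be simulated in miniature (simplified, hypothetical helpers — not the actual vLLM or DeepGEMM code): a buffer of packed UE8M0 exponents is silently misread when the consumer is told to expect raw floats.

```python
import struct

def pack_ue8m0(scale_exp: int) -> bytes:
    # Four identical biased exponents (bias 127), one byte each, packed
    # into the 4 bytes of what DeepGEMM would treat as one int32 scale word.
    return bytes([scale_exp + 127] * 4)

def read_as_exponents(buf: bytes) -> list[int]:
    # E8M0-aware consumer: unpack four biased exponents.
    return [b - 127 for b in buf]

def read_as_float(buf: bytes) -> float:
    # Non-E8M0 consumer: reinterprets the same bytes as a float32 scale.
    return struct.unpack("<f", buf)[0]

buf = pack_ue8m0(0)            # four scales of 2**0 == 1.0
print(read_as_exponents(buf))  # [0, 0, 0, 0] — correct interpretation
print(read_as_float(buf))      # a nonsense scale when misread as float32
```

This is why the quantization flag and the flag forwarded to `fp8_gemm_nt_op` must agree: the bytes are valid either way, so nothing fails loudly — the matmul just uses wrong scales.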
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: yewentao256 <zhyanwentao@126.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Signed-off-by: yewentao256 <zhyanwentao@126.com>
CC @youkaichao
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: yewentao256 <zhyanwentao@126.com>
youkaichao
left a comment
Thanks for the great work!
For context, per offline discussion with @LyricZhao, DeepGEMM requires this packed scaling factor for optimal performance.
DeepSeek 3.2 is broken on HEAD, and it looks like it's related to this PR:
I also encountered this when running DeepSeek 3.2 with H200.
Purpose
Fuse the layout transform with the per-token-group quantization to gain performance.
Namely, pack the scales into a uint32 earlier and remove an additional kernel call.
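The packing idea can be sketched in a few lines (a stdlib-only sketch of the data layout, not the CUDA kernel; the round-to-power-of-two rounding direction and little-endian byte order are assumptions here). UE8M0 keeps only an 8-bit biased exponent per group scale, so four consecutive scales fit in one uint32, which the fused kernel can emit directly during quantization:

```python
import math

def ue8m0_exponent(scale: float) -> int:
    # UE8M0: unsigned, 8 exponent bits, 0 mantissa bits. Round the scale up
    # to a power of two and keep the biased exponent (bias 127), clamped
    # to the representable range [0, 255].
    e = math.ceil(math.log2(scale))
    return max(0, min(255, e + 127))

def pack4(scales: list[float]) -> int:
    # Pack four consecutive group-scale exponents into one uint32,
    # lowest group in the least-significant byte.
    assert len(scales) == 4
    word = 0
    for i, s in enumerate(scales):
        word |= ue8m0_exponent(s) << (8 * i)
    return word

print(hex(pack4([1.0, 1.0, 1.0, 1.0])))  # 0x7f7f7f7f (four biased exponents of 127)
```

Doing this inside the quantization kernel is what removes the separate layout-transform launch: the scales never exist in an unpacked intermediate format.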
Test
```
vllm serve deepseek-ai/DeepSeek-V3.1 -tp 8 --enable-expert-parallel --port 9256 --enforce_eager
```
Acc
```
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-V3.1,num_concurrent=1024" --tasks gsm8k
```
Perf
```
vllm bench serve --model deepseek-ai/DeepSeek-V3.1 --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 256 --request-rate inf --num-prompts 1024
```