
Conversation


@xuechendi xuechendi commented Mar 27, 2025

Decode latency improved by ~1.5x.

This feature is disabled by default.

When enabled with the original static FP8 path, this feature removes the dequant/quant round trip between the kv_cache and matmul_qk.
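The math behind removing that round trip can be sketched as follows. This is an illustration we added (variable names are ours, not from the PR): for a per-tensor dequant scale, dequantizing the KV cache before the matmul is equivalent to running the matmul on the quantized values and folding the scale into the output, so the intermediate dequant/quant steps can be dropped.

```python
import numpy as np

# Hypothetical sketch of the scaling identity that lets the dequant step
# be fused into the matmul: q @ (k * s) == (q @ k) * s for a scalar scale s.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8)).astype(np.float32)      # query block
k_fp8 = rng.standard_normal((8, 8)).astype(np.float32)  # stand-in for FP8 KV values
k_scale = 0.125                                         # per-tensor dequant scale

# Original path: dequantize the KV cache, then matmul in higher precision.
scores_dequant_first = q @ (k_fp8 * k_scale)

# Fused path (conceptually what VLLM_USE_FP8_MATMUL enables): matmul on the
# quantized values directly, applying the scale to the output.
scores_scale_after = (q @ k_fp8) * k_scale

assert np.allclose(scores_dequant_first, scores_scale_after)
```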

```shell
VLLM_USE_FP8_MATMUL=true \
```

When working with INC static FP8, this feature can additionally enable the FP8-pipeline PagedAttention, since we leverage INC to provide the batch2block_matmul scaling. MoE also gets faster thanks to the scalar MoE kernels from INC.

```shell
QUANT_CONFIG=scripts/inc_quant_with_fp8kv_config.json \
VLLM_REQUANT_FP8_INC=1 \
VLLM_ENABLE_RUNTIME_DEQUANT=1 \
VLLM_USE_FP8_MATMUL=true \
```
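Putting the variables together, a launch sketch (the env vars are from this PR; the entrypoint and model path below are placeholders we chose, not part of the PR — consult the fork's docs for the exact command):

```shell
# Sketch only: env vars are from the PR description; the command line
# and model path are placeholders for illustration.
QUANT_CONFIG=scripts/inc_quant_with_fp8kv_config.json \
VLLM_REQUANT_FP8_INC=1 \
VLLM_ENABLE_RUNTIME_DEQUANT=1 \
VLLM_USE_FP8_MATMUL=true \
python -m vllm.entrypoints.openai.api_server \
    --model <path-to-deepseek-r1-checkpoint>
```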

**With bs224_blocks7296**, fp8 matmul results: [benchmark screenshots not captured in this export]

@xuechendi xuechendi force-pushed the deepseek_r1_fp8matmul branch 3 times, most recently from cd2dcad to 8745f07 Compare March 27, 2025 21:27
@michalkuligowski michalkuligowski marked this pull request as draft March 28, 2025 08:35


Are those env vars necessary? Why can't VLLM_PT_PROFILE be used for profiling?

@xuechendi (Author)

Removed profiling from this PR. VLLM_PT_PROFILE produces dummy inputs that cannot fully exercise the MoE path, which is also why it makes profiling faster.

@xuechendi xuechendi force-pushed the deepseek_r1_fp8matmul branch from 8745f07 to d2164b9 Compare April 1, 2025 02:14
@xuechendi xuechendi force-pushed the deepseek_r1_fp8matmul branch from 9d7d67b to d78fee7 Compare April 2, 2025 03:52
@xuechendi xuechendi marked this pull request as ready for review April 2, 2025 03:52
@xuechendi xuechendi merged commit 109ac5d into HabanaAI:deepseek_r1 Apr 2, 2025
10 of 63 checks passed


2 participants