
Conversation


@xuechendi xuechendi commented Mar 27, 2025

Decode latency improved by ~1.5x.

This feature is disabled by default.

When enabled with the original static FP8 path, this feature removes the dequant/quant round trip between the kv_cache and matmul_qk.
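The math behind removing that round trip can be sketched as follows. This is an illustration we added (variable names are ours, not from the PR): for a per-tensor dequant scale, dequantizing the KV cache before the matmul is equivalent to running the matmul on the quantized values and folding the scale into the output, so the intermediate dequant/quant steps can be dropped.

```python
import numpy as np

# Hypothetical sketch of the scaling identity that lets the dequant step
# be fused into the matmul: q @ (k * s) == (q @ k) * s for a scalar scale s.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8)).astype(np.float32)      # query block
k_fp8 = rng.standard_normal((8, 8)).astype(np.float32)  # stand-in for FP8 KV values
k_scale = 0.125                                         # per-tensor dequant scale

# Original path: dequantize the KV cache, then matmul in higher precision.
scores_dequant_first = q @ (k_fp8 * k_scale)

# Fused path (conceptually what VLLM_USE_FP8_MATMUL enables): matmul on the
# quantized values directly, applying the scale to the output.
scores_scale_after = (q @ k_fp8) * k_scale

assert np.allclose(scores_dequant_first, scores_scale_after)
```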

```shell
VLLM_USE_FP8_MATMUL=true \
```

When working with INC static FP8, this feature can additionally enable the FP8-pipeline PagedAttention, since we leverage INC to provide the batch2block_matmul scaling. MoE also gets faster thanks to the scalar MoE kernels from INC.

```shell
QUANT_CONFIG=scripts/inc_quant_with_fp8kv_config.json \
VLLM_REQUANT_FP8_INC=1 \
VLLM_ENABLE_RUNTIME_DEQUANT=1 \
VLLM_USE_FP8_MATMUL=true \
```
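Putting the variables together, a launch sketch (the env vars are from this PR; the entrypoint and model path below are placeholders we chose, not part of the PR — consult the fork's docs for the exact command):

```shell
# Sketch only: env vars are from the PR description; the command line
# and model path are placeholders for illustration.
QUANT_CONFIG=scripts/inc_quant_with_fp8kv_config.json \
VLLM_REQUANT_FP8_INC=1 \
VLLM_ENABLE_RUNTIME_DEQUANT=1 \
VLLM_USE_FP8_MATMUL=true \
python -m vllm.entrypoints.openai.api_server \
    --model <path-to-deepseek-r1-checkpoint>
```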

**With bs224_blocks7296**, fp8 matmul results: [benchmark screenshots not captured in this export]

@xuechendi xuechendi force-pushed the deepseek_r1_fp8matmul branch 3 times, most recently from cd2dcad to 8745f07 Compare March 27, 2025 21:27
@michalkuligowski michalkuligowski marked this pull request as draft March 28, 2025 08:35


Are those env vars necessary? Why can't VLLM_PT_PROFILE be used for profiling?

@xuechendi (Author)

Removed profiling from this PR. VLLM_PT_PROFILE produces dummy inputs that cannot fully exercise the MoE path, which is also why it makes profiling faster.

@xuechendi xuechendi force-pushed the deepseek_r1_fp8matmul branch from 8745f07 to d2164b9 Compare April 1, 2025 02:14
@xuechendi xuechendi force-pushed the deepseek_r1_fp8matmul branch from 9d7d67b to d78fee7 Compare April 2, 2025 03:52
@xuechendi xuechendi marked this pull request as ready for review April 2, 2025 03:52
@xuechendi xuechendi merged commit 109ac5d into HabanaAI:deepseek_r1 Apr 2, 2025
10 of 63 checks passed


2 participants