Deepseek r1 fp8matmul #977
Conversation
Force-pushed from cd2dcad to 8745f07
vllm/worker/hpu_model_runner.py (Outdated)
Are those env vars necessary? Why can't VLLM_PT_PROFILE be used for profiling?
Removed the profiling from this PR. VLLM_PT_PROFILE produces dummy input, which makes profiling faster but cannot fully exercise MoE.
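For context, a minimal sketch of why dummy profiling input under-exercises MoE (plain PyTorch, not vLLM code; all sizes and names below are made up): with constant activations, the top-k router selects the same experts for every token, so most experts never run and the profile is not representative of real decode traffic.

```python
import torch

# Toy MoE router, only to illustrate the point above (sizes are hypothetical).
num_experts, top_k, hidden = 16, 2, 64
router = torch.nn.Linear(hidden, num_experts, bias=False)

def experts_hit(x: torch.Tensor) -> int:
    # Count how many distinct experts the top-k routing actually selects.
    topk_idx = router(x).topk(top_k, dim=-1).indices
    return topk_idx.unique().numel()

dummy_tokens = torch.zeros(224, hidden)   # profiler-style dummy input
real_tokens = torch.randn(224, hidden)    # stand-in for real activations

print("experts hit (dummy):", experts_hit(dummy_tokens))  # typically only top_k
print("experts hit (real): ", experts_hit(real_tokens))   # typically close to num_experts
```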
Force-pushed from 8745f07 to d2164b9
Force-pushed from 9d7d67b to d78fee7
Decode latency improved ~1.5x.
This feature is disabled by default.
When enabled with the original static fp8 path, it removes the dequant/quant round trip between the kv_cache and matmul_qk (sketched below).
When combined with INC static fp8, it can additionally enable the fp8 pipelined PA, since we leverage INC to provide the batch2block_matmul scaling; MoE also gets faster thanks to the scalar MoE from INC.
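To make the dequant/quant removal concrete, here is a conceptual sketch in plain PyTorch (not the HPU kernels from this PR; the scale names and the matmul_qk framing are assumptions): with static per-tensor scales, Q@K^T can consume the fp8 values directly and the two scales can be folded into a single post-matmul scalar, skipping the dequant/quant steps around the KV cache.

```python
import torch

def qk_with_dequant(q_fp8: torch.Tensor, k_fp8: torch.Tensor,
                    q_scale: float, k_scale: float) -> torch.Tensor:
    # Baseline path: dequantize both operands to bf16, then matmul in bf16.
    q = q_fp8.to(torch.bfloat16) * q_scale
    k = k_fp8.to(torch.bfloat16) * k_scale
    return q @ k.transpose(-1, -2)

def qk_fp8_direct(q_fp8: torch.Tensor, k_fp8: torch.Tensor,
                  q_scale: float, k_scale: float) -> torch.Tensor:
    # fp8-matmul path: because the scales are static per-tensor scalars,
    # (q * s_q) @ (k * s_k)^T == (q @ k^T) * (s_q * s_k), so the hardware
    # matmul can take fp8 inputs and apply one scalar afterwards.
    # The bf16 casts here only emulate the matmul's higher-precision
    # accumulation; a real fp8 kernel keeps the operands in fp8.
    scores = q_fp8.to(torch.bfloat16) @ k_fp8.to(torch.bfloat16).transpose(-1, -2)
    return scores * (q_scale * k_scale)
```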
**with bs224_blocks7296** (batch size 224, 7296 blocks)

fp8 matmul is