
fp8 support #54

Closed
endurehero wants to merge 31 commits into deepseek-ai:main from endurehero:will_fp8_mr

Conversation

@endurehero endurehero commented Feb 28, 2025

Functionality

Support FP8 WGMMA based on the async pipeline design of FlashMLA. The TransV part draws on the implementation of SmemTranspose64x64 in FA3.
Currently, Q/K/V support only symmetric per-tensor quantization. Since the maximum value of P never exceeds 1, a direct f32-to-fp8 cast is used to quantize it.
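For readers unfamiliar with the scheme, here is a minimal pure-Python sketch of symmetric per-tensor quantization into the e4m3 range (function names are illustrative, not taken from this PR; a real kernel would additionally round the scaled values to actual fp8):

```python
E4M3_MAX = 448.0  # largest finite value representable by float8 e4m3fn

def quantize_per_tensor(x):
    """Symmetric per-tensor quantization: one scale for the whole tensor.

    Only the scale/clamp step (where accuracy is decided) is modeled here;
    a real kernel would then cast the scaled values to fp8.
    """
    amax = max(abs(v) for v in x)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    q = [max(-E4M3_MAX, min(E4M3_MAX, v / scale)) for v in x]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# P (the softmax output) never exceeds 1, so it already fits the fp8 range
# with scale = 1.0; that is why a direct f32->fp8 cast suffices for P.
```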

Performance

CUDA driver version: 535.183.06
NVCC version: 12.8
PyTorch version: 2.6

On the H20, MLA kernels typically exhibit high arithmetic intensity and are compute-bound. Consequently, Model FLOPs Utilization (MFU) is employed as the performance metric.
(benchmark chart: MFU on H20; image not reproduced)

On the H800, MLA is typically memory-bound. Consequently, the Memory Bandwidth Utilization (MBU) metric is adopted to evaluate the kernel. There is still plenty of room for optimization on the H800; we look forward to working on it together.
(benchmark chart: MBU on H800; image not reproduced)
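For context, the two metrics used above reduce to simple ratios of achieved over peak throughput (generic sketch; the numbers in the example are illustrative, not measurements from this PR):

```python
def mfu(achieved_flops_per_s, peak_flops_per_s):
    """Model FLOPs Utilization: fraction of peak compute actually achieved."""
    return achieved_flops_per_s / peak_flops_per_s

def mbu(achieved_bytes_per_s, peak_bytes_per_s):
    """Memory Bandwidth Utilization: fraction of peak DRAM bandwidth achieved."""
    return achieved_bytes_per_s / peak_bytes_per_s

# Illustrative only: a kernel streaming 2.4 TB/s on a GPU whose peak
# bandwidth is 3.0 TB/s runs at 80% MBU.
ratio = mbu(2.4e12, 3.0e12)
```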

Reproduction

python3 ./tests/test_flash_mla.py --dtype e4m3

@endurehero endurehero closed this Feb 28, 2025
@endurehero endurehero changed the title support fp8 fp8 support Feb 28, 2025
@endurehero endurehero reopened this Feb 28, 2025
@endurehero endurehero mentioned this pull request Feb 28, 2025
sijiac (Contributor) commented Mar 1, 2025

Awesome! Would you mind adding a compile flag to skip building FP8 when it is not needed? Thanks.

endurehero (Author) commented Mar 1, 2025

> Awesome! Would you mind adding a compile flag to skip building FP8 when it is not needed? Thanks.

Of course. Already done.

beginlner (Collaborator) commented Mar 1, 2025

Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.

endurehero (Author) commented:

> Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.

What about the granularity of PerPageBlock? I can easily adapt it.

beginlner (Collaborator) commented Mar 1, 2025

> What about the granularity of PerPageBlock? I can easily adapt it.

We think PerPageBlock is not fine-grained enough either; kv_rope (64 channels) needs to stay bf16.

endurehero (Author) commented:

> What about the granularity of PerPageBlock? I can easily adapt it.

> We think PerPageBlock is not fine-grained enough either; kv_rope (64 channels) needs to stay bf16.

Got it!

@beginlner beginlner closed this Mar 11, 2025
moses3017 commented:

> What about the granularity of PerPageBlock? I can easily adapt it.

> We think PerPageBlock is not fine-grained enough either; kv_rope (64 channels) needs to stay bf16.

How about Qnope and Knope using 8-bit quantization, while Qrope and Krope maintain 16-bit data types?

beginlner (Collaborator) commented May 21, 2025

It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.
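To make that granularity concrete, here is a hypothetical pure-Python sketch of per-(1 token × 128 channel) symmetric quantization; Qnope/Knope have a head dim of 128, so each token gets one scale per 128-channel group (names and structure are illustrative, not from FlashMLA):

```python
E4M3_MAX = 448.0  # finite max of float8 e4m3fn

def quantize_per_token_group(tokens, group=128):
    """Symmetric quantization with one scale per (token, 128-channel group).

    `tokens` is a list of rows (one per token); each row is split into
    contiguous groups of `group` channels, and every (token, group) pair
    gets its own scale. Only the scale/clamp step is modeled here.
    """
    q_rows, scale_rows = [], []
    for row in tokens:
        q, scales = [], []
        for i in range(0, len(row), group):
            blk = row[i:i + group]
            amax = max(abs(v) for v in blk)
            s = amax / E4M3_MAX if amax > 0 else 1.0
            scales.append(s)
            q.extend(max(-E4M3_MAX, min(E4M3_MAX, v / s)) for v in blk)
        q_rows.append(q)
        scale_rows.append(scales)
    return q_rows, scale_rows
```

Under this scheme the 64 rope channels would simply be left out of the grouping and kept in bf16, as suggested above.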

shinezyy commented May 21, 2025

> It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.

The outliers in the RoPE cache are also discussed in this paper: https://arxiv.org/pdf/2502.01563

Can we add a Hadamard transform right after RoPE to distribute the outliers across multiple head dims? (https://arxiv.org/abs/2404.00456)
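The suggestion above can be sketched with a fast Walsh-Hadamard transform. Since H/√n is orthogonal, rotating both Q and K by it leaves Q·Kᵀ unchanged while spreading a single outlier channel evenly across the head dimension (illustrative sketch, not part of this PR):

```python
import math

def fwht(x):
    """In-place fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def hadamard_rotate(x):
    """Orthonormal Hadamard rotation (self-inverse): H/sqrt(n) applied to x."""
    n = len(x)
    y = fwht(list(x))
    return [v / math.sqrt(n) for v in y]

# One large outlier in an 8-dim vector becomes uniform magnitude 8/sqrt(8)
# in every dim after rotation, which is far friendlier to low-bit formats.
spiky = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
flat = hadamard_rotate(spiky)
```

Because the rotation is self-inverse, applying it again recovers the original vector, so it can be fused into the projections on both sides of the dot product.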

TheTinyTeddy commented Jun 3, 2025

> It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.

What precision should S×V use: BF16×BF16, BF16×FP8, or FP8×FP8 with per-(1 token × 128 channel) scales?

hypdeb commented Jun 29, 2025

> Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.

@beginlner Hello there, on which workload did you observe these accuracy issues?

MatthewBonanni added a commit to MatthewBonanni/FlashMLA that referenced this pull request Aug 6, 2025
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
MatthewBonanni added a commit to MatthewBonanni/FlashMLA that referenced this pull request Aug 7, 2025
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
MatthewBonanni added a commit to MatthewBonanni/FlashMLA that referenced this pull request Aug 11, 2025
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
MicroZHY commented:

Hi @endurehero,

Thank you very much for sharing this impressive FP8 implementation and for the detailed performance numbers!

I have two quick questions that would help me understand the design better:

  1. Branch difference
    Could you kindly clarify the relationship between will_fp8_mr and the earlier will_fp8 branch? I noticed both names appear in the commit history, and I’d love to know what motivated the new branch and which improvements or fixes it contains.

  2. Necessity of TransV
    I see that the new FP8 path relies on the TransV routine, which borrows the 64×64 shared-memory transpose from FlashAttention-3. Would it be possible to briefly explain why TransV is indispensable for FP8 correctness or performance? I’m curious whether the same result could be achieved with a different layout or if this is a hard requirement for the WGMMA pipeline.

Thanks again for your time and for open-sourcing this work!

@MicroZHY MicroZHY mentioned this pull request Aug 14, 2025
LucasWilkinson pushed a commit to vllm-project/FlashMLA that referenced this pull request Aug 18, 2025
* Add files from deepseek-ai#54
* FP8 now extends base implementation
* Fix typo
* Update tests
* Add to build
* Fix installation
* Fix FLASH_MLA_DISABLE_FP8 flag
* Fix param matchup
* typo
* Fix out dtype
* Fix IMA
* Extension name should be _flashmla_C
* Clean up
* Tighten FP8 error tolerance
* Add attribution to copied files
* Remove breakpoint
* Port cudagraph fix from #3

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
