awesome, would you mind adding a compile flag to save compile time when FP8 is not needed? Thanks
Of course, already done.
Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.
What about the granularity of PerPageBlock? I can easily adapt it.
We think PerPageBlock is not fine-grained enough either: kv_rope (the 64 RoPE dims) needs to be bf16.
Got it! |
How about quantizing Qnope and Knope to 8 bits, while Qrope and Krope keep 16-bit data types?
|
It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.
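For concreteness, here is a minimal pure-Python sketch of what per-(1 token × 128 channel) symmetric scaling amounts to, assuming e4m3 FP8 (largest finite value 448). The function name and layout are illustrative, not symbols from the FlashMLA codebase:

```python
FP8_E4M3_MAX = 448.0  # largest finite e4m3 value (assumed FP8 format)

def quantize_per_token_group(x, group_size=128):
    """x: list of rows (tokens); each row is split into groups of
    `group_size` channels, and each group shares one symmetric scale."""
    scales, quantized = [], []
    for row in x:
        row_scales, row_q = [], []
        for g in range(0, len(row), group_size):
            group = row[g:g + group_size]
            amax = max(abs(v) for v in group) or 1.0  # avoid div-by-zero
            scale = amax / FP8_E4M3_MAX
            row_scales.append(scale)
            # a real kernel would cast the rescaled values to e4m3 here
            row_q.extend(v / scale for v in group)
        scales.append(row_scales)
        quantized.append(row_q)
    return quantized, scales
```

The point of the finer granularity is that a single outlier channel only inflates the scale of its own 128-channel group for that one token, instead of the whole tensor.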
The outliers in the RoPE cache are also discussed in this paper: https://arxiv.org/pdf/2502.01563. Can we add a Hadamard transform right after RoPE to distribute outliers across multiple head dims? (https://arxiv.org/abs/2404.00456)
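To sketch the idea behind that suggestion: an orthonormal fast Walsh–Hadamard transform spreads a single large value evenly across all dims, lowering the amax that quantization must cover, and applying it twice recovers the input. This is a pure-Python illustration of the transform itself, not the proposed kernel change:

```python
def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    v = list(x)
    n = len(v)
    h = 1
    while h < n:
        # butterfly step: combine pairs that are h apart
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    scale = n ** -0.5  # orthonormal scaling => the transform is its own inverse
    return [val * scale for val in v]
```

For example, an outlier vector [8, 0, ..., 0] of length 8 becomes eight equal entries of 8/√8 ≈ 2.83, cutting the peak magnitude that an FP8 scale must absorb.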
What precision should S×V be: BF16×BF16, BF16×FP8, or FP8×FP8 with per-(1 token × 128 channel) scaling?
@beginlner Hello there, on which workload did you observe these accuracy issues?
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Hi @endurehero, thank you very much for sharing this impressive FP8 implementation and for the detailed performance numbers! I have two quick questions that would help me understand the design better:
Thanks again for your time and for open-sourcing this work!
* Add files from deepseek-ai#54
* FP8 now extends base implementation
* Fix typo
* Update tests
* Add to build
* Fix installation
* Fix FLASH_MLA_DISABLE_FP8 flag
* Fix param matchup
* typo
* Fix out dtype
* Fix IMA
* Extension name should be _flashmla_C
* Clean up
* Tighten FP8 error tolerance
* Add attribution to copied files
* Remove breakpoint
* Port cudagraph fix from #3

Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com>
Functionality
Support FP8 WGMMA based on the async pipeline design of FlashMLA. The TransV part draws on the SmemTranspose64x64 implementation in FA3.
Currently, Q/K/V support only symmetric per-tensor quantization. Since the maximum value of P never exceeds 1, f32tofp8_cast is applied directly for its quantization, with no scale.
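A hedged Python sketch of what symmetric per-tensor quantization amounts to, and why P can skip the scale entirely. FP8_E4M3_MAX and the helper names are assumptions for illustration, not the kernel's actual symbols:

```python
FP8_E4M3_MAX = 448.0  # largest finite e4m3 value (assumed FP8 format)

def per_tensor_scale(tensor):
    """One symmetric scale for the whole tensor: amax / fp8_max."""
    amax = max(abs(v) for v in tensor) or 1.0  # avoid div-by-zero
    return amax / FP8_E4M3_MAX

def quantize(tensor, scale):
    # a real kernel would cast the rescaled values to e4m3 here
    return [v / scale for v in tensor]

# P is the softmax output, so 0 <= p <= 1 <= FP8_E4M3_MAX: every value is
# already representable in range, and a direct f32 -> fp8 cast needs no scale.
```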
Performance
On the H20, MLA typically exhibits high arithmetic intensity and is compute-bound. Consequently, FLOPs utilization (MFU) is employed as the performance metric.

On the H800, MLA is typically memory-bound. Consequently, the Memory Bandwidth Utilization (MBU) metric is adopted to evaluate the performance of the kernel. There is still plenty of room for optimization on the H800; we look forward to working on it together.
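Both metrics reduce to simple ratios against the device peak; a minimal sketch, with made-up peak numbers (the actual H20/H800 peaks are not stated in this PR):

```python
def mfu(achieved_flops, seconds, peak_flops_per_s):
    """Fraction of peak FLOP/s achieved (compute-bound regime, e.g. H20)."""
    return achieved_flops / seconds / peak_flops_per_s

def mbu(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of peak memory bandwidth achieved (memory-bound regime, e.g. H800)."""
    return bytes_moved / seconds / peak_bytes_per_s

# Illustrative numbers only: a kernel doing 5e14 FLOPs in 1 s on a
# hypothetical 1e15 FLOP/s device runs at 50% MFU.
```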

Reproduction