Enhanced FP8 Support with Significant Performance Optimizations #122
justinSmileDate wants to merge 3 commits into deepseek-ai:main
Conversation
Added a namespace 'sm90' for better code organization.
Are any sglang updates needed for testing?
Excellent work! I'm a member of the LLM Inference Team at AntGroup Theta, and I also serve as the maintainer for this AntGroup-H20-PR. We are interested in establishing a long-term connection. Would you be comfortable sharing your contact information (such as WeChat) by sending it to my email? My email is: moyun.zty@antgroup.com
Looking forward to this PR being merged.
@justinSmileDate Does sglang need any adaptations for this PR? Thanks!
Apologies for the delayed response; work has been busy. I have sent the relevant information to your email address; please check your inbox.
Actually, only the FlashMLA repository needs to be replaced; there is no need to adjust sglang. The version of sglang I'm using is v0.5.3.
No, there is no need to modify sglang. The version of sglang I'm using is v0.5.3. |
Description
This PR ports and enhances FP8 support from #82 to the latest branch, delivering substantial performance improvements through optimized computational patterns and memory access strategies.
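For readers unfamiliar with the FP8 path this PR targets: the `fp8_e4m3` KV-cache dtype (see the Usage section) stores values in an 8-bit format whose largest finite magnitude is 448, so activations are typically rescaled before the cast. The sketch below shows only that per-tensor scaling step in plain Python; the actual bit-level cast happens on the GPU, and the exact scheme used by FlashMLA/sglang (e.g. per-channel scales) may differ.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def e4m3_scale(values):
    """Per-tensor scaling so the largest magnitude fits the E4M3 range.

    Illustrative only: returns the scaled values and the scale factor;
    the rounding to actual 8-bit E4M3 codes is omitted here.
    """
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    scaled = [v / scale for v in values]
    return scaled, scale
```

Dequantization is then just multiplying the stored values by `scale` at read time.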
Key Changes
Compared to Original FlashMLA:
- `retrieve_rP_for_SP(sQ(8))` operations

Improvements over PR #82:
- `ceil_div` calculations

Performance Benchmarks
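(A quick aside before the numbers: the `ceil_div` calculations called out under Key Changes refer to ceiling division, the standard way to size CUDA grids so every element is covered. The helper below is a generic Python rendering of that idiom, not necessarily the exact form used in this PR.)

```python
def ceil_div(a: int, b: int) -> int:
    """Smallest integer q such that q * b >= a, for positive b.

    Commonly used to compute grid/block counts in CUDA launch
    configurations, e.g. ceil_div(1000, 128) -> 8 blocks of 128
    threads to cover 1000 elements.
    """
    return (a + b - 1) // b
```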
Test Configuration:
Key Performance Highlights:
Representative Performance Comparison:
Configuration: b=128, s_q=1, mean_sk=4096, h_q=16
Configuration: b=128, s_q=2, mean_sk=4096, h_q=32
Scalability Analysis:
Across Head Counts (mean_sk=4096, s_q=1):
Across Sequence Lengths (h_q=16, s_q=1):
Performance Improvements Summary
vs Original FlashMLA (bfloat16):
vs PR #82 (FP8):
Technical Advantages
Computational Efficiency:
Memory Optimization:
Scalability:
Testing & Validation
Usage
```
export ENABLE_SWAPAB=1
python3 -m sglang.launch_server XXX --quantization fp8 --kv-cache-dtype fp8_e4m3
```

This PR delivers substantial performance improvements while maintaining full numerical correctness and compatibility with existing FP8 functionality.
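The `ENABLE_SWAPAB=1` switch in the usage line is an environment-variable gate. As a rough illustration of how such a toggle is typically read at runtime (this helper is hypothetical and not the actual FlashMLA code):

```python
import os


def swapab_enabled() -> bool:
    """Return True when the ENABLE_SWAPAB feature flag is set to "1".

    Hypothetical illustration of an env-var kernel-path gate; the real
    check inside FlashMLA/sglang may accept other truthy values or be
    read on the C++ side instead.
    """
    return os.environ.get("ENABLE_SWAPAB") == "1"
```

With a gate like this, the optimized SwapAB path can be enabled per-deployment without a code change, and unsetting the variable falls back to the default path.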