[ROCM] Add support with Infinity Cache (LLC) awareness for improved performance #2147
tianwyan wants to merge 6 commits into Dao-AILab:main
Conversation
@tridao Would you be interested in reviewing this for potential upstream to Dao-AILab/flash-attention? This adds RDNA3/gfx1100 support with ~2x performance improvement via AMD LLC awareness.
Cc @rocking5566: are there folks who can review this PR?
@tridao |
As discussed with @micmelesse, I'll rebase my current PR to coordinate with his incoming one. :)
The PR I mentioned is up here: #2178.
RDNA4 (gfx1200 in my case) support would be great. I made some local changes along these lines, and I think the autotune configs are also suitable for RDNA4. Edit: furthermore, a second script confirms the improved performance on gfx1200 thanks to LLC awareness. Therefore, I think RDNA4 support as a whole could be added to this PR.
Thanks for the information! A new PR with LLC-aware head grouping will be created soon, rebased on #2178.
The rebased PR is #2217. @micmelesse @tridao
This PR is going to be closed; please go to #2217.
Motivation
This PR enables Flash Attention Triton support for AMD RDNA3 (Navi) GPUs, specifically targeting the gfx1100 architecture. The goal is to bring Flash Attention performance optimizations to consumer-grade AMD GPUs while leveraging the unique Infinity Cache (LLC) architecture for improved memory throughput.
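To illustrate the kind of LLC awareness the Motivation describes, here is a minimal, hypothetical sketch of how a kernel scheduler might pick a head-group size so that the K/V tiles shared by a group of workgroups stay resident in RDNA3's Infinity Cache. The function names and the 96 MiB cache-size constant are assumptions for illustration (Navi 31 / gfx1100), not the PR's actual implementation:

```python
# Hypothetical sketch: size attention head groups to the Infinity Cache (LLC).
# The 96 MiB figure is the assumed gfx1100 (Navi 31) Infinity Cache capacity;
# function names here are illustrative, not taken from the PR.

LLC_BYTES = 96 * 1024 * 1024  # assumed gfx1100 Infinity Cache size


def kv_tile_bytes(seqlen_k: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to keep one head's K and V tensors cached (fp16 default)."""
    return 2 * seqlen_k * head_dim * dtype_bytes  # factor 2: K plus V


def llc_aware_head_group(seqlen_k: int, head_dim: int, num_heads: int,
                         dtype_bytes: int = 2) -> int:
    """Largest number of heads whose K/V fit in the LLC at once.

    Scheduling all query blocks of such a group before moving to the next
    group lets repeated K/V reads hit the Infinity Cache instead of HBM,
    which is the memory-throughput win the PR description refers to.
    """
    per_head = kv_tile_bytes(seqlen_k, head_dim, dtype_bytes)
    group = max(1, LLC_BYTES // per_head)
    return min(group, num_heads)


# Example: an 8K context with 128-dim fp16 heads needs 4 MiB of K/V per head,
# so up to 24 heads can be resident in a 96 MiB LLC simultaneously.
print(llc_aware_head_group(seqlen_k=8192, head_dim=128, num_heads=32))  # -> 24
```

In a real Triton launch, a group size computed this way would feed the grid-ordering logic (e.g. iterating query blocks within a head group before advancing to the next group), so the sketch only covers the capacity arithmetic, not the kernel itself.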
Technical Details
New Architecture Support:
Performance Optimizations:
Code Cleanup:
Test Plan
Test Result