[ROCM] Add support with Infinity Cache (LLC) awareness for performance improvement - [PR#2147 rebased on PR#2178] by tianwyan · Pull Request #2217 · Dao-AILab/flash-attention

tianwyan · 2026-01-29T04:54:20Z

Motivation

This PR enables Flash Attention Triton support for AMD RDNA3 (Navi) GPUs, specifically targeting the gfx1100 architecture. The goal is to bring Flash Attention performance optimizations to consumer-grade AMD GPUs while leveraging the unique Infinity Cache (LLC) architecture for improved memory throughput.

Technical Details

New Architecture Support:

Added gfx1100 (RDNA3/Navi 31) to the supported GPU architectures in the Triton Flash Attention backend

Performance Optimizations:

Implemented Infinity Cache (LLC) awareness to optimize memory access patterns and reduce DRAM bandwidth pressure
Enabled exp2 instruction by default for faster exponential calculations on RDNA3
Added additional Triton autotuning configurations optimized for Navi's wavefront and cache characteristics
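
For readers unfamiliar with the exp2 item: it relies on the identity e^x = 2^(x · log2 e), so a softmax can fold log2(e) into its scale factor and use a base-2 exponential instead. The following is a minimal pure-Python sketch of that identity only, not the Triton kernel code from this PR; the function names are illustrative assumptions.

```python
import math

LOG2_E = math.log2(math.e)  # conversion factor: e**x == 2**(x * LOG2_E)


def softmax_exp(scores):
    """Reference softmax using the natural exponential."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_exp2(scores):
    """Same softmax, but pre-scaled by log2(e) so only 2**x is needed."""
    scaled = [s * LOG2_E for s in scores]
    m = max(scaled)
    weights = [2.0 ** (s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]
```

Both functions should agree to within floating-point rounding; the base-2 form is useful only where the hardware exposes a fast exp2 instruction.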

Code Cleanup:

Renamed "L2 cache" terminology to "Infinity Cache (LLC)" throughout the codebase to accurately reflect AMD's cache hierarchy and avoid confusion with the traditional L2 cache

Test Plan

Functional testing on AMD Radeon RX 7900 XTX (gfx1100)
Verified Flash Attention forward pass correctness against reference implementation
Benchmarked memory bandwidth utilization with and without LLC awareness

Test Result

All existing Triton Flash Attention tests pass on gfx1100
~2-4x performance improvement with LLC-aware implementation on memory-bound attention workloads
LLC awareness significantly reduces DRAM bandwidth pressure by better utilizing the 96MB Infinity Cache on RDNA3

tianwyan · 2026-01-29T04:55:33Z

This PR is #2147 rebased on #2178. @tridao @micmelesse

tianwyan · 2026-01-29T10:58:51Z

LLC size detection will be added in a future PR.

micmelesse · 2026-01-29T13:28:30Z

I will test this and post results soon.

jnolck · 2026-02-01T21:20:26Z

"If you comment on that PR, the most helpful technical detail to add is: "The 7900 XT also identifies as gfx1100 but has 80MB cache. While current tiling strategies seem safe, hardcoding 96MB is unsafe for future aggressive optimizations. I suggest detecting by Device ID or Compute Unit count (84 vs 96) if possible, or defaulting to the safer 80MB for gfx1100.""

Gemmini AI.

Something I noticed while playing around with this on my 7900xt.
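
The Compute Unit suggestion quoted above could look roughly like the sketch below. It is an illustration under stated assumptions, not code from this PR: the table holds only the two figures mentioned in this thread (84 CUs with 80 MB, 96 CUs with 96 MB), and `guess_llc_mb` and both constants are hypothetical names.

```python
from typing import Optional

# Hypothetical lookup built only from the numbers quoted in this thread.
GFX1100_LLC_MB_BY_CU = {84: 80, 96: 96}
# Conservative fallback for gfx1100 parts with an unrecognized CU count.
SAFE_GFX1100_LLC_MB = 80


def guess_llc_mb(arch: str, cu_count: int) -> Optional[int]:
    """Return an Infinity Cache size estimate in MB, or None if arch is unknown."""
    if arch != "gfx1100":
        return None  # this sketch makes no claim about other architectures
    return GFX1100_LLC_MB_BY_CU.get(cu_count, SAFE_GFX1100_LLC_MB)
```

With PyTorch, the CU count could presumably be taken from `torch.cuda.get_device_properties(0).multi_processor_count`, which is expected to report compute units on ROCm builds; treat that as an assumption to verify on your setup.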

0xDELUXA · 2026-02-01T21:23:54Z

#2217 (comment)

jnolck · 2026-02-01T22:44:08Z

#2217 (comment)

Missed that.

tianwyan · 2026-02-02T03:01:30Z

"If you comment on that PR, the most helpful technical detail to add is: "The 7900 XT also identifies as gfx1100 but has 80MB cache. While current tiling strategies seem safe, hardcoding 96MB is unsafe for future aggressive optimizations. I suggest detecting by Device ID or Compute Unit count (84 vs 96) if possible, or defaulting to the safer 80MB for gfx1100.""

Gemmini AI.

Something I noticed while playing around with this on my 7900xt.

Thanks for your comments! 👍 Yes, performance will be affected when the tiled group size exceeds the LLC, but that is still safer than no head grouping, which sends all heads at once.
I am looking for a more effective way to detect the current LLC size with minor overhead.
I appreciate your insight here!

tianwyan · 2026-02-02T06:57:31Z

"If you comment on that PR, the most helpful technical detail to add is: "The 7900 XT also identifies as gfx1100 but has 80MB cache. While current tiling strategies seem safe, hardcoding 96MB is unsafe for future aggressive optimizations. I suggest detecting by Device ID or Compute Unit count (84 vs 96) if possible, or defaulting to the safer 80MB for gfx1100.""
Gemmini AI.
Something I noticed while playing around with this on my 7900xt.

Thanks for your comments! 👍 Yes, performance will be affected when the tiled group size exceeds the LLC, but that is still safer than no head grouping, which sends all heads at once. I am looking for a more effective way to detect the current LLC size with minor overhead. I appreciate your insight here!

BTW, you can use the FLASH_ATTN_LLC_CACHE_MB environment variable to override the LLC size. You will see the performance change whenever the override alters the group sizes, because of LLC thrashing, so you can tune it to find the best value. Thanks!
FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" FLASH_ATTN_LLC_CACHE_MB=80 python yourscript.py
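
For illustration, an override like this is typically read once and falls back to a default when it is unset or invalid. The sketch below shows that pattern only; the variable name comes from the comment above, while `llc_mb_from_env` and its fallback rules are assumptions, not the PR's actual parsing code.

```python
import os


def llc_mb_from_env(default_mb, environ=None):
    """Return the LLC size in MB, preferring FLASH_ATTN_LLC_CACHE_MB if valid."""
    env = os.environ if environ is None else environ
    raw = env.get("FLASH_ATTN_LLC_CACHE_MB")
    if raw is None:
        return default_mb  # no override set
    try:
        value = int(raw)
    except ValueError:
        return default_mb  # assumption: ignore non-integer overrides
    return value if value > 0 else default_mb
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.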

a quick and simple test from my side:

…on RDNA (#5018)

## Motivation

Long-sequence FMHA can become memory-bound when K/V working sets exceed Infinity Cache (LLC), causing repeated DRAM traffic across heads. This PR introduces LLC-aware launch ordering improvements for FMHA forward, and it is currently enabled only on gfx11 and gfx12. The approach is inspired by [`Dao-AILab/flash-attention#2217`](Dao-AILab/flash-attention#2217), adapted to CK’s kernel/runner structure and layout handling. In this context, `bshd` is the layout used in Flash-Attention, while `bhsd` is the default layout used by the CK Tile FMHA example.

## Technical Details

This PR adds two complementary strategies:

- For `bshd` input layout (`i_perm/o_perm=0`), enable explicit LLC-aware head grouping:
  - Estimate LLC size (env override, KFD sysfs, or arch default).
  - Compute group size from K/V bytes per head vs LLC target.
  - Launch FMHA forward repeatedly per head-group by slicing Q/K/V/O (and related tensors).
- For `bhsd` input layout (`i_perm/o_perm=1`), apply implicit launch-order adjustment:
  - Keep a single kernel launch.
  - Reinterpret block linearization in `GetTileIndex` to make execution head-major, improving temporal locality of per-head K/V reuse.

Additional integration updates:

- Propagate `num_head_q_total` and `head_start` through FMHA args/kargs.
- Use global head indexing for dropout RNG stream mapping so grouped launches keep deterministic/consistent dropout behavior.
- Keep fallback behavior unchanged when grouping is not beneficial or disabled.

## Test Plan

- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`

## Test Result

- `test_ck_tile_fmha`: all tests passed.
- `tile_example_fmha_fwd`: tested this on gfx1100, gfx1151, and gfx1201, and all of them show higher performance compared to the baseline. The improvement is consistent, and performance is well maintained even at long sequence lengths.

`./build/bin/tile_example_fmha_fwd -prec=bf16 -mode=0 -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`

- TFLOPs by sequence length, target: gfx1100, layout: bhsd

| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 56.27 | 61.48 | 1.09x |
| 4096 | 67.10 | 72.27 | 1.08x |
| 8192 | 65.99 | 71.64 | 1.09x |
| 12288 | 61.60 | 76.61 | 1.24x |
| 16384 | 58.99 | 75.74 | 1.28x |
| 20480 | 57.32 | 74.42 | 1.30x |
| 24576 | 56.89 | 74.25 | 1.31x |
| 27280 | 18.93 | 24.48 | 1.29x |

- TFLOPs by sequence length, target: gfx1201, layout: bshd

| SeqLen | Before | After | Speedup |
| -- | -- | -- | -- |
| 1024 | 66.79 | 65.90 | 0.99x |
| 4096 | 85.90 | 86.80 | 1.01x |
| 8192 | 77.06 | 90.29 | 1.17x |
| 12288 | 58.36 | 88.98 | 1.52x |
| 16384 | 52.12 | 88.88 | 1.71x |
| 20480 | 48.11 | 88.42 | 1.84x |
| 24576 | 47.12 | 89.07 | 1.89x |
| 27280 | 49.05 | 50.31 | 1.03x |

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
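
The "compute group size from K/V bytes per head vs LLC target" step can be sketched as follows. This is a simplified illustration, not the CK or Triton implementation: the helper names, the `utilization` fraction, and the decision to count only K and V bytes are all assumptions made for the example.

```python
def kv_bytes_per_head(seqlen_k, head_dim, dtype_bytes=2, batch=1):
    """Bytes of K plus V that one head touches (factor 2 = K and V)."""
    return 2 * batch * seqlen_k * head_dim * dtype_bytes


def heads_per_group(num_heads, seqlen_k, head_dim, llc_mb,
                    dtype_bytes=2, batch=1, utilization=0.8):
    """Largest head count whose K/V working set fits the LLC target."""
    # utilization is a hypothetical safety margin, not a value from the PR.
    target_bytes = llc_mb * 1024 * 1024 * utilization
    per_head = kv_bytes_per_head(seqlen_k, head_dim, dtype_bytes, batch)
    fit = int(target_bytes // per_head)
    # Always launch at least one head; never exceed the real head count.
    return max(1, min(num_heads, fit))


def head_groups(num_heads, group_size):
    """Half-open (start, end) head ranges, one per kernel launch."""
    return [(start, min(start + group_size, num_heads))
            for start in range(0, num_heads, group_size)]
```

For the benchmark shape above (bf16, `-d=128`, `-b=1`), one head at sequence length 16384 needs 8 MiB of K/V, so the group size shrinks as the sequence grows, while short sequences keep all heads in a single launch.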


Tianwei Yang added 6 commits January 29, 2026 01:37

[Triton]Implemented LLC-aware head grouping algorithm for long seqlen

9f47750

[Navi]improved with default triton config

0d89764

[Navi]Add LLC-aware head grouping to varlen_fwd

34a8d66

[Triton]Add one more PRE_LOAD_V_OPTIONS for autotune

24f6449

[Navi]enable LLC aware heads grouping for more archs

85bda34

[Navi]Adjust the default LLC untilization and triton config parameters

6345e85

tianwyan mentioned this pull request Jan 29, 2026

[ROCM] Add support with Infinity Cache (LLC) awareness for improved performance #2147

Closed

0xDELUXA reviewed Jan 29, 2026

Comment thread flash_attn/flash_attn_triton_amd/llc_cache_aware.py Outdated

0xDELUXA reviewed Jan 29, 2026

Comment thread flash_attn/flash_attn_triton_amd/llc_cache_aware.py Outdated

[Navi]Correct some configs for navi4x

47b11e9

micmelesse requested changes Jan 29, 2026

Comment thread flash_attn/flash_attn_triton_amd/llc_cache_aware.py Outdated

Tianwei Yang added 5 commits January 30, 2026 06:55

[Navi] Moving the heads grouping to fwd_prefill.py from interface level

07072f1

[Navi] code clean up, reuse get_arch and remove CUs num

5a882dd

[Navi]Limits the heads grouping for known RDNA; changed the DEBUG env

3f04039

[ROCM] revert the fallback triton config

ca9a399

[Navi]use pre-allocate to reuse memory, avoids allocator overhead/fragmentation

6cfbf7b

jnolck mentioned this pull request Feb 1, 2026

Add --use-flash-attention flag. Comfy-Org/ComfyUI#7223

Merged

tianwyan closed this Feb 2, 2026

tianwyan reopened this Feb 2, 2026

hyoon1 mentioned this pull request Mar 2, 2026

[CK_TILE] Add LLC-aware FMHA head grouping and head-major scheduling on RDNA ROCm/rocm-libraries#5018

Merged

1 task

sleppyrobot mentioned this pull request Mar 17, 2026

help! silveroxides/ComfyUI-QuantOps#9

Closed

tianwyan mentioned this pull request Mar 26, 2026

[ROCM] Add support with Infinity Cache (LLC) awareness for performance improvement ROCm/aiter#2483

Open

0xDELUXA mentioned this pull request Mar 26, 2026

[ROCM] Fix windows issues #2385

Merged

tianwyan mentioned this pull request Apr 1, 2026

[AMD ROCm] Update CK and add RDNA 3/4 support #2400

Merged

tianwyan wants to merge 12 commits into Dao-AILab:main from ROCm:tianwyan/triton_navi_2147_2178

4 participants
