[CK] Fix RDNA3 FMHA tile-load paths by jammm · Pull Request #7016 · ROCm/rocm-libraries

jammm · 2026-05-02T07:21:10Z

Summary

Fix CK tile FMHA paths needed for RDNA3/RDNA4 targets.

Details

This PR addresses RDNA-specific issues hit while enabling xFormers CK FMHA on gfx11/gfx12:

On RDNA3, update FMHA P tile handling so the layout consumed by the second GEMM matches the WMMA path.

Testing

Validated downstream with xFormers CK/FMHA on gfx1201/gfx1151.

pytest --import-mode=importlib -q \
  tests/test_mem_eff_attention.py::test_forward \
  tests/test_mem_eff_attention.py::test_backward \
  tests/test_mem_eff_attention.py::test_dropout_ck

3844 passed, 5244 skipped, 26 warnings

0xDELUXA · 2026-05-02T11:02:40Z

Thanks @jammm for your help and guidance in enabling xformers CK on RDNA/Windows!

hyoon1 · 2026-05-05T16:04:01Z

I’m wondering if this fallback is actually safe.

My understanding is that on gfx12, we should not be calling async buffer load from the upper layer in the first place. If async buffer load is not supported or not intended to be used on gfx12, then silently falling back here may hide an incorrect/suboptimal code path.

Wouldn’t the proper fix be to update the kernel/code path that currently emits async buffer load, so that it calls the regular buffer load directly on gfx12 instead?

jammm · 2026-05-05T17:50:23Z

> Wouldn’t the proper fix be to update the kernel/code path that currently emits async buffer load, so that it calls the regular buffer load directly on gfx12 instead?

Tried to fix this with the following. PTAL:
a13ca21

0xDELUXA · 2026-05-05T19:13:56Z

> > Wouldn’t the proper fix be to update the kernel/code path that currently emits async buffer load, so that it calls the regular buffer load directly on gfx12 instead?
>
> Tried to fix this with the following. PTAL: a13ca21

I might be mistaken, but does this mean it’s safe to fall back to gfx103 and gfx11, and only gfx12 needs this change?

jammm · 2026-05-05T19:25:22Z

Neither supports async loads.

On Wednesday, May 6, 2026, DELUXA wrote:

> I might be mistaken, but does this mean that gfx103 and gfx11 support async buffer load, while gfx12 doesn’t?

Commit 09ad9fa ("Preserve raw async tile-load semantics on gfx12"): gfx12 falls back from async global-to-LDS loads to sync VGPR loads plus LDS stores. The async raw API relies on buffer out-of-bounds (OOB) behavior instead of tensor-coordinate validity, so keep the sync fallback aligned with that raw-load contract.
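The fallback described above can be illustrated with a small host-side emulation. This is only a sketch under assumptions: `raw_buffer_load` and `sync_fallback_copy` are hypothetical names rather than CK APIs, plain vectors stand in for global memory and LDS, and the rule that an out-of-bounds read yields zero is a simplified model of buffer OOB handling. The point it tries to show is that the sync path checks only the buffer bound, as the raw async load would, and does not consult tensor-coordinate validity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a raw buffer load: an offset past the end of the
// buffer yields 0 instead of faulting (assumed OOB behavior for this sketch).
inline float raw_buffer_load(const std::vector<float>& buffer, std::size_t offset)
{
    return offset < buffer.size() ? buffer[offset] : 0.0f;
}

// Hypothetical sync fallback: load each element into a temporary (standing in
// for a VGPR), then store it to `lds` (standing in for shared memory).
// There is no tensor-coordinate validity check here, only the buffer bound.
inline void sync_fallback_copy(const std::vector<float>& buffer,
                               std::size_t first_offset,
                               std::vector<float>& lds)
{
    for(std::size_t i = 0; i < lds.size(); ++i)
    {
        const float vgpr = raw_buffer_load(buffer, first_offset + i); // global -> register
        lds[i]           = vgpr;                                      // register -> LDS
    }
}
```

Copying four elements starting at offset 1 from a three-element buffer would fill the last two LDS slots with zero in this model, instead of skipping them.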

jammm · 2026-05-06T10:19:43Z

This PR now has CK fixes relevant to xformers for both RDNA3/4.

poyenc

The gfx11 P-tile remap (PermuteWarpGemmCToA) and GetSmemKPackV fix are correct — WMMA lane layout differences between GEMM C and A tiles require this permutation for the P×V matmul, and the K-pack needs to match kKPerThread for WMMA. No concerns with those changes.
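To make the need for a C-to-A remap concrete, here is a host-side sketch for a single 16x16 tile on a 32-lane wave. The two layouts in the comments are assumptions made for illustration (they follow how the gfx11 WMMA accumulator and A-operand layouts are commonly described, but they are not taken from this PR), and `to_c_layout` and `permute_c_to_a` are hypothetical helpers, not CK's `PermuteWarpGemmCToA`. Real device code would move data with cross-lane operations rather than array indexing.

```cpp
#include <array>
#include <cassert>

// ASSUMED layouts for a 16x16 tile on a 32-lane wave (illustration only):
//   C fragment: lane l, register i (0..7) holds P[2*i + l/16][l%16]
//   A fragment: lane l, element k (0..15) holds P[l%16][k]
constexpr int kLanes = 32;
using Tile  = std::array<std::array<float, 16>, 16>;
using CFrag = std::array<std::array<float, 8>, kLanes>;
using AFrag = std::array<std::array<float, 16>, kLanes>;

// Scatter a tile into the assumed C (accumulator) layout.
inline CFrag to_c_layout(const Tile& p)
{
    CFrag c{};
    for(int lane = 0; lane < kLanes; ++lane)
        for(int i = 0; i < 8; ++i)
            c[lane][i] = p[2 * i + lane / 16][lane % 16];
    return c;
}

// Hypothetical remap: gather the values each lane needs for the A operand.
// Element P[row][k] lives in C at lane (k + 16 * (row % 2)), register row / 2.
inline AFrag permute_c_to_a(const CFrag& c)
{
    AFrag a{};
    for(int lane = 0; lane < kLanes; ++lane)
    {
        const int row = lane % 16;
        for(int k = 0; k < 16; ++k)
            a[lane][k] = c[k + 16 * (row % 2)][row / 2];
    }
    return a;
}
```

Under these assumed layouts each lane of the A operand needs values that sit in sixteen different lanes of the accumulator, so feeding the first GEMM's output straight into the second GEMM would consume the wrong elements.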

However, I have concerns about the gfx12/gfx103 synchronous fallbacks added to the core tile infrastructure (load_tile.hpp, tile_window.hpp, tile_window_linear.hpp, amd_buffer_addressing.hpp).

These fallbacks are dead code for all currently-dispatched paths. gfx12 FMHA only dispatches "qr" / "qr_hpad" (fully synchronous, never calls async_load_tile*). gfx11 FMHA similarly dispatches "qr" only. No async pipeline is dispatched on either architecture today.
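The dispatch situation described here can be written down as a small lookup. This is a sketch only: the `Arch` enum and `is_dispatchable` are hypothetical, the "qr" and "qr_hpad" tags come from the comment above, and "qr_async" is used purely as an example of a tag that should be rejected on these targets.

```cpp
#include <cassert>
#include <string>

// Hypothetical architecture families for this sketch.
enum class Arch { Gfx11, Gfx12, Other };

// Pipeline tags that FMHA dispatches per architecture, as stated in the
// review comment. Anything else is treated as not dispatchable on RDNA.
inline bool is_dispatchable(Arch arch, const std::string& tag)
{
    switch(arch)
    {
    case Arch::Gfx11: return tag == "qr";
    case Arch::Gfx12: return tag == "qr" || tag == "qr_hpad";
    default: return true; // other targets are out of scope for this sketch
    }
}
```

With a table like this, `is_dispatchable(Arch::Gfx12, "qr_async")` is false, which matches the claim that no async pipeline reaches either architecture today.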

The deeper problem is that these fallbacks sit in core tile infrastructure shared by GEMM, FlatMM, Fused MoE, Sparse Attention, and FMHA. Any future code that accidentally instantiates an async pipeline on gfx12 will silently compile and run correctly — but with all load/compute overlap removed. Without the fallbacks, the same mistake would produce a compile-time static_assert or runtime illegal instruction — immediate, obvious failure. Silent performance degradation is much harder to catch than a crash or compile error.

Also, __gfx12__ covers gfx1250 (MI450), which has dedicated TENSOR_LOAD_TO_LDS hardware. A blanket gfx12 fallback would prevent future async pipeline work on MI450 from using the correct instruction, forcing everything through the synchronous path instead.

Suggestion: Remove the gfx12 fallbacks from core infrastructure and let unsupported paths fail loudly at compile time. If a specific pipeline needs gfx12 support, the fallback should live in that pipeline — not in the shared tile load layer where it silently affects everything.
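A minimal C++17 sketch of this suggestion follows. All names are hypothetical (`ArchWithAsync`, `ArchWithoutAsync`, `supports_async_load`, `shared_async_load_tile`, `pipeline_load`); none of them are CK types or functions. It assumes the shared layer can query a per-architecture trait, and it shows the shared entry point refusing to compile on an unsupported target while a pipeline that wants a synchronous path opts into it explicitly.

```cpp
#include <cassert>

// Hypothetical architecture tags and capability trait; not CK types.
struct ArchWithAsync {};
struct ArchWithoutAsync {};

template <typename Arch>
struct supports_async_load { static constexpr bool value = false; };
template <>
struct supports_async_load<ArchWithAsync> { static constexpr bool value = true; };

enum class LoadPath { Async, Sync };

// Shared layer: no silent fallback. Instantiating this for an unsupported
// architecture is a compile error.
template <typename Arch>
LoadPath shared_async_load_tile()
{
    static_assert(supports_async_load<Arch>::value,
                  "async tile load is not supported on this architecture");
    return LoadPath::Async;
}

inline LoadPath shared_sync_load_tile() { return LoadPath::Sync; }

// Pipeline layer: a pipeline that must run on both kinds of target makes the
// choice itself, so the decision is visible where it is made.
template <typename Arch>
LoadPath pipeline_load()
{
    if constexpr(supports_async_load<Arch>::value)
        return shared_async_load_tile<Arch>();
    else
        return shared_sync_load_tile();
}
```

Calling `shared_async_load_tile<ArchWithoutAsync>()` directly would fail to compile with the message above, whereas `pipeline_load<ArchWithoutAsync>()` compiles because the unsupported branch is discarded inside the pipeline that chose to handle it.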

Drop the shared RDNA/gfx12 synchronous fallbacks from the core tile-load path so unsupported async pipelines continue to fail loudly instead of silently losing overlap. Keep the gfx11 FMHA-specific WMMA layout and K-pack fixes in the pipeline layer.

jammm · 2026-05-11T06:34:42Z

@poyenc removed the fallbacks. PTAL ^^

hyoon1 · 2026-05-12T15:29:04Z

Async pipelines and the TRLoad/QS pipelines are neither used nor validated on gfx11 or gfx12, so we should avoid going down those paths. If execution somehow reaches them, there’s a risk that things could fail silently without any obvious warning or error. I’m not sure whether whole_k_prefetch has been properly validated either, but it seems like we should keep the code changes minimal and only touch the parts that are truly necessary.

jammm · 2026-05-12T16:10:25Z

> Async pipelines and the TRLoad/QS pipelines are neither used nor validated on gfx11 or gfx12, so we should avoid going down those paths. If execution somehow reaches them, there’s a risk that things could fail silently without any obvious warning or error. I’m not sure whether whole_k_prefetch has been properly validated either, but it seems like we should keep the code changes minimal and only touch the parts that are truly necessary.

I've only touched the parts necessary to get xformers running. Tests are passing there. Is there anything you need done on this PR before we can merge?

hyoon1 · 2026-05-12T17:38:36Z

> > Async pipelines and the TRLoad/QS pipelines are neither used nor validated on gfx11 or gfx12, so we should avoid going down those paths. If execution somehow reaches them, there’s a risk that things could fail silently without any obvious warning or error. I’m not sure whether whole_k_prefetch has been properly validated either, but it seems like we should keep the code changes minimal and only touch the parts that are truly necessary.
>
> I've only touched the parts necessary to get xformers running. Tests are passing there. Is there anything you need done on this PR before we can merge?

The async, trload, qs pipeline files currently contain gfx11-related code, but these pipelines are not used on RDNA. If this unverified path is ever taken, it could lead to potential issues. I'm also not sure whether the performance of these changes has actually been validated, or whether the modified code paths are truly covered by existing tests.

At a minimum, we should make sure RDNA cannot enter these pipelines. To prevent this more reliably, I think it would be better to remove the RDNA-related code from these pipeline files altogether.

I’m not very familiar with xFormers, but if we want to minimize the scope, it seems that only the patches related to qr_ks_vs_whole_k_prefetch should be necessary.

jammm · 2026-05-12T18:35:03Z

@hyoon1 makes sense. Neither gfx11 nor gfx12 supports async load/store, so the app calling CK (xformers in this case) shouldn't be going through the async pipeline. As for gfx12, it does have support for some transpose load instructions, but that's a separate topic.

I've pushed a commit that removes the gfx11 changes in the async/tr pipelines. The xformers PR has been modified to not use async pipelines for gfx11/12: ROCm/xformers@34064c9

jammm · 2026-05-13T08:26:25Z

@poyenc thanks! I enabled auto-merge, but it's waiting on "Math CI Summary", which hasn't triggered for many hours.

poyenc · 2026-05-14T22:04:28Z

@jammm The Math CI failure on build #2 is unrelated to your changes — AITER's mha_bwd.cu is referencing fmha_bwd_launcher members (workspace_size, prepare_workspace) and initializer fields that were changed upstream. Build #3 is currently running; if it hits the same AITER stage failure, we can skip that test and re-trigger.

jammm · 2026-05-15T12:05:57Z

Math CI is still stuck, and the AITER test is still failing. Can we skip those? @poyenc

poyenc · 2026-05-15T22:46:15Z

@jammm I have already turned the AITER tests off. It now fails at the "Build CK and run Tests on gfx942" stage, and the error log indicates that CI failed to run the cat command:

cat: /sys/module/amdgpu/version: No such file or directory

poyenc · 2026-05-18T04:05:35Z

The 24 run_sink_init_tests failures on gfx1201 (Build #6) are addressed by #7530.

Root cause: When -init_sink=1 -mask=1 is passed, traits.has_sink doesn't check init_sink_value, so the *_nsink (no-sink) kernel is dispatched. The reference expects sink-initialized output (zeros for masked positions), but the GPU produces standard attention output — resulting in 100% wrong values across all head dims (d=64/128/256), precisions (fp16/bf16), modes (batch/group), and layouts (bshd/bhsd).

Fix in #7530:

- Includes init_sink_value != 0 in the has_sink trait check so the sink-enabled kernel is dispatched correctly.
- Gates run_sink_init_tests behind an opt-in -g flag in smoke_test_fwd.sh, since sink=true kernels are excluded from CI builds by the *_nsink* CMake filter.
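The dispatch bug and its fix can be sketched as follows. This is an illustration under assumptions, not the code from #7530: `SinkArgs`, `has_sink_before`, `has_sink_after`, and `kernel_suffix` are hypothetical names, `sink_requested` stands in for whatever condition the trait already checked, and `init_sink_value` is modeled as an integer.

```cpp
#include <cassert>
#include <string>

// Hypothetical argument bundle; not the real test-harness type.
struct SinkArgs
{
    bool sink_requested; // stand-in for whatever the original trait checked
    int init_sink_value; // value passed via -init_sink (assumed integer)
};

// Before: init_sink_value is ignored when deciding whether a sink is present.
inline bool has_sink_before(const SinkArgs& a) { return a.sink_requested; }

// After: a non-zero init_sink_value also selects the sink-enabled kernel.
inline bool has_sink_after(const SinkArgs& a)
{
    return a.sink_requested || a.init_sink_value != 0;
}

// Hypothetical kernel-name suffix selection based on the trait.
inline std::string kernel_suffix(bool has_sink) { return has_sink ? "_sink" : "_nsink"; }
```

For arguments modeling `-init_sink=1` with no other sink condition set, the old check picks the `_nsink` variant while the new one picks the sink-enabled variant, which is the mismatch the reference comparison exposed.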

jammm · 2026-05-18T07:10:20Z

@poyenc thanks for the heads up! Given that those failing tests are fixed in #7530, can we skip the check and merge this PR?

jammm requested a review from a team as a code owner May 2, 2026 07:21

github-actions Bot added the project: composablekernel label May 2, 2026

jammm mentioned this pull request May 2, 2026

Initial RDNA Windows bring-up for CK FMHA ROCm/xformers#86

Closed

assistant-librarian Bot added the organization: ROCm label May 2, 2026

jakpiase approved these changes May 4, 2026

poyenc reviewed May 5, 2026

jammm added 2 commits May 5, 2026 14:19

Fix gfx12 async buffer load fallback

bbaeb15

Merge branch 'develop' of github.com:ROCm/rocm-libraries into users/j…

9e55b0f

…am/gfx12-buffer-load-fallback

jammm force-pushed the users/jam/gfx12-buffer-load-fallback branch from acb80cd to 9e55b0f Compare May 5, 2026 06:38

jammm requested a review from poyenc May 5, 2026 06:41

Fix gfx11 FMHA P tile layout remaps

e5be663

Route gfx12 async tile loads through sync path

a13ca21

Preserve raw async tile-load semantics on gfx12

09ad9fa

jammm changed the title ~~Fix gfx12 async buffer load fallback~~ [CK] Fix RDNA3/RDNA4 FMHA tile-load paths May 6, 2026

[CK] Fix gfx12 async tile-load fallback warnings

d17ec03

jammm mentioned this pull request May 9, 2026

(Continuted from PR #86) Initial RDNA Windows bring-up for CK FMHA# ROCm/xformers#87

Open

poyenc reviewed May 11, 2026

poyenc requested a review from a team May 12, 2026 03:08

[CK] Keep RDNA remap out of async FMHA paths

25012a7

jammm force-pushed the users/jam/gfx12-buffer-load-fallback branch from 1a7b409 to 25012a7 Compare May 12, 2026 18:37

poyenc approved these changes May 13, 2026

jammm enabled auto-merge (squash) May 13, 2026 08:25

shumway disabled auto-merge May 14, 2026 20:43

Merge branch 'develop' into users/jam/gfx12-buffer-load-fallback

b19eb56

Merge branch 'develop' into users/jam/gfx12-buffer-load-fallback

f04e732

jammm changed the title ~~[CK] Fix RDNA3/RDNA4 FMHA tile-load paths~~ [CK] Fix RDNA3 FMHA tile-load paths May 15, 2026

jammm enabled auto-merge (squash) May 15, 2026 06:18

Merge branch 'develop' into users/jam/gfx12-buffer-load-fallback

981893b

Merge branch 'develop' into users/jam/gfx12-buffer-load-fallback

fb32b4b

jammm disabled auto-merge May 18, 2026 10:13

illsilin merged commit 2b73c00 into develop May 19, 2026
31 checks passed

illsilin deleted the users/jam/gfx12-buffer-load-fallback branch May 19, 2026 13:41
