[CK] Fix RDNA3 FMHA tile-load paths#7016
Conversation
|
Thanks @jammm for your help and guidance in enabling xformers CK on RDNA/Windows! |
acb80cd to
9e55b0f
Compare
|
I’m wondering if this fallback is actually safe. My understanding is that on gfx12, we should not be calling async buffer load from the upper layer in the first place. If async buffer load is not supported or not intended to be used on gfx12, then silently falling back here may hide an incorrect/suboptimal code path. Wouldn’t the proper fix be to update the kernel/code path that currently emits async buffer load, so that it calls the regular buffer load directly on gfx12 instead? |
Tried to fix this with the following. PTAL: |
I might be mistaken, but does this mean it’s safe to fall back to gfx103 and gfx11, and only gfx12 needs this change? |
|
Neither supports async loads.
…On Wednesday, May 6, 2026, DELUXA ***@***.***> wrote:
*0xDELUXA* left a comment (ROCm/rocm-libraries#7016)
<#7016 (comment)>
Wouldn’t the proper fix be to update the kernel/code path that currently
emits async buffer load, so that it calls the regular buffer load directly
on gfx12 instead?
Tried to fix this with the following. PTAL: a13ca21
<a13ca21>
I might be mistaken, but does this mean that gfx103 and gfx11 support
async buffer load, while gfx12 doesn’t?
—
Reply to this email directly, view it on GitHub
<#7016 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AATCSOFZRZRZOCCX6S2ZLU34ZI4QVAVCNFSM6AAAAACYODWOT2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DGOBSGI2TANRXHE>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
gfx12 falls back from async global-to-LDS loads to sync VGPR loads plus LDS stores. The async raw API relies on buffer OOB behavior instead of tensor-coordinate validity, so keep the sync fallback aligned with that raw-load contract.
|
This PR now has CK fixes relevant to xformers for both RDNA3/4. |
poyenc
left a comment
There was a problem hiding this comment.
The gfx11 P-tile remap (PermuteWarpGemmCToA) and GetSmemKPackV fix are correct — WMMA lane layout differences between GEMM C and A tiles require this permutation for the P×V matmul, and the K-pack needs to match kKPerThread for WMMA. No concerns with those changes.
However, I have concerns about the gfx12/gfx103 synchronous fallbacks added to the core tile infrastructure (load_tile.hpp, tile_window.hpp, tile_window_linear.hpp, amd_buffer_addressing.hpp).
These fallbacks are dead code for all currently-dispatched paths. gfx12 FMHA only dispatches "qr" / "qr_hpad" (fully synchronous, never calls async_load_tile*). gfx11 FMHA similarly dispatches "qr" only. No async pipeline is dispatched on either architecture today.
The deeper problem is that these fallbacks sit in core tile infrastructure shared by GEMM, FlatMM, Fused MoE, Sparse Attention, and FMHA. Any future code that accidentally instantiates an async pipeline on gfx12 will silently compile and run correctly — but with all load/compute overlap removed. Without the fallbacks, the same mistake would produce a compile-time static_assert or runtime illegal instruction — immediate, obvious failure. Silent performance degradation is much harder to catch than a crash or compile error.
Also, __gfx12__ covers gfx1250 (MI450), which has dedicated TENSOR_LOAD_TO_LDS hardware. A blanket gfx12 fallback would prevent future async pipeline work on MI450 from using the correct instruction, forcing everything through the synchronous path instead.
Suggestion: Remove the gfx12 fallbacks from core infrastructure and let unsupported paths fail loudly at compile time. If a specific pipeline needs gfx12 support, the fallback should live in that pipeline — not in the shared tile load layer where it silently affects everything.
Drop the shared RDNA/gfx12 synchronous fallbacks from the core tile-load path so unsupported async pipelines continue to fail loudly instead of silently losing overlap. Keep the gfx11 FMHA-specific WMMA layout and K-pack fixes in the pipeline layer.
|
@poyenc removed the fallbacks. PTAL ^^ |
|
Async pipelines and the TRLoad/QS pipelines are neither used nor validated on gfx11 or gfx12, so we should avoid going down those paths. If execution somehow reaches them, there’s a risk that things could fail silently without any obvious warning or error. I’m not sure whether |
I've only touched the parts necessary to get xformers running. Tests are passing there. Is there anything you need on this PR to be done before we can merge merge? |
The async, trload, qs pipeline files currently contain gfx11-related code, but these pipelines are not used on RDNA. If this unverified path is ever taken, it could lead to potential issues. I'm also not sure whether the performance of these changes has actually been validated, or whether the modified code paths are truly covered by existing tests. At a minimum, we should make sure RDNA cannot enter these pipelines. To prevent this more reliably, I think it would be better to remove the RDNA-related code from these pipeline files altogether. I’m not very familiar with xFormers, but if we want to minimize the scope, it seems that only the patches related to |
|
@hyoon1 makes sense. Neither gfx11 nor gfx12 supports async load/store, so the app calling CK (xformers in this case) shouldn't be going through the async pipeline. As for gfx12, it does have support for some transpose load instructions, but that's a separate topic. I've pushed a commit that removes the gfx11 changes in the async/tr pipelines. The xformers PR has been modified to not use async pipelines for gfx11/12 ROCm/xformers@34064c9 |
1a7b409 to
25012a7
Compare
|
@poyenc thanks! I enabled auto-merge but it's waiting on "Math CI Summary" which doesn't seem to have triggered yet for many hours. |
|
@jammm The Math CI failure on build #2 is unrelated to your changes — AITER's |
|
Math CI is still stuck, and AITER test was still failing. Can we skip those? @poyenc |
|
@jammm I have already turn AITER tests off.. now it fails on the Build CK and run Tests on gfx942 stage. And the error log indicates that CI failed to run the cat command
|
|
The 24 Root cause: When Fix in #7530:
|
[CK] Fix RDNA3 FMHA tile-load paths ## Summary Fix CK tile FMHA paths needed for RDNA3/RDNA4 targets. ## Details This PR addresses RDNA-specific issues hit while enabling xFormers CK FMHA on gfx11/gfx12: - On RDNA3, update FMHA P tile handling so the layout consumed by the second GEMM matches the WMMA path. ## Testing Validated downstream with xFormers CK/FMHA on gfx1201/gfx1151. ```text pytest --import-mode=importlib -q \ tests/test_mem_eff_attention.py::test_forward \ tests/test_mem_eff_attention.py::test_backward \ tests/test_mem_eff_attention.py::test_dropout_ck 3844 passed, 5244 skipped, 26 warnings
Summary
Fix CK tile FMHA paths needed for RDNA3/RDNA4 targets.
Details
This PR addresses RDNA-specific issues hit while enabling xFormers CK FMHA on gfx11/gfx12:
Testing
Validated downstream with xFormers CK/FMHA on gfx1201/gfx1151.