
[AMD ROCm] Enable CK backend for ROCm gfx12 #2054

Closed
hyoon1 wants to merge 7 commits into Dao-AILab:main from hyoon1:enable-ck-gfx12

Conversation

hyoon1 (Contributor) commented Dec 8, 2025

This extends #2052, which updated to the latest Composable Kernel version. The latest CK now supports gfx12 architectures, but the CK kernel generator needs an explicit target specification to generate kernels for these GPUs.

  • Added gfx1200 and gfx1201 to the allowed architectures
  • Added GPU_ARCHS environment variable support for explicit CK target specification
  • Auto-detects the GPU when GPU_ARCHS is not set
  • Modified the CK generator to pass a --targets flag with the specified architectures
  • Disabled CK deterministic backward on gfx12 (forcing nondeterministic kernels and skipping deterministic CK tests there) because the deterministic path is unstable on these GPUs (GPU hangs occur)
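A minimal sketch of the target-selection logic described above, assuming a semicolon-separated GPU_ARCHS value; the names `ALLOWED_ARCHS` and `resolve_ck_targets`, the allowed-list contents, and the separator are illustrative assumptions, not the PR's actual code:

```python
import os

# Architectures the CK generator may target; per this PR, gfx1200 and
# gfx1201 are newly allowed (the rest of this list is illustrative).
ALLOWED_ARCHS = {"gfx90a", "gfx942", "gfx1200", "gfx1201"}

def resolve_ck_targets(detected_arch):
    """Honor GPU_ARCHS if set (semicolon-separated), otherwise fall
    back to the auto-detected GPU. Unknown archs are rejected."""
    raw = os.environ.get("GPU_ARCHS", "").strip()
    archs = [a for a in raw.split(";") if a] if raw else [detected_arch]
    bad = [a for a in archs if a not in ALLOWED_ARCHS]
    if bad:
        raise ValueError(f"unsupported GPU arch(s): {bad}")
    return archs

# Explicit targets via the environment variable override detection:
os.environ["GPU_ARCHS"] = "gfx1200;gfx1201"
print(resolve_ck_targets("gfx90a"))  # → ['gfx1200', 'gfx1201']
```

The resulting list would then be joined into the generator's --targets flag.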

tridao (Member) commented Dec 10, 2025

Cc @rocking5566

Logiquo commented Jan 3, 2026

Based on the commit history of the CK repo, gfx11 and gfx12 appear to have been supported at approximately the same time. Maybe also add the gfx11 archs?

at::cuda::CUDAGuard device_guard{q.device()};

auto opts = q.options();
// gfx12 deterministic bwd is unstable; always fall back to nondeterministic there.
rocking5566 (Contributor) commented Jan 20, 2026

I suggest adding a TORCH_CHECK to warn the user rather than switching to nondeterministic automatically.

hyoon1 (Author) replied: updated


return_softmax,
is_grad_enabled,
):
deterministic = _disable_gfx12_deterministic(deterministic, qkv.device)
rocking5566 (Contributor) commented

I think we should not change the parameter inside the API; just assert and warn the user instead.

hyoon1 (Author) replied

Updated. The Python API no longer mutates deterministic. Instead, the C++ CK backward now uses TORCH_CHECK to raise an error when deterministic=True on gfx12.
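For illustration, a Python analogue of the resulting guard; the real check is a TORCH_CHECK inside the C++ CK backward, and the function name and arch-string test here are assumptions:

```python
def check_gfx12_deterministic(deterministic, gcn_arch):
    """Reject deterministic backward on gfx12, where the deterministic
    CK path is unstable (GPU hangs, per this PR's discussion).
    Hypothetical sketch; not the PR's actual function."""
    if deterministic and gcn_arch.startswith("gfx12"):
        raise RuntimeError(
            "FlashAttention CK backend: deterministic backward is not "
            "supported on gfx12 (gfx1200/gfx1201); pass deterministic=False"
        )

# No-op on other architectures, or when nondeterministic is requested:
check_gfx12_deterministic(True, "gfx942")
check_gfx12_deterministic(False, "gfx1200")
```

Raising instead of silently flipping the flag keeps the user's deterministic=True request from being ignored, which was the reviewer's point.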


g = torch.randn_like(out)
if is_bwd_hdim_supported(d):
if is_bwd_hdim_supported(d) and not skip_deterministic_bwd(deterministic):
rocking5566 (Contributor) commented Jan 20, 2026

Use this function:
def is_bwd_supported(d):
    return is_bwd_hdim_supported(d) and not skip_deterministic_bwd(deterministic)

hyoon1 (Author) replied: updated

rocking5566 (Contributor) commented

Could you update the supported GPUs for the Composable Kernel backend in the README?

rocking5566 (Contributor) commented
LGTM
@tridao could you help to merge?

rocking5566 (Contributor) left a review

LGTM

bluefalcon13 commented Feb 6, 2026

I just wanted to say thank you, @hyoon1. I have been fighting to build a ROCm test bed on my 9070 XT desktop to figure out whether I want a Strix Halo for dedicated AI work. After quite a few days of ROCm dependency hell, I finally hit the mark with your pull request!

vLLM params

docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --shm-size=16gb \
    -e VLLM_USE_TRITON_FLASH_ATTN=True \
    -p 8000:8000 vllm-rocm \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096 \
    --trust-remote-code

vLLM bench params

vllm bench serve \
    --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 512 \
    --request-rate 10 \
    --num-prompts 200

results:

Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
============ Serving Benchmark Result ============
Successful requests:                     200       
Failed requests:                         0         
Request rate configured (RPS):           10.00     
Benchmark duration (s):                  412.70    
Total input tokens:                      204600    
Total generated tokens:                  102400    
Request throughput (req/s):              0.48      
Output token throughput (tok/s):         248.12    
Peak output token throughput (tok/s):    325.00    
Peak concurrent requests:                194.00    
Total token throughput (tok/s):          743.88    
---------------Time to First Token----------------
Mean TTFT (ms):                          187872.60 
Median TTFT (ms):                        187549.63 
P99 TTFT (ms):                           372961.62 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          22.94     
Median TPOT (ms):                        20.84     
P99 TPOT (ms):                           31.85     
---------------Inter-token Latency----------------
Mean ITL (ms):                           22.94     
Median ITL (ms):                         20.25     
P99 ITL (ms):                            21.11     
==================================================

A few days ago I was capped at approximately 32 tokens/sec. Not bad for a 16 GB card.

Build process:
https://github.com/bluefalcon13/vllm-rocm.git

liangshen68 commented

@tridao could you please help merge this PR? Many users are trying to use CK-based FlashAttention on gfx12. Thanks.

rocking5566 (Contributor) commented

Hi @hyoon1, we've opened #2400, which is a more complete version of this PR (it includes gfx11 support, LLC head grouping, and improvements from the code review). This PR can be closed in favor of #2400.

hyoon1 (Author) commented Mar 26, 2026

Closing this PR in favor of #2400

hyoon1 closed this on Mar 26, 2026