
[AMD ROCm] Support gfx950#1586

Merged
tridao merged 3 commits into Dao-AILab:main from ROCm:v2.7.3-cktile/gfx950
Apr 11, 2025

Conversation

@rocking5566 (Contributor) commented Apr 11, 2025

  1. Update the AMD backend (composable_kernel) to support gfx950
  2. Unlock gfx950 in setup.py

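As a rough illustration of the second change, "unlocking" an architecture in a build script usually means adding it to an allowlist that gates compilation. The names below are hypothetical, not the actual flash-attention setup.py code:

```python
# Hypothetical sketch: extending a ROCm arch allowlist so the build
# accepts gfx950. ALLOWED_ARCHS and is_supported_arch are illustrative
# names, not identifiers from this repository.
ALLOWED_ARCHS = {"gfx90a", "gfx942", "gfx950"}  # gfx950 newly unlocked

def is_supported_arch(arch: str) -> bool:
    """Return True if the given ROCm GPU architecture is buildable."""
    return arch in ALLOWED_ARCHS
```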
@tridao tridao merged commit c1352b6 into Dao-AILab:main Apr 11, 2025
@rocking5566 rocking5566 changed the title from [AMD ROCm] Support MI350 to [AMD ROCm] Support gfx950 Apr 12, 2025
shcho1118 pushed a commit to shcho1118/flash-attention that referenced this pull request Apr 22, 2025
Don't use FusedDense anymore to simplify code

Fix FA3 qkvpacked interface

Launch more thread blocks in layer_norm_bwd

check valid tile before storing num_splits in split_idx (Dao-AILab#1578)

Tune rotary kernel to use 2 warps if rotary_dim <= 64

Implement attention_chunk

Fix missed attention chunk size param for block specifics in `mma_pv`. (Dao-AILab#1582)

[AMD ROCm] Support MI350 (Dao-AILab#1586)

* enable gfx950 support

* update ck for gfx950

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

Make attention_chunk work for non-causal cases

Use tile size 128 x 96 for hdim 64,256

Fix kvcache tests for attention_chunk when precomputing metadata

Fix kvcache test with precomputed metadata: pass in max_seqlen_q

Pass 0 as attention_chunk in the bwd for now

[LayerNorm] Implement option for zero-centered weight

Make hopper build more robust (Dao-AILab#1598)

In certain environments, the path to the vendored nvcc is not picked up correctly when it is provided as a relative path. In this PR, I simply make it absolute.
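A minimal sketch of that kind of fix, assuming the nvcc path is resolved inside a Python build script (the function name is illustrative, not the actual flash-attention code):

```python
import os

def resolve_nvcc_path(nvcc_path: str) -> str:
    """Resolve a possibly-relative nvcc path to an absolute one so it
    still points at the vendored compiler even if the build changes
    the working directory."""
    return os.path.abspath(nvcc_path)
```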

Fix L2 swizzle in causal tile scheduler

Use LPT scheduler for causal backward pass
playerzer0x pushed a commit to Liqhtworks/flash-attention that referenced this pull request Jul 24, 2025
* enable gfx950 support

* update ck for gfx950

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
elewarr pushed a commit to elewarr/flash-attention that referenced this pull request Feb 4, 2026
* enable gfx950 support

* update ck for gfx950

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>


3 participants