
Fix missed attention_chunk_divmod param for block specifics in mma_pv. #1582

Merged

tridao merged 1 commit into Dao-AILab:main from wanderingai:patch-1 on Apr 10, 2025

Conversation

@wanderingai
Contributor

No description provided.
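The PR title says a precomputed `attention_chunk_divmod` parameter was missing from the block-specific code paths in `mma_pv`. The actual fix lives in CUDA device code; as a hedged illustration only, here is a minimal Python analogue of what such a fast-divmod parameter does: it maps an absolute key/value position to a (chunk index, offset-within-chunk) pair. The class name, chunk size, and call site below are all illustrative assumptions, not the repository's real API.

```python
# Hypothetical sketch (not the actual CUDA code): a stand-in for a
# precomputed divmod object like cutlass::FastDivmod, where the divisor
# is fixed once so repeated divides can be made cheap on device.
class FastDivmod:
    def __init__(self, divisor: int):
        assert divisor > 0
        self.divisor = divisor

    def __call__(self, n: int):
        # Returns (quotient, remainder). The real CUDA version replaces
        # the hardware integer divide with a precomputed multiply-shift;
        # here plain divmod shows the semantics only.
        return divmod(n, self.divisor)

# Example: with an attention chunk size of 128 (an example value),
# absolute position 300 lands in chunk 2 at offset 44.
attention_chunk_divmod = FastDivmod(128)
chunk_idx, offset = attention_chunk_divmod(300)
```

Forgetting to pass such a parameter into one code path means that path cannot locate chunk boundaries, which is consistent with the one-line fix this PR describes.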

@tridao tridao merged commit 7ff73af into Dao-AILab:main Apr 10, 2025
@wanderingai wanderingai deleted the patch-1 branch April 11, 2025 11:20
shcho1118 pushed a commit to shcho1118/flash-attention that referenced this pull request Apr 22, 2025
Don't use FusedDense anymore to simplify code

Fix FA3 qkvpacked interface

Launch more thread blocks in layer_norm_bwd

check valid tile before storing num_splits in split_idx (Dao-AILab#1578)

Tune rotary kernel to use 2 warps if rotary_dim <= 64

Implement attention_chunk

Fix missed attention chunk size param for block specifics in `mma_pv`. (Dao-AILab#1582)

[AMD ROCm] Support MI350 (Dao-AILab#1586)

* enable gfx950 support

* update ck for gfx950

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

Make attention_chunk work for non-causal cases

Use tile size 128 x 96 for hdim 64,256

Fix kvcache tests for attention_chunk when precomputing metadata

Fix kvcache test with precomputed metadata: pass in max_seqlen_q

Pass 0 as attention_chunk in the bwd for now

[LayerNorm] Implement option for zero-centered weight
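The commit above only names the feature; as a sketch of what a zero-centered weight option typically means (an assumption based on the common formulation, not on this repository's code), the learned scale is stored as `gamma - 1` and applied as `1 + weight`, so that a zero-initialized parameter yields the identity scale:

```python
import numpy as np

def layer_norm(x, weight, bias, eps=1e-5, zero_centered_weight=False):
    # Normalize over the last dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Zero-centered option: the parameter stores (gamma - 1), so a
    # zeros-initialized weight behaves like a ones-initialized gamma.
    scale = 1.0 + weight if zero_centered_weight else weight
    return x_hat * scale + bias
```

With this convention, weight decay pulls the effective scale toward 1 rather than toward 0, which is the usual motivation for the option.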

Make hopper build more robust (Dao-AILab#1598)

In certain environments, a relative path to the vendored nvcc is not resolved correctly. In this PR, I simply make it absolute.
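The description above explains the whole fix: a relative path depends on the process's working directory, which build systems do not always preserve. A minimal sketch of the idea (the function name and path below are illustrative, not taken from the repository's setup code):

```python
import os

def resolve_nvcc(nvcc_path: str) -> str:
    # Resolving the path once, against the current working directory,
    # makes the compiler location stable no matter which directory the
    # build is later invoked from.
    return os.path.abspath(nvcc_path)

nvcc = resolve_nvcc("third_party/nvcc/bin/nvcc")  # now absolute
```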

Fix L2 swizzle in causal tile scheduler

Use LPT scheduler for causal backward pass
playerzer0x pushed a commit to Liqhtworks/flash-attention that referenced this pull request Jul 24, 2025
elewarr pushed a commit to elewarr/flash-attention that referenced this pull request Feb 4, 2026


2 participants