
Make hopper build more robust #1598

Merged
tridao merged 1 commit into Dao-AILab:main from classner:patch-1
Apr 17, 2025
Conversation

@classner
Contributor

In certain environments, the relative path to the vendored nvcc is not picked up correctly. This PR simply makes the path absolute.

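The change described above boils down to resolving the vendored nvcc location to an absolute path before the build system sees it. A minimal sketch of that idea in Python follows; the function name and example path are illustrative, not taken from the actual setup.py:

```python
import os

def resolve_nvcc_path(nvcc_path: str) -> str:
    """Return an absolute path to the vendored nvcc binary.

    A relative path (e.g. one computed relative to the source tree) may be
    interpreted against a different working directory by the build backend,
    so we resolve it up front. Path and name here are hypothetical.
    """
    return os.path.abspath(nvcc_path)

# Example: a path relative to the current working directory becomes absolute.
print(resolve_nvcc_path("bin/nvcc"))
```

Resolving once, early, means every later consumer (compiler invocations, subprocess calls) sees the same unambiguous location regardless of its working directory.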
tridao merged commit 934f6ad into Dao-AILab:main on Apr 17, 2025
shcho1118 pushed a commit to shcho1118/flash-attention that referenced this pull request Apr 22, 2025
Don't use FusedDense anymore to simplify code

Fix FA3 qkvpacked interface

Launch more thread blocks in layer_norm_bwd

check valid tile before storing num_splits in split_idx (Dao-AILab#1578)

Tune rotary kernel to use 2 warps if rotary_dim <= 64

Implement attention_chunk

Fix missed attention chunk size param for block specifics in `mma_pv`. (Dao-AILab#1582)

[AMD ROCm] Support MI350 (Dao-AILab#1586)

* enable gfx950 support

* update ck for gfx950

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

Make attention_chunk work for non-causal cases

Use tile size 128 x 96 for hdim 64,256

Fix kvcache tests for attention_chunk when precomputing metadata

Fix kvcache test with precomputed metadata: pass in max_seqlen_q

Pass 0 as attention_chunk in the bwd for now

[LayerNorm] Implement option for zero-centered weight

Make hopper build more robust (Dao-AILab#1598)

In certain environments the relative path to the vendored nvcc is not picked up correctly if provided relative. In this PR, I just make it absolute.

Fix L2 swizzle in causal tile scheduler

Use LPT scheduler for causal backward pass
playerzer0x pushed a commit to Liqhtworks/flash-attention that referenced this pull request Jul 24, 2025
In certain environments the relative path to the vendored nvcc is not picked up correctly if provided relative. In this PR, I just make it absolute.
elewarr pushed a commit to elewarr/flash-attention that referenced this pull request Feb 4, 2026
In certain environments the relative path to the vendored nvcc is not picked up correctly if provided relative. In this PR, I just make it absolute.