[Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch by MatthewBonanni · Pull Request #2338 · Dao-AILab/flash-attention

MatthewBonanni · 2026-03-12T17:46:11Z

Using split KV with differing head dimensions (e.g. QK head dim 192 / V head dim 128 for DeepSeek MLA prefill) exceeds the available shared memory (SMEM), because the float32 partial output doubles the size of the O buffer. This PR reduces the tile size or disables splitting in that case, using a heuristic to ensure optimal performance.
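Here is a toy shared-memory model showing why the fp32 partial output pushes the kernel over the limit. Several pieces are assumptions for illustration only, and the real SM100 kernel's allocation and this PR's heuristic may differ:

- the buffer layout: one bf16 Q tile, three shared bf16 K/V stages sized by `max(hdim, hdimv)`, and one O staging tile;
- the 227 KB capacity;
- the names `estimate_smem_bytes`, `pick_config` and `TileConfig`, which are hypothetical.

```python
from dataclasses import dataclass

# Assumed opt-in SMEM limit per thread block (227 KB); an illustrative assumption.
SMEM_CAPACITY_BYTES = 227 * 1024


@dataclass(frozen=True)
class TileConfig:
    tile_m: int      # query rows per CTA tile
    tile_n: int      # key/value rows per pipeline stage
    num_splits: int  # number of KV splits (1 = no split)


def estimate_smem_bytes(hdim, hdimv, tile_m, tile_n, split, kv_stages=3):
    """Toy SMEM model (hypothetical): one bf16 Q tile, kv_stages shared bf16
    K/V buffers, and one O staging tile (fp32 when KV is split, else bf16)."""
    q_bytes = tile_m * hdim * 2
    kv_bytes = kv_stages * tile_n * max(hdim, hdimv) * 2
    o_elem_bytes = 4 if split else 2  # fp32 partial O doubles this term
    o_bytes = tile_m * hdimv * o_elem_bytes
    return q_bytes + kv_bytes + o_bytes


def pick_config(hdim, hdimv, num_splits, tile_m=128, tile_n=128, min_tile_n=64):
    """Hypothetical fallback: halve tile_n until the split config fits,
    otherwise disable splitting. Assumes the unsplit config always fits."""
    if num_splits > 1:
        n = tile_n
        while n >= min_tile_n:
            if estimate_smem_bytes(hdim, hdimv, tile_m, n, split=True) <= SMEM_CAPACITY_BYTES:
                return TileConfig(tile_m, n, num_splits)
            n //= 2
    return TileConfig(tile_m, tile_n, 1)


print(estimate_smem_bytes(192, 128, 128, 128, split=False))  # 229376: fits
print(estimate_smem_bytes(192, 128, 128, 128, split=True))   # 262144: over the limit
print(pick_config(192, 128, num_splits=4))                   # tile_n shrinks to 64
print(pick_config(128, 128, num_splits=4))                   # equal headdim keeps tile_n=128
print(pick_config(192, 128, num_splits=4, min_tile_n=128))   # no smaller tile allowed: num_splits=1
```

With equal 128/128 head dims the fp32 O term still fits in this model. That mirrors why only the differing-headdim case overflowed. The order of preference (a smaller tile before giving up on splitting) is a design guess, not taken from the PR.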

Also fixes a bug introduced in 99d0148 where the SM100 constructor is called with tile_m/tile_n instead of m_block_size/n_block_size, which would cause a TypeError.
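The kwarg mismatch fails at call time because Python rejects keyword arguments that a function does not declare. Below is a minimal illustration using a hypothetical stand-in class, `HypotheticalSm100Fwd`. It is not the real SM100 kernel class, whose full constructor signature is not shown in this PR. Only the `m_block_size`/`n_block_size` vs `tile_m`/`tile_n` naming comes from the description above.

```python
class HypotheticalSm100Fwd:
    """Stand-in for an SM100 forward-kernel class whose constructor takes
    m_block_size / n_block_size (hypothetical; not the real class)."""

    def __init__(self, head_dim, head_dim_v, m_block_size=128, n_block_size=128):
        self.head_dim = head_dim
        self.head_dim_v = head_dim_v
        self.m_block_size = m_block_size
        self.n_block_size = n_block_size


def build_buggy(tile_m, tile_n):
    # Passes keyword names the constructor does not declare -> TypeError.
    return HypotheticalSm100Fwd(192, 128, tile_m=tile_m, tile_n=tile_n)


def build_fixed(tile_m, tile_n):
    # Maps the caller's tile sizes onto the constructor's actual parameter names.
    return HypotheticalSm100Fwd(192, 128, m_block_size=tile_m, n_block_size=tile_n)


try:
    build_buggy(128, 64)
except TypeError as err:
    print("TypeError:", err)  # reports an unexpected keyword argument

kernel = build_fixed(128, 64)
print(kernel.m_block_size, kernel.n_block_size)
```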

This change has been applied to vLLM's FA fork via vllm-project#123 and vllm-project#126 to enable using FA4 for MLA prefill in vLLM (vllm-project/vllm#34732). This PR applies the fix upstream.

Benchmarking was performed using vLLM's attention benchmark tool:

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

tridao · 2026-03-12T17:57:09Z

I feel the right approach is to subtile the O inside the kernel so we can keep the same size of smem. But that's more annoying to impl so for now we can just decrease tile_n.
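The sub-tiling alternative suggested in the comment keeps SMEM usage constant by staging O in smaller pieces. The arithmetic below is a hypothetical illustration. It assumes toy sizes (tile_m=128, hdimv=128) and uses an invented helper, `o_staging_bytes`. Splitting the fp32 O tile into two column halves that reuse one buffer needs the same bytes as a full bf16 tile.

```python
def o_staging_bytes(tile_m, hdimv, elem_bytes, num_subtiles=1):
    """SMEM bytes for one O staging buffer when O is written out in
    num_subtiles column chunks that reuse the same buffer (toy model)."""
    if hdimv % num_subtiles != 0:
        raise ValueError("hdimv must divide evenly into subtiles")
    return tile_m * (hdimv // num_subtiles) * elem_bytes


bf16_full = o_staging_bytes(128, 128, elem_bytes=2)                    # unsplit output
fp32_full = o_staging_bytes(128, 128, elem_bytes=4)                    # split-KV partial O
fp32_halves = o_staging_bytes(128, 128, elem_bytes=4, num_subtiles=2)  # sub-tiled partial O

print(bf16_full, fp32_full, fp32_halves)  # 32768 65536 32768
```

Sub-tiling presumably costs an extra loop (and synchronization) in the epilogue but leaves tile_n unchanged. Decreasing tile_n, the interim fix proposed in the comment, instead frees space in the K/V stages without touching the O buffer.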

Fix

45c11c5

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

MatthewBonanni changed the title ~~Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch~~ [Fwd,sm100] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch Mar 12, 2026

MatthewBonanni changed the title ~~[Fwd,sm100] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch~~ [Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch Mar 12, 2026

tridao approved these changes Mar 12, 2026

tridao merged commit bbe25ba into Dao-AILab:main Mar 12, 2026

MatthewBonanni deleted the fix_splitkv_oom branch March 12, 2026 17:59

5t4r1i9ht pushed a commit to 5t4r1i9ht/flash-attention that referenced this pull request Mar 15, 2026

Fix (Dao-AILab#2338)

b4577e8

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

zhuochenKIDD pushed a commit to zhuochenKIDD/flash-attention that referenced this pull request Mar 25, 2026

Fix (Dao-AILab#2338)

f73cb95

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

NJX-njx pushed a commit to NJX-njx/flash-attention that referenced this pull request Mar 28, 2026

Fix (Dao-AILab#2338)

5a34356

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

tridao merged 1 commit into Dao-AILab:main from MatthewBonanni:fix_splitkv_oom
