
[Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch #2338

Merged
tridao merged 1 commit into Dao-AILab:main from MatthewBonanni:fix_splitkv_oom
Mar 12, 2026

Conversation

MatthewBonanni (Contributor) commented on Mar 12, 2026

Using split KV with different head dims (e.g. 192/128 for DeepSeek MLA prefill) exceeds the SMEM budget, because the float32 partial-output accumulator doubles the size of the O buffer. This PR reduces the tile size, or disables splitting, in that case, using a heuristic to preserve performance.
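To make the constraint concrete, here is a minimal sketch of this kind of heuristic. All names, the buffer layout, and the 228 KB budget are illustrative assumptions for this sketch, not the actual kernel's constants:

```python
# Hypothetical sketch of a split-KV SMEM heuristic. The names and the
# 228 KB budget are illustrative, not the actual kernel's constants.
SMEM_BUDGET = 228 * 1024  # approx. per-SM shared memory on SM100


def smem_bytes(tile_m, tile_n, headdim_qk, headdim_v, split_kv):
    q = tile_m * headdim_qk * 2                      # bf16 Q tile
    kv = 2 * tile_n * (headdim_qk + headdim_v) * 2   # double-buffered bf16 K/V
    # Split KV stages float32 partial outputs (4 B/elem) instead of
    # bf16 (2 B/elem), doubling the O buffer footprint.
    o = tile_m * headdim_v * (4 if split_kv else 2)
    return q + kv + o


def pick_tile_n(tile_m, tile_n, headdim_qk, headdim_v, split_kv):
    """Shrink tile_n, then give up on splitting, until the tiles fit."""
    while smem_bytes(tile_m, tile_n, headdim_qk, headdim_v, split_kv) > SMEM_BUDGET:
        if tile_n > 64:
            tile_n //= 2        # first try a smaller KV tile
        elif split_kv:
            split_kv = False    # fall back: disable split KV entirely
        else:
            raise ValueError("tiles cannot fit in SMEM")
    return tile_n, split_kv


# Equal head dims fit at the full tile size; the 192/128 MLA case must shrink:
print(pick_tile_n(128, 128, 128, 128, split_kv=True))  # (128, True)
print(pick_tile_n(128, 128, 192, 128, split_kv=True))  # (64, True)
```

Under these assumed sizes, the 192/128 case only overflows when the float32 partials are staged, which is why shrinking tile_n (rather than always disabling splits) is enough to recover.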

Also fixes a bug introduced in 99d0148 where the SM100 constructor is called with tile_m/tile_n instead of m_block_size/n_block_size, which would cause a TypeError.
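For context, a keyword mismatch like this fails at construction time rather than silently misbehaving. A standalone reproduction, where the class below is a stand-in for illustration (not the actual SM100 kernel class):

```python
# Stand-in class for illustration; only the parameter names matter here.
class FlashAttentionForwardSm100:
    def __init__(self, m_block_size: int, n_block_size: int):
        self.m_block_size = m_block_size
        self.n_block_size = n_block_size


# Passing the old names raises:
#   TypeError: ... unexpected keyword argument 'tile_m'
try:
    FlashAttentionForwardSm100(tile_m=128, tile_n=128)
except TypeError as e:
    print(e)

# The fix: use the names the constructor actually declares.
kernel = FlashAttentionForwardSm100(m_block_size=128, n_block_size=128)
```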

This change has been applied to vLLM's FA fork via vllm-project#123 and vllm-project#126 to enable using FA4 for MLA prefill in vLLM (vllm-project/vllm#34732). This PR applies the fix upstream.

Benchmarking performed using vLLM attention benchmark tool:
[benchmark results image]

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
MatthewBonanni changed the title Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch → [Fwd,sm100] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch Mar 12, 2026
MatthewBonanni changed the title [Fwd,sm100] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch → [Fwd,SM100,CuTe] Fix split KV OOM with diff headdim + fix SM100 kwarg mismatch Mar 12, 2026
tridao (Member) commented on Mar 12, 2026

I feel the right approach is to subtile O inside the kernel so we can keep the same SMEM size. But that's more annoying to implement, so for now we can just decrease tile_n.
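For intuition on the subtiling idea (illustrative numbers only, not the kernel's actual layout): staging the O tile in head-dim chunks means the float32 partials only ever occupy a fraction of the full tile's SMEM at a time.

```python
# Illustrative arithmetic only: subtiling O along the head dimension
# halves the peak SMEM staging needed for float32 partial outputs.
tile_m, headdim_v = 128, 128

full_o_fp32 = tile_m * headdim_v * 4              # whole O tile staged at once
subtiled_o_fp32 = tile_m * (headdim_v // 2) * 4   # one half-chunk at a time

assert subtiled_o_fp32 == full_o_fp32 // 2        # same data, half the peak SMEM
print(full_o_fp32, subtiled_o_fp32)               # 65536 32768
```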

tridao merged commit bbe25ba into Dao-AILab:main Mar 12, 2026
MatthewBonanni deleted the fix_splitkv_oom branch Mar 12, 2026 17:59
5t4r1i9ht pushed a commit to 5t4r1i9ht/flash-attention that referenced this pull request Mar 15, 2026
zhuochenKIDD pushed a commit to zhuochenKIDD/flash-attention that referenced this pull request Mar 25, 2026
NJX-njx pushed a commit to NJX-njx/flash-attention that referenced this pull request Mar 28, 2026