
Conversation


@petercad petercad commented Oct 4, 2025

This PR updates FlashAttention to the new copy/MMA atoms.

Changes:

  • Prefill and decode unified into a single implementation, allowing simultaneous K and Q subgroup-level parallelization rather than an either-or.
  • GEMMs and softmax grouped together and the full k loop consolidated into an FMHA mainloop class.
    • This will facilitate further manual pipelining/overlap of GEMM with softmax.
  • Use new copy/MMA atoms and reorders to transparently support arbitrary data types.
  • Automatic copy/MMA operator selection.

Current status: prefill/decode examples are working, with performance similar to or better than the old examples.

Known issues:

  • Head size 192 decode config doesn't compile yet. Edit: fixed.
  • Strange SYCL compiler behavior/bug with tSrS->tArP reorder. Apparently the compiler believes there is UB somewhere and will omit a large section of the kernel as a result. For the moment, there's a direct copy as a workaround while I pin down the issue. I'm not able to reproduce this behavior with the reorder in isolation.

Additional features (causal masking, variable sequence lengths, etc.) to be added later.

Reminder: the new atoms require a very recent driver due to necessary IGC fixes/enhancements. Recommended version: ci-comp_igc-30613.

@petercad changed the title from "[Umbrella commit] Re-implement FlashAttention with new Xe atoms" to "Re-implement FlashAttention with new Xe atoms" on Oct 4, 2025

petercad commented Oct 4, 2025

I will break up this large commit into self-contained smaller commits after review is complete.


Why is this here? This isn't FlashAttention-specific, is it?

petercad (Author) replied:

No, it's not. These started as some simple helpers to make copying to/from SLM easier for the epilogue. We could move them, maybe to include/cute/algorithm/cute.hpp, though they should be made more sophisticated (use smaller/larger block sizes as appropriate, automatic fallback to scatter/gather, etc.).

FragSRow k_rem_mask;
int k = get<0>(tKgK(0,0,0,K,0)) + get_sub_group().get_local_id()[0];
for (int i = 0; i < k_rem_mask.size(); i++, k += intel::sg_size) {
  k_rem_mask(i) = (k < shape<0>(K_2D)) ? ElementS(sycl::nan(0u)) : ElementS(-INFINITY);
}

ClarkChin08 commented:

If the original S already contains NaN, then fmin(NaN, NaN) = NaN will propagate the NaN into softmax. This can corrupt the row-wise sum and max, leading to NaN in the final output O. Could we use a better k_rem_mask here to avoid this case?


@petercad petercad Oct 21, 2025


@ClarkChin08 Can you explain your concern a bit more? If original S has a NaN value in bounds, then that indicates either an overflow from very badly scaled data or an inf/NaN input, and there's no safe way to numerically recover from that (we can't easily guess what the right value should be in place of that NaN). If S has a NaN value out of bounds, then the fmin with -inf will produce -inf, so the NaN will be removed and not corrupt the softmax value.


I agree that if NaN appears in the valid range of S, it's likely a symptom of upstream issues like bad scaling or invalid inputs, and trying to "fix" it in the kernel can be tricky, especially in low-precision formats like fp8/fp4 where overflows are common.
Perhaps adding an optional debug mode to scan for NaNs/invalid inputs in S before softmax could help users identify issues early.

petercad (Author) replied:

I see, yes, that could be helpful. Perhaps this could take the form of a general helper function that scans for NaNs in an arbitrary tensor and aborts if any are found.


ClarkChin08 commented Oct 23, 2025

The following command encounters an accuracy issue (Disposition: Failed) with seq_len_kv=256

output [991]: 2.696791 vs -nan

./examples/06_bmg_flash_attention/06_xe_fmha_fwd_decode_hdim128 --iterations=10 --batch=1 --num_heads_q=8 --seq_len_kv=256 --seq_len_qo=1 --num_heads_kv=8

However, when seq_len_kv is changed to 512 or higher, the example passes successfully.


petercad commented Oct 23, 2025

> The following command encounters an accuracy issue (Disposition: Failed) with seq_len_kv=256

@ClarkChin08 I pushed a patch to fix issues like this earlier today. I double-checked your test case, and it's passing on my system; can you double-check with the latest commit?

@petercad force-pushed the petercad/rearch_sdpa branch from af2f402 to 326669e on October 23, 2025 03:54
@ClarkChin08

> > The following command encounters an accuracy issue (Disposition: Failed) with seq_len_kv=256
>
> @ClarkChin08 I pushed a patch to fix issues like this earlier today. I double-checked your test case, and it's passing on my system; can you double-check with the latest commit?

Yes, passed now.

@petercad

Note: the CI is currently failing with compile-time divide-by-zero errors, but I can't reproduce the errors locally with any compiler/compile flags. If anyone can, let me know.

@petercad force-pushed the petercad/rearch_sdpa branch from f767eb5 to 10b0c97 on October 27, 2025 21:56
@petercad

> Note: the CI is currently failing with compile-time divide-by-zero errors, but I can't reproduce the errors locally with any compiler/compile flags. If anyone can, let me know.

Didn't realize CI was merging branches into main prior to testing. Thanks to @rolandschulz for helping figure this out.

Branch is rebased now and split into a logical set of patches.

@petercad force-pushed the petercad/rearch_sdpa branch 2 times, most recently from b0e30f4 to 7dd479b on October 27, 2025 23:19
@tdeng5 tdeng5 added the release label Oct 28, 2025
@petercad force-pushed the petercad/rearch_sdpa branch from 7dd479b to 460d34a on October 28, 2025 15:37
  auto _0E0 = ScaledBasis<C<0>,0>{};
  auto flayout = filter(flatten(layout));
- return inner_product_atuple_max(shape(flayout), stride(flayout));
+ auto coshape = inner_product_atuple_max(shape(flayout), stride(flayout)) + _0E0 + _0E0;


Why do you add 0 twice?


@petercad petercad Oct 28, 2025


It's a trick to ensure we get a tuple type for coshape. For a trivial layout (_1:_0), inner_product_atuple_max returns a plain number. Adding 0@0 (_0E0) turns it into a ScaledBasis type, and adding 0@0 again turns that into a tuple, (0). In the general case, inner_product_atuple_max already returns a tuple, and adding 0@0 has no effect.


const int k_blocks = cute::ceil_div(s.seq_len_kv, get<1>(TileShapeQK{}));

auto shape_Q = make_shape(s.seq_len_qo, s.head_size_qk, s.num_heads_q, s.batch);

@sunjiweiswift sunjiweiswift Oct 29, 2025


The current layout is HND (batch, num_heads, seq_len, head_size). This format is not used by vLLM or SGLang; they use the NHD format (batch, seq_len, num_heads, head_size) instead. I would like to provide an example of NHD support, since NHD is the layout ultimately used.


@petercad petercad Oct 29, 2025


Only a minor tweak is needed for NHD support -- you would keep the shapes and kernel the same, and set the strides appropriately on this line. I added a comment there.


Labels

enhancement (New feature or request), release, urgent (PR requires urgent attention, for release or blocking another PR)


8 participants