fix: use LSE accum strides from params instead of hardcoded ones by ZeronSix · Pull Request #2388 · Dao-AILab/flash-attention

ZeronSix · 2026-03-24T15:57:03Z

In the Split-KV path, the forward kernel computes LSE accumulator addresses using hardcoded strides instead of the stride values provided in the params structure. The combine kernel already uses the explicit strides from params, so this creates an inconsistency between the two kernels.

As a result, when the caller supplies an LSE accumulator layout that differs from the layout assumed by the forward kernel, the forward pass writes to incorrect locations and produces wrong output.

This change updates the forward kernel to use the LSE accumulator strides from params, matching the behavior of the combine kernel and ensuring correct results for arbitrary accumulator layouts.

tridao · 2026-03-25T10:40:03Z

Thanks!

fix: use LSE accum strides from params instead of hardcoded ones

e0a72b7

tridao approved these changes Mar 25, 2026


tridao merged commit 28ef22c into Dao-AILab:main Mar 25, 2026

tridao merged 1 commit into Dao-AILab:main from ZeronSix:fix-lse-strides