
fix: use LSE accum strides from params instead of hardcoded ones #2388

Merged
tridao merged 1 commit into Dao-AILab:main from ZeronSix:fix-lse-strides on Mar 25, 2026
@ZeronSix
Contributor

In the Split-KV path, the forward kernel computes LSE accumulator addresses using hardcoded strides instead of the stride values provided in the params structure. The combine kernel already uses the explicit strides from params, so this creates an inconsistency between the two kernels.

As a result, when the caller supplies an LSE accumulator layout that differs from the layout assumed by the forward kernel, the forward pass writes to incorrect locations and produces wrong output.

This change updates the forward kernel to use the LSE accumulator strides from params, matching the behavior of the combine kernel and ensuring correct results for arbitrary accumulator layouts.
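
To illustrate the mismatch described above, here is a minimal Python sketch rather than the actual kernel code. Everything in it is an assumption made for illustration: the `Params` field names, the contiguous (split, batch, head, row) order used as the "hardcoded" layout, and the helper names. The PR description does not state which layout the forward kernel assumed. The sketch has the "forward" side write one tagged entry per (split, batch, head, row), then has the "combine" side read the entries back using the strides from params and count how many land in the wrong place.

```python
from dataclasses import dataclass


@dataclass
class Params:
    # Hypothetical stand-in for the kernel params structure;
    # the field names are illustrative, not the library's.
    num_splits: int
    batch: int
    heads: int
    seqlen_q: int
    lse_split_stride: int
    lse_batch_stride: int
    lse_head_stride: int
    lse_row_stride: int


def offset_hardcoded(p, split, b, h, row):
    # Assumed "hardcoded" addressing: contiguous (split, batch, head, row).
    return ((split * p.batch + b) * p.heads + h) * p.seqlen_q + row


def offset_from_params(p, split, b, h, row):
    # Addressing driven entirely by the strides supplied in params.
    return (split * p.lse_split_stride + b * p.lse_batch_stride
            + h * p.lse_head_stride + row * p.lse_row_stride)


def indices(p):
    # Every (split, batch, head, row) coordinate of the LSE accumulator.
    for s in range(p.num_splits):
        for b in range(p.batch):
            for h in range(p.heads):
                for r in range(p.seqlen_q):
                    yield s, b, h, r


def forward_write(p, offset_fn):
    # "Forward" pass: store a tag at the address chosen by offset_fn.
    buf = [None] * (p.num_splits * p.batch * p.heads * p.seqlen_q)
    for idx in indices(p):
        buf[offset_fn(p, *idx)] = idx
    return buf


def combine_read_mismatches(p, buf):
    # "Combine" pass: always reads with the strides from params.
    return sum(buf[offset_from_params(p, *idx)] != idx for idx in indices(p))


# Caller-chosen layout (batch, head, split, row), which differs from the
# layout that offset_hardcoded assumes.
caller_layout = Params(num_splits=2, batch=1, heads=2, seqlen_q=3,
                       lse_split_stride=3, lse_batch_stride=12,
                       lse_head_stride=6, lse_row_stride=1)

print(combine_read_mismatches(caller_layout,
                              forward_write(caller_layout, offset_hardcoded)))    # 6
print(combine_read_mismatches(caller_layout,
                              forward_write(caller_layout, offset_from_params)))  # 0
```

In this toy setup, half of the twelve entries are read from the wrong slot when the writer ignores the supplied strides, and none are when both sides use them. If the caller's strides happen to describe the same contiguous layout the writer assumes, the two addressing functions agree, which is why this kind of mismatch only surfaces for non-default accumulator layouts.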

tridao merged commit 28ef22c into Dao-AILab:main on Mar 25, 2026
@tridao
Member

tridao commented Mar 25, 2026

Thanks!
