Fix illegal memory access in FA2 varlen SplitKV early-exit LSE write #139
Open
wangyxbh wants to merge 1 commit into
Signed-off-by: wangyxbh <wangyxbh@digitalchina.com>
74cd0aa to 17b8ccc
Summary
Fix the LSE write offset used by the FA2 SplitKV early-exit path when writing directly to unpadded softmax_lse.
Upstream varlen forward stores LSE in the packed unpadded layout by setting params.unpadded_lse = true. The normal epilogue path already handles this layout, but the SplitKV early-exit path still uses the padded (batch, head, seqlen_q) offset.
Problem
When running a Qwen 235B model on a single node with 8x L40S GPUs after enabling DCP, execution can hang and eventually report:
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
The issue is in the FA2 SplitKV early-exit path. It always computes row_offset_lseaccum with the padded SplitKV-style layout:
((n_split_idx * b + bidb) * h + bidh) * seqlen_q + m_block * kBlockM
That layout is correct for SplitKV/LSE accumulation buffers, but not for direct unpadded LSE output in varlen mode.
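A small worked example may make the mismatch concrete. The sketch below is host-side C++ rather than kernel code: it transcribes the padded formula above and evaluates it at compile time for a hypothetical varlen batch (three sequences with query lengths 5, 1, and 2, and two heads). The helper name padded_lse_offset, the batch shape, the assumption that seqlen_q equals the longest sequence length, and the assumption that the packed softmax_lse buffer holds num_heads * total_q elements are all illustrative and not taken from the patch.

```cpp
#include <cstdint>

// Padded SplitKV-style row offset, transcribed from the formula above.
constexpr std::int64_t padded_lse_offset(std::int64_t n_split_idx, std::int64_t b,
                                         std::int64_t bidb, std::int64_t h,
                                         std::int64_t bidh, std::int64_t seqlen_q,
                                         std::int64_t m_block, std::int64_t kBlockM) {
    return ((n_split_idx * b + bidb) * h + bidh) * seqlen_q + m_block * kBlockM;
}

// Hypothetical varlen batch: 3 sequences with query lengths 5, 1 and 2.
constexpr std::int64_t kBatch = 3;       // b
constexpr std::int64_t kHeads = 2;       // h
constexpr std::int64_t kMaxSeqlenQ = 5;  // seqlen_q, assumed to be the longest sequence
constexpr std::int64_t kTotalQ = 8;      // total_q = 5 + 1 + 2

// Assumed element count of a packed {num_heads, total_q} LSE buffer.
constexpr std::int64_t kPackedElems = kHeads * kTotalQ;  // 16

// Last sequence (bidb = 2), last head (bidh = 1), first row block, split index 0.
constexpr std::int64_t kOffset =
    padded_lse_offset(0, kBatch, 2, kHeads, 1, kMaxSeqlenQ, 0, 64);

static_assert(kOffset == 25, "((0 * 3 + 2) * 2 + 1) * 5 + 0 * 64");
static_assert(kOffset >= kPackedElems, "padded offset lands past the packed buffer");
```

For the last sequence and last head, the padded formula points at element 25 of a 16-element buffer under these assumptions, which is the kind of out-of-bounds write that can surface as an illegal memory access.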
For varlen forward, softmax_lse is allocated as {num_heads, total_q} and params.unpadded_lse is set to true. In this case, the early-exit path must use the same packed varlen LSE layout as the normal epilogue path.
Fix
Use the packed varlen LSE offset when writing directly to unpadded softmax_lse:
bidh * params.total_q + binfo.q_offset(params.seqlen_q, 1, bidb) + m_block * kBlockM
Keep the existing padded offset for SplitKV accumulation buffers and padded LSE output.
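The selection rule described above could be sketched as follows. This is a host-side C++ model of the index arithmetic only, not the kernel change itself. The struct LseParams, the function early_exit_lse_offset, and the split flag are hypothetical names introduced for illustration, and q_start stands in for whatever binfo.q_offset(params.seqlen_q, 1, bidb) returns, assumed here to be the first packed query row of sequence bidb in varlen mode.

```cpp
#include <cstdint>

// Hypothetical stand-in for the kernel parameters used by this sketch.
struct LseParams {
    std::int64_t b;         // batch size
    std::int64_t h;         // number of heads
    std::int64_t seqlen_q;  // padded (maximum) query length
    std::int64_t total_q;   // total query rows across the packed batch
    bool unpadded_lse;      // true when softmax_lse uses the packed varlen layout
};

// Row offset for the early-exit LSE write.
// split:   true when writing into a SplitKV accumulation buffer (assumed flag).
// q_start: assumed result of binfo.q_offset(params.seqlen_q, 1, bidb).
constexpr std::int64_t early_exit_lse_offset(const LseParams& p, bool split,
                                             std::int64_t n_split_idx, std::int64_t bidb,
                                             std::int64_t bidh, std::int64_t q_start,
                                             std::int64_t m_block, std::int64_t kBlockM) {
    return (split || !p.unpadded_lse
                // Padded layout: accumulation buffers and padded LSE output.
                ? ((n_split_idx * p.b + bidb) * p.h + bidh) * p.seqlen_q
                // Packed layout: direct write to unpadded softmax_lse.
                : bidh * p.total_q + q_start)
           + m_block * kBlockM;
}

// Hypothetical varlen batch: query lengths 5, 1 and 2, so sequence 2 starts at row 6.
constexpr LseParams kVarlen{3, 2, 5, 8, true};

// Direct unpadded write: 1 * 8 + 6 + 0 = 14, inside the 2 * 8 = 16 element buffer.
static_assert(early_exit_lse_offset(kVarlen, false, 0, 2, 1, 6, 0, 64) == 14,
              "packed offset stays in bounds");
// SplitKV accumulation write keeps the padded offset: ((0 * 3 + 2) * 2 + 1) * 5 = 25.
static_assert(early_exit_lse_offset(kVarlen, true, 0, 2, 1, 6, 0, 64) == 25,
              "accumulation buffers keep the padded layout");
```

With these hypothetical shapes, the direct varlen write lands at element 14 of the 16-element packed buffer, while SplitKV accumulation writes and padded LSE output continue to use the padded offset.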
Testing
Not run locally.