Fix Illegal Instruction/IMA errors when using DP attention with DeepSeek-V3.2 models #12052

YAMY1234 wants to merge 7 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist)

This pull request addresses and resolves illegal memory access (IMA) errors encountered when using data-parallel (DP) attention with DeepSeek-V3.2 models. The core issue stemmed from incorrect metadata being used during the logits stage of DP gather operations, leading to out-of-bounds memory access in Triton kernels. The solution refines the selection of token-count metadata based on the operational context (logits vs. attention/MLP stages) and introduces a fallback to the actual tensor size, ensuring correct memory handling and stable model execution.
Code Review
This pull request effectively resolves an illegal memory access error in DP attention for DeepSeek models. The core fix correctly selects the token count metadata based on the calling context (logits vs. attention/MLP), which is a crucial distinction. Additionally, the change to use the actual tensor size as a safety net in _dp_gather_via_all_reduce is a solid defensive programming practice. I have one minor suggestion to improve code readability by simplifying a conditional check.
Resolving #11942
```python
if forward_batch.dp_local_start_pos is None:
    cumtokens = torch.cumsum(forward_batch.global_num_tokens_gpu, dim=0)
    # Select metadata source based on context
    if (
```
How can `dp_local_start_pos` be `None` in `LogitsMetadata`? It is expected to be processed here:

`LogitsMetadata` has separate processing logic, and we should keep it separated. Perhaps we should add an assertion like `assert isinstance(forward_batch, ForwardBatch)` to avoid misuse.
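The suggested guard might look like the following sketch. This is illustrative only: `ForwardBatch` and `LogitsMetadata` here are minimal stand-ins, not the actual sglang classes, and the exact placement of the assertion is up to the patch.

```python
# Sketch of the reviewer's suggested guard: get_dp_local_info() should fail
# fast when handed a LogitsMetadata, which has its own processing path.
# ForwardBatch and LogitsMetadata are minimal stand-ins, not the real
# sglang classes.

class ForwardBatch:
    def __init__(self, global_num_tokens_gpu):
        self.global_num_tokens_gpu = global_num_tokens_gpu


class LogitsMetadata:
    pass


def get_dp_local_info(forward_batch):
    # Reject misuse instead of silently reading the wrong metadata source.
    assert isinstance(forward_batch, ForwardBatch), (
        "get_dp_local_info() expects a ForwardBatch; "
        "LogitsMetadata is processed separately"
    )
    return forward_batch.global_num_tokens_gpu


print(get_dp_local_info(ForwardBatch([8, 8])))
```

Failing fast here surfaces the misuse at the call site instead of letting the wrong token counts propagate into the Triton gather kernel.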
Thanks, I understand the design! The call logic should indeed stay separated at the outer level. Will have another patch soon, once verified.
@ch-wan Just pushed another patch, could you take another look? Thanks
Root fix is now in a new PR: #12115, which resolves the issue in a more robust way. Closing this PR.
Motivation
Fix illegal memory access (IMA) errors when using DP attention with DeepSeek-V3.2 models.
Issue: When running with `--enable-dp-attention`, the DP gather operation in the logits stage uses incorrect metadata, causing the Triton kernel to access out-of-bounds memory.

Root cause: The `get_dp_local_info()` function always uses `global_num_tokens_gpu` (total tokens for attention/MLP), but in the logits stage it should use `global_num_tokens_for_logprob_gpu` (the pruned tokens that need logits computation).

Modifications
1. Fix metadata source selection in `get_dp_local_info()`

File: `python/sglang/srt/layers/dp_attention.py`

Added logic to distinguish between two calling contexts:

- Logits stage: use `global_num_tokens_for_logprob_gpu`
- Attention/MLP stage: use `global_num_tokens_gpu`

This ensures the offset calculation uses the appropriate metadata field.
2. Use actual tensor size in `_dp_gather_via_all_reduce()`

File: `python/sglang/srt/layers/dp_attention.py`

Override `local_num_tokens` with `local_tokens.shape[0]` before calling `memcpy_triton`, using an in-place `.fill_()` to maintain CUDA graph compatibility.

This provides a safety net, since the scheduler's `num_tokens_for_logprob` calculation doesn't account for logits_processor pruning.

Accuracy Tests
Benchmarking and Profiling
Checklist