
Fix Illegal Instruction/IMA errors when using DP attention with DeepSeek-V3.2 models#12052

Closed
YAMY1234 wants to merge 7 commits into sgl-project:main from YAMY1234:fix_gpqa
Closed

Conversation

@YAMY1234
Contributor

@YAMY1234 commented Oct 24, 2025

Motivation

Fix illegal memory access (IMA) errors when using DP attention with DeepSeek-V3.2 models.

Issue: When running with --enable-dp-attention, the DP gather operation in the logits stage uses incorrect metadata, causing the Triton kernel to access out-of-bounds memory.

Root cause: The get_dp_local_info() function always uses global_num_tokens_gpu (total tokens for attention/MLP), but in logits stage it should use global_num_tokens_for_logprob_gpu (pruned tokens needing logits computation).

Modifications

1. Fix metadata source selection in get_dp_local_info()

File: python/sglang/srt/layers/dp_attention.py

Added logic to distinguish between two calling contexts:

  • LogitsMetadata: Use global_num_tokens_for_logprob_gpu (correct for logits stage)
  • ForwardBatch: Use global_num_tokens_gpu (correct for attention/MLP stage)

This ensures offset calculation uses the appropriate metadata field.
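The selection logic can be sketched as follows. This is a simplified illustration: plain Python lists stand in for the GPU tensors sglang actually uses, and `select_global_num_tokens` / `dp_local_info` are illustrative helper names, not the PR's exact code.

```python
from itertools import accumulate

def select_global_num_tokens(metadata, is_logits_stage):
    """Pick the per-rank token counts for the calling context.

    Logits stage (LogitsMetadata): the pruned counts, i.e. only the
    tokens that need logits computation. Attention/MLP stage
    (ForwardBatch): the full token counts. Mixing these up makes the
    gather offsets overrun the logits buffer, which is the IMA this
    PR fixes.
    """
    if is_logits_stage:
        return metadata.global_num_tokens_for_logprob_gpu
    return metadata.global_num_tokens_gpu

def dp_local_info(global_num_tokens, dp_rank):
    """Compute this DP rank's (start, length) slice of the gathered buffer."""
    cumtokens = list(accumulate(global_num_tokens))
    start = 0 if dp_rank == 0 else cumtokens[dp_rank - 1]
    return start, global_num_tokens[dp_rank]
```

For example, with per-rank counts [4, 2, 3], rank 1's slice starts at offset 4 and spans 2 tokens; if the larger attention-stage counts were used during the logits stage, the computed offsets would exceed the pruned logits buffer.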

2. Use actual tensor size in _dp_gather_via_all_reduce()

File: python/sglang/srt/layers/dp_attention.py

Overrides local_num_tokens with local_tokens.shape[0] before calling memcpy_triton, using in-place .fill_() to maintain CUDA graph compatibility.

This provides a safety net since scheduler's num_tokens_for_logprob calculation doesn't account for logits_processor pruning.
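The safety net amounts to an in-place update of the preallocated count tensor. A minimal sketch (the helper name `override_local_num_tokens` is illustrative, not the PR's code):

```python
import torch

def override_local_num_tokens(local_num_tokens, local_tokens):
    """Force the metadata count to match the tensor actually being copied.

    local_num_tokens is a preallocated tensor that a captured CUDA graph
    may already reference, so it must be mutated in place with fill_();
    rebinding the name to a fresh tensor (local_num_tokens =
    torch.tensor(...)) would change the storage address and silently
    break graph replay.
    """
    local_num_tokens.fill_(local_tokens.shape[0])
```

After this call, memcpy_triton copies exactly local_tokens.shape[0] rows, regardless of what the scheduler's num_tokens_for_logprob predicted before logits_processor pruning.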

Accuracy Tests

python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 36000 --repeat 8 --thinking-mode deepseek-v3
Scores: ['0.798', '0.803', '0.803', '0.798', '0.788', '0.778', '0.773', '0.798']
====================
Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.html
{'chars': np.float64(14306.5), 'chars:std': np.float64(11493.867823449567), 'score:std': np.float64(0.4015072103909452), 'score': np.float64(0.797979797979798)}
Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.json
Total latency: 2573.581 s
Score: 0.798
Repeat: 8, mean: 0.799
Scores: ['0.793', '0.798', '0.813', '0.808', '0.813', '0.783', '0.768', '0.818']
====================
Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.html
{'chars': np.float64(14737.838383838383), 'chars:std': np.float64(11840.990410141889), 'score:std': np.float64(0.38569460791993493), 'score': np.float64(0.8181818181818182)}
Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.json
Total latency: 1584.823 s
Score: 0.818

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @YAMY1234, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves illegal memory access (IMA) errors encountered when using Distributed Parallel (DP) attention with DeepSeek-V3.2 models. The core issue stemmed from incorrect metadata being used during the logits stage of DP gather operations, leading to out-of-bounds memory access in Triton kernels. The solution involves refining the selection of token count metadata based on the operational context (logits vs. attention/MLP stages) and introducing a robust fallback mechanism to use the actual tensor size, ensuring correct memory handling and stable model execution.

Highlights

  • Fix metadata source selection in get_dp_local_info(): Added logic to distinguish between two calling contexts: LogitsMetadata (uses global_num_tokens_for_logprob_gpu) and ForwardBatch (uses global_num_tokens_gpu). This ensures offset calculation uses the appropriate metadata field, preventing illegal memory access errors.
  • Use actual tensor size in _dp_gather_via_all_reduce(): Overrides local_num_tokens with local_tokens.shape[0] before calling memcpy_triton. This provides a safety net for cases where the scheduler's num_tokens_for_logprob calculation doesn't account for logits_processor pruning, and uses in-place .fill_() for CUDA graph compatibility.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves an illegal memory access error in DP attention for DeepSeek models. The core fix correctly selects the token count metadata based on the calling context (logits vs. attention/MLP), which is a crucial distinction. Additionally, the change to use the actual tensor size as a safety net in _dp_gather_via_all_reduce is a solid defensive programming practice. I have one minor suggestion to improve code readability by simplifying a conditional check.

@YAMY1234
Contributor Author

Resolving #11942

if forward_batch.dp_local_start_pos is None:
    cumtokens = torch.cumsum(forward_batch.global_num_tokens_gpu, dim=0)
    # Select metadata source based on context
    if (
Collaborator


How can dp_local_start_pos be None in LogitsMetadata? It is expected to be processed in logits_metadata.compute_dp_attention_metadata().

LogitsMetadata has separate processing logic, and we should keep it separated. Perhaps we should add an assertion like assert isinstance(forward_batch, ForwardBatch) to avoid misuse.
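The suggested guard could look like the following sketch, with placeholder class stubs standing in for sglang's real ForwardBatch and LogitsMetadata:

```python
class ForwardBatch:
    """Placeholder stub for sglang's ForwardBatch."""

class LogitsMetadata:
    """Placeholder stub for sglang's LogitsMetadata."""

def get_dp_local_info(forward_batch):
    # Fail fast if a LogitsMetadata leaks into this ForwardBatch-only
    # path; LogitsMetadata is handled separately via
    # compute_dp_attention_metadata().
    assert isinstance(forward_batch, ForwardBatch), (
        "get_dp_local_info expects a ForwardBatch"
    )
    # ... offset computation for the attention/MLP stage continues here ...
```

This keeps the two code paths cleanly separated instead of branching on the caller's type inside get_dp_local_info.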

Contributor Author


Thanks, I understand the design! The calling logic should indeed stay separated at the outer level. I'll push another patch once it's verified.

Contributor Author


@ch-wan Just pushed another patch, could you take another look? Thanks

@YAMY1234
Contributor Author

The root fix is now in a new PR: #12115, which resolves the issue in a more robust way. Closing this PR.

@YAMY1234 YAMY1234 closed this Oct 25, 2025