
Fix Illegal Instruction/IMA errors when using DP attention with DeepSeek-V3.2 models#12052

Closed
YAMY1234 wants to merge 7 commits into sgl-project:main from YAMY1234:fix_gpqa
Closed

Conversation

@YAMY1234
Contributor

@YAMY1234 commented Oct 24, 2025

Motivation

Fix illegal memory access (IMA) errors when using DP attention with DeepSeek-V3.2 models.

Issue: When running with --enable-dp-attention, the DP gather operation in the logits stage uses incorrect metadata, causing the Triton kernel to access out-of-bounds memory.

Root cause: The get_dp_local_info() function always uses global_num_tokens_gpu (total tokens for attention/MLP), but in logits stage it should use global_num_tokens_for_logprob_gpu (pruned tokens needing logits computation).

Modifications

1. Fix metadata source selection in get_dp_local_info()

File: python/sglang/srt/layers/dp_attention.py

Added logic to distinguish between two calling contexts:

  • LogitsMetadata: Use global_num_tokens_for_logprob_gpu (correct for logits stage)
  • ForwardBatch: Use global_num_tokens_gpu (correct for attention/MLP stage)

This ensures offset calculation uses the appropriate metadata field.
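The selection logic can be sketched as follows. This is a simplified illustration: plain Python lists stand in for the GPU tensors sglang actually uses, and `select_global_num_tokens` / `dp_local_info` are illustrative helper names, not the PR's exact code.

```python
from itertools import accumulate

def select_global_num_tokens(metadata, is_logits_stage):
    """Pick the per-rank token counts for the calling context.

    Logits stage (LogitsMetadata): the pruned counts, i.e. only the
    tokens that need logits computation. Attention/MLP stage
    (ForwardBatch): the full token counts. Mixing these up makes the
    gather offsets overrun the logits buffer, which is the IMA this
    PR fixes.
    """
    if is_logits_stage:
        return metadata.global_num_tokens_for_logprob_gpu
    return metadata.global_num_tokens_gpu

def dp_local_info(global_num_tokens, dp_rank):
    """Compute this DP rank's (start, length) slice of the gathered buffer."""
    cumtokens = list(accumulate(global_num_tokens))
    start = 0 if dp_rank == 0 else cumtokens[dp_rank - 1]
    return start, global_num_tokens[dp_rank]
```

For example, with per-rank counts [4, 2, 3], rank 1's slice starts at offset 4 and spans 2 tokens; if the larger attention-stage counts were used during the logits stage, the computed offsets would exceed the pruned logits buffer.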

2. Use actual tensor size in _dp_gather_via_all_reduce()

File: python/sglang/srt/layers/dp_attention.py

Overrides local_num_tokens with local_tokens.shape[0] before calling memcpy_triton, using in-place .fill_() to maintain CUDA graph compatibility.

This provides a safety net since scheduler's num_tokens_for_logprob calculation doesn't account for logits_processor pruning.
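The safety net amounts to an in-place update of the preallocated count tensor. A minimal sketch (the helper name `override_local_num_tokens` is illustrative, not the PR's code):

```python
import torch

def override_local_num_tokens(local_num_tokens, local_tokens):
    """Force the metadata count to match the tensor actually being copied.

    local_num_tokens is a preallocated tensor that a captured CUDA graph
    may already reference, so it must be mutated in place with fill_();
    rebinding the name to a fresh tensor (local_num_tokens =
    torch.tensor(...)) would change the storage address and silently
    break graph replay.
    """
    local_num_tokens.fill_(local_tokens.shape[0])
```

After this call, memcpy_triton copies exactly local_tokens.shape[0] rows, regardless of what the scheduler's num_tokens_for_logprob predicted before logits_processor pruning.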

Accuracy Tests

python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 36000 --repeat 8 --thinking-mode deepseek-v3
Scores: ['0.798', '0.803', '0.803', '0.798', '0.788', '0.778', '0.773', '0.798']
====================
Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.html
{'chars': np.float64(14306.5), 'chars:std': np.float64(11493.867823449567), 'score:std': np.float64(0.4015072103909452), 'score': np.float64(0.797979797979798)}
Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.json
Total latency: 2573.581 s
Score: 0.798
Repeat: 8, mean: 0.799
Scores: ['0.793', '0.798', '0.813', '0.808', '0.813', '0.783', '0.768', '0.818']
====================
Writing report to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.html
{'chars': np.float64(14737.838383838383), 'chars:std': np.float64(11840.990410141889), 'score:std': np.float64(0.38569460791993493), 'score': np.float64(0.8181818181818182)}
Writing results to /tmp/gpqa_deepseek-ai_DeepSeek-V3.2-Exp.json
Total latency: 1584.823 s
Score: 0.818

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @YAMY1234, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves illegal memory access (IMA) errors encountered when using Distributed Parallel (DP) attention with DeepSeek-V3.2 models. The core issue stemmed from incorrect metadata being used during the logits stage of DP gather operations, leading to out-of-bounds memory access in Triton kernels. The solution involves refining the selection of token count metadata based on the operational context (logits vs. attention/MLP stages) and introducing a robust fallback mechanism to use the actual tensor size, ensuring correct memory handling and stable model execution.

Highlights

  • Fix metadata source selection in get_dp_local_info(): Added logic to distinguish between two calling contexts: LogitsMetadata (uses global_num_tokens_for_logprob_gpu) and ForwardBatch (uses global_num_tokens_gpu). This ensures offset calculation uses the appropriate metadata field, preventing illegal memory access errors.
  • Use actual tensor size in _dp_gather_via_all_reduce(): Overrides local_num_tokens with local_tokens.shape[0] before calling memcpy_triton. This provides a safety net for cases where the scheduler's num_tokens_for_logprob calculation doesn't account for logits_processor pruning, and uses in-place .fill_() for CUDA graph compatibility.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request effectively resolves an illegal memory access error in DP attention for DeepSeek models. The core fix correctly selects the token count metadata based on the calling context (logits vs. attention/MLP), which is a crucial distinction. Additionally, the change to use the actual tensor size as a safety net in _dp_gather_via_all_reduce is a solid defensive programming practice. I have one minor suggestion to improve code readability by simplifying a conditional check.

@YAMY1234
Contributor Author

Resolving #11942

if forward_batch.dp_local_start_pos is None:
    cumtokens = torch.cumsum(forward_batch.global_num_tokens_gpu, dim=0)
    # Select metadata source based on context
    if (
Collaborator


How can dp_local_start_pos be None in LogitsMetadata? It is expected to be processed in logits_metadata.compute_dp_attention_metadata().

LogitsMetadata has separate processing logic, and we should keep it separated. Perhaps we should add an assertion like assert isinstance(forward_batch, ForwardBatch) to avoid misuse.
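The suggested guard could look like the following sketch, with placeholder class stubs standing in for sglang's real ForwardBatch and LogitsMetadata:

```python
class ForwardBatch:
    """Placeholder stub for sglang's ForwardBatch."""

class LogitsMetadata:
    """Placeholder stub for sglang's LogitsMetadata."""

def get_dp_local_info(forward_batch):
    # Fail fast if a LogitsMetadata leaks into this ForwardBatch-only
    # path; LogitsMetadata is handled separately via
    # compute_dp_attention_metadata().
    assert isinstance(forward_batch, ForwardBatch), (
        "get_dp_local_info expects a ForwardBatch"
    )
    # ... offset computation for the attention/MLP stage continues here ...
```

This keeps the two code paths cleanly separated instead of branching on the caller's type inside get_dp_local_info.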

Contributor Author


Thanks, I understand the design! The calling logic should indeed stay separated at the outer level. I'll push another patch once it's verified.

Contributor Author


@ch-wan Just pushed another patch, could you take another look? Thanks

@YAMY1234
Contributor Author

The root fix is now in a new PR: #12115, which resolves the issue in a more robust way. Closing this PR.

@YAMY1234 YAMY1234 closed this Oct 25, 2025