[Bugfix] Fix illegal memory access #12758
Conversation
@@ -128,7 +128,7 @@ def flashinfer_allreduce_residual_rmsnorm(
     residual: torch.Tensor,
     weight: torch.Tensor,
     eps: float = 1e-6,
-    max_token_num: int = 2048,
+    max_token_num: int = 16384,
Hang issue workaround (WAR): increase max_token_num to allocate a larger workspace.
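As a rough, hypothetical sizing model (not the real flashinfer workspace layout), the workspace scales linearly with `max_token_num`, so raising it from 2048 to 16384 multiplies the allocation by 8:

```python
def workspace_bytes(max_token_num: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    # Hypothetical sizing model for illustration only: one fp16/bf16 buffer
    # of shape [max_token_num, hidden_size]. The real trtllm_allreduce_fusion
    # workspace layout is more involved, but scales the same way with tokens.
    return max_token_num * hidden_size * bytes_per_elem
```

Under this model, `workspace_bytes(16384, h)` is exactly 8x `workspace_bytes(2048, h)` for any hidden size `h`.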
add an assert: assert input_tensor.shape[0] <= max_token_num ?
This should be covered by the code below:
if input_tensor.shape[0] > max_token_num:
    logger.debug(
        "Input token(%d) is greater than max_token_num(%d), "
        "falling back to standard implementation",
        input_tensor.shape[0],
        max_token_num,
    )
    return None, None
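The guard quoted above can be sketched as a standalone helper. This is a hedged illustration, not the actual `flashinfer_allreduce_residual_rmsnorm` implementation; `try_fused_allreduce_rmsnorm` and the placeholder return are hypothetical:

```python
import logging
from typing import Any, Optional, Tuple

logger = logging.getLogger(__name__)

def try_fused_allreduce_rmsnorm(
    input_tensor: Any,  # tensor-like with a .shape attribute, e.g. a torch.Tensor
    max_token_num: int = 16384,
) -> Tuple[Optional[Any], Optional[Any]]:
    # Same guard as in the snippet above: the fused kernel's workspace only
    # covers max_token_num tokens, so larger inputs decline the fused path
    # and the caller falls back to the standard implementation.
    if input_tensor.shape[0] > max_token_num:
        logger.debug(
            "Input token(%d) is greater than max_token_num(%d), "
            "falling back to standard implementation",
            input_tensor.shape[0],
            max_token_num,
        )
        return None, None
    # Placeholder for the fused kernel call; the real code would invoke
    # trtllm_allreduce_fusion and return its outputs here.
    return input_tensor, input_tensor
```

A caller would check for `(None, None)` and run the unfused allreduce + residual RMSNorm path instead, which makes a separate assert on `input_tensor.shape[0]` redundant.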
@elvischenv The GPTOSS CI test failed, please have a look.
This is DeepseekV2. This failure also seems related to #12524. cc @merrymercy
@elvischenv Other PRs can pass the dpsk test.
work around hanging issue of trtllm_allreduce_fusion
fix correctly
The GPTOSS CI test on B200 is passing here.
Motivation
Fixes the illegal memory access issue reported in #12695 and flashinfer-ai/flashinfer#2034, which was caused by #12524.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist