
[Fix] Fix nan error for large scale ep#12866

Merged
Fridge003 merged 3 commits into main from baizhou/fix-dp on Nov 11, 2025
Conversation

Collaborator

@Fridge003 Fridge003 commented Nov 8, 2025

Motivation

Part of #12293

This bug was introduced in #10874, which wrongly removes some of the redundant experts from logical_to_all_physical_map (e.g., the mapping for logical expert 0 should be [0, 256], but was wrongly set to [0, -1]).

This only happens on the first and the last node. On these two nodes, the values of w13_input_scale become NaN:

[2025-11-08 02:52:56 DP2 TP2 EP2] w13_input_scale values: tensor([nan, nan, nan, nan, nan, nan], device='cuda:2', dtype=torch.float32)
[2025-11-08 02:52:56 DP1 TP1 EP1] w13_input_scale values: tensor([nan, nan, nan, nan, nan, nan], device='cuda:1', dtype=torch.float32)
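To make the failure mode concrete, here is a minimal sketch of the difference between the old and the fixed assignment logic. The function names are invented for illustration; this is not the actual sglang code:

```python
# Hypothetical sketch of the bug described above (names invented,
# not the sglang implementation).

def buggy_assign(mapped, nearest):
    # Old logic from the breaking PR, roughly: keep only the nearest
    # expert and pad the remaining redundant slots with -1.
    if nearest != -1:
        return [nearest] + [-1] * (len(mapped) - 1)
    return list(mapped)

def fixed_assign(mapped, nearest):
    # Logic after this PR: only overwrite the first slot, and only if
    # the nearest expert is not already among the mapped experts, so
    # redundant physical experts are preserved.
    if nearest != -1 and nearest not in mapped:
        return [nearest] + list(mapped[1:])
    return list(mapped)

# Logical expert 0 is replicated on physical experts 0 and 256.
print(buggy_assign([0, 256], 0))  # [0, -1]  -> expert 256 is lost
print(fixed_assign([0, 256], 0))  # [0, 256] -> redundancy preserved
```

With the fix, the nearest expert is only written into the first slot when it is not already mapped, so the redundant copy at physical expert 256 survives.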

Later, it hits errors like:

  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py", line 1460, in process_weights_after_loading
    assert torch.all(w13_input_scale == w13_input_scale[0])

This fix resolves the error.

Reproduction:
GB200, latest main branch
https://gist.github.com/kaixih/32bdc4fec4feabe9305d1acb2e1f96db

Modifications

Fix this by keeping the redundant experts on these nodes.


@gemini-code-assist
Contributor

Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request addresses and resolves a NaN error that can occur in large-scale expert parallelism (EP) setups. The core change refines the assignment mechanism for logical experts to physical experts, specifically by introducing a check to prevent the redundant mapping of an already assigned physical expert. This enhancement aims to improve the stability and correctness of computations in distributed environments by ensuring consistent expert allocation.

Highlights

  • Preventing Duplicate Expert Assignments: Modified the expert mapping logic within _compute_logical_to_all_physical_map to ensure that a nearest_expert is only assigned if it's not already present in the list of mapped physical experts. This prevents potential redundant or incorrect assignments that could lead to NaN errors in large-scale expert parallelism.

@gemini-code-assist bot left a comment


Code Review

This pull request aims to fix a NaN error in large-scale expert parallelism by modifying how the nearest physical expert is handled. The previous implementation could reduce expert redundancy, which might have been the source of the issue. However, the new implementation introduces a logical flaw that renders the nearest expert handling logic ineffective. My review identifies this issue, corrects a typo, and proposes a more robust solution to correctly prioritize the nearest expert without losing redundancy.

Comment on lines +358 to +365
mapped_phsical_experts = logical_to_all_physical_map[layer_id][
    logical_expert_id
]
if (
    nearest_expert != -1
    and nearest_expert not in mapped_phsical_experts
):
    mapped_phsical_experts[0] = nearest_expert
Contributor


Severity: high

There are a couple of issues in this block:

  1. There's a typo in the variable name mapped_phsical_experts. It should be mapped_physical_experts.
  2. The condition nearest_expert not in mapped_phsical_experts will always be false when nearest_expert != -1, because the _find_nearest_expert function returns an expert from the candidate list (mapped_phsical_experts). This makes the if block dead code. While this might inadvertently fix the original issue by disabling this logic, it's not a clean solution.
  3. If the intention is to prioritize the nearest_expert by moving it to the front of the list, simply replacing the first element with mapped_phsical_experts[0] = nearest_expert could create duplicates if nearest_expert is already in the list at a different position.

A better approach is to move the nearest_expert to the front of the list. This preserves all assigned experts and correctly prioritizes the nearest one.

Suggested change:

mapped_physical_experts = logical_to_all_physical_map[layer_id][
    logical_expert_id
]
if nearest_expert != -1:
    if nearest_expert in mapped_physical_experts:
        mapped_physical_experts.remove(nearest_expert)
    mapped_physical_experts.insert(0, nearest_expert)
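The reviewer's second point can be checked in isolation: if the nearest expert is always drawn from the already-mapped candidates, the membership guard can never pass. A minimal sketch, where find_nearest_expert is a hypothetical stand-in for the real _find_nearest_expert:

```python
def find_nearest_expert(candidates, rank):
    # Hypothetical stand-in: like the real helper, it only ever
    # returns -1 or one of the candidate physical experts.
    valid = [c for c in candidates if c != -1]
    if not valid:
        return -1
    return min(valid, key=lambda c: abs(c - rank))

mapped = [0, 256]
nearest = find_nearest_expert(mapped, rank=2)

# The guard from the patch can never fire when nearest != -1,
# because nearest is by construction a member of `mapped`.
guard_fires = nearest != -1 and nearest not in mapped
print(nearest, guard_fires)  # 0 False
```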
Collaborator

Why replace the first expert with the nearest?

Collaborator Author

@Fridge003 Fridge003 Nov 8, 2025


No specific reason. I'm checking this with the author of the breaking PR.
It's at least better than the original logic.

Collaborator


Sure. Just curious: the original logic seemed to replace all mappings with the nearest expert that is not -1, but the new logic seems to replace only the first mapping.

Collaborator Author


The original logic expelled some needed experts, which caused this bug.

Collaborator

@Fridge003 what does "some needed experts" mean? Is the first expert enough?

Collaborator Author

For example, on EP rank 0, [0, 256] is initialized. But the prior logic will change it to [0, -1], then expert 256 is missing.

Collaborator

If [0, 256] is initialized, it's possible for it to be changed to [256, 256]; will that still cause the issue?

Collaborator Author

That change would not cause this issue, since 256 would still be in there.
Or do you mean we need to check whether the replaced id is unique?

@wenscarl
Collaborator

With #10874, on the 1st and last node, the map could be:

[[0], [1], [2], ..., [24, 280], [25, 281], ..., [31, 287], [33], [34], ..., [255]]

where physical experts 256 to 279 are missing.

> If [0, 256] is initialized, it's possible to be changed to [256, 256], will it still cause the issue?

Where could 256 be placed?
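A minimal sketch of why missing map entries surface as NaN scales (hypothetical loading logic, not the actual sglang weight loader): per-physical-expert scales are filled by walking the logical-to-physical map, so any physical expert that no longer appears in the map keeps its uninitialized value:

```python
import math

def load_scales(logical_to_physical, num_physical):
    # Scales start uninitialized; modeled here as NaN.
    scales = [math.nan] * num_physical
    for physical_ids in logical_to_physical:
        for pid in physical_ids:
            if pid != -1:
                scales[pid] = 1.0  # stands in for the checkpoint value
    return scales

# Correct map: logical experts 0 and 1 are replicated on 4 and 5.
good_map = [[0, 4], [1, 5], [2, -1], [3, -1]]
# Buggy map: the redundant copies were overwritten with -1.
buggy_map = [[0, -1], [1, -1], [2, -1], [3, -1]]

print(load_scales(good_map, 6))   # all six slots filled
print(load_scales(buggy_map, 6))  # slots 4 and 5 stay NaN, which later
                                  # trips the all-equal scale assertion
```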

@wenscarl
Collaborator

Another question: w13_input_scale is supposed to be independent of the physical map, since each expert reads in 288 values anyway.

@kaixih
Collaborator

kaixih commented Nov 11, 2025

> Another question: w13_input_scale is supposed to be independent of the physical map, since each expert reads in 288 values anyway.

But doesn't it need the map to locate the 288 -> 256 logical experts and then load the weights (including the input scale) from them?

@Fridge003
Collaborator Author

> Another question: w13_input_scale is supposed to be independent of the physical map, since each expert reads in 288 values anyway.

> But doesn't it need the map to locate the 288 -> 256 logical experts and then load the weights (including the input scale) from them?

Yeah. If some experts are missing, it may produce NaN values.


@Fridge003
Collaborator Author

Since CIs are all green, let's merge this first to unblock some ongoing tasks for GB200.
If there is a better solution, we can open it in a follow-up PR.
@kaixih @wenscarl @acelyc111

@Fridge003 Fridge003 merged commit 99e2580 into main Nov 11, 2025
120 of 127 checks passed
@Fridge003 Fridge003 deleted the baizhou/fix-dp branch November 11, 2025 22:44
wenscarl added a commit to wenscarl/sglang that referenced this pull request Nov 12, 2025