Conversation
Summary of Changes

Hello @Fridge003, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a NaN error that can occur in large-scale expert parallelism (EP) setups. The core change refines the assignment of logical experts to physical experts by introducing a check that prevents an already assigned physical expert from being redundantly remapped. This improves the stability and correctness of computation in distributed environments by ensuring consistent expert allocation.
Code Review
This pull request aims to fix a NaN error in large-scale expert parallelism by modifying how the nearest physical expert is handled. The previous implementation could reduce expert redundancy, which might have been the source of the issue. However, the new implementation introduces a logical flaw that renders the nearest expert handling logic ineffective. My review identifies this issue, corrects a typo, and proposes a more robust solution to correctly prioritize the nearest expert without losing redundancy.
```python
mapped_phsical_experts = logical_to_all_physical_map[layer_id][
    logical_expert_id
]
if (
    nearest_expert != -1
    and nearest_expert not in mapped_phsical_experts
):
    mapped_phsical_experts[0] = nearest_expert
```
There are a couple of issues in this block:

- There's a typo in the variable name `mapped_phsical_experts`. It should be `mapped_physical_experts`.
- The condition `nearest_expert not in mapped_phsical_experts` will always be false when `nearest_expert != -1`, because the `_find_nearest_expert` function returns an expert from the candidate list (`mapped_phsical_experts`). This makes the `if` block dead code. While this might inadvertently fix the original issue by disabling this logic, it's not a clean solution.
- If the intention is to prioritize the `nearest_expert` by moving it to the front of the list, simply replacing the first element with `mapped_phsical_experts[0] = nearest_expert` could create duplicates if `nearest_expert` is already in the list at a different position.

A better approach is to move the `nearest_expert` to the front of the list. This preserves all assigned experts and correctly prioritizes the nearest one.
Suggested change:

```diff
-mapped_phsical_experts = logical_to_all_physical_map[layer_id][
-    logical_expert_id
-]
-if (
-    nearest_expert != -1
-    and nearest_expert not in mapped_phsical_experts
-):
-    mapped_phsical_experts[0] = nearest_expert
+mapped_physical_experts = logical_to_all_physical_map[layer_id][
+    logical_expert_id
+]
+if nearest_expert != -1:
+    if nearest_expert in mapped_physical_experts:
+        mapped_physical_experts.remove(nearest_expert)
+    mapped_physical_experts.insert(0, nearest_expert)
```
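The difference between the two strategies can be seen in a minimal standalone sketch. The function names below are illustrative only, not part of the sglang codebase; they model a single row of the logical-to-physical map:

```python
# Sketch (not sglang code) contrasting two ways of prioritizing the
# nearest expert in one row of a logical-to-physical expert map.

def overwrite_first(mapped, nearest):
    """Overwriting index 0 can expel an expert and create a duplicate."""
    out = list(mapped)
    if nearest != -1:
        out[0] = nearest
    return out

def move_to_front(mapped, nearest):
    """Remove-then-insert keeps every expert and avoids duplicates."""
    out = list(mapped)
    if nearest != -1:
        if nearest in out:
            out.remove(nearest)
        out.insert(0, nearest)
    return out

row = [0, 256]  # logical expert 0 mapped to physical experts 0 and 256

print(overwrite_first(row, 256))  # [256, 256]: expert 0 expelled, 256 duplicated
print(move_to_front(row, 256))    # [256, 0]: both experts preserved
```

With `nearest == -1`, both variants leave the row unchanged; the difference only matters when a valid nearest expert is found.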
```python
if (
    nearest_expert != -1
    and nearest_expert not in mapped_physical_experts
):
    mapped_physical_experts[0] = nearest_expert
```
why replace the first expert with the nearest?
No specific reason. I'm checking this with the author of the breaking PR.
It's at least better than the original logic
Sure. Just curious, the original logic seemed to replace all mappings with the nearest expert that is not -1, but the new logic seems to replace only the first mapping.
The original logic expelled some needed experts, causing this bug
@Fridge003 what does "some needed experts" mean? Is the first expert enough?
For example, on EP rank 0, [0, 256] is initialized. But the prior logic will change it to [0, -1], then expert 256 is missing.
If [0, 256] is initialized, it's possible for it to be changed to [256, 256]. Would that still cause the issue?
This change will not cause this issue, since 256 is already in there?
Or do you mean we need to check whether the replaced id is unique?
With #10874, on the first and last node, the map could end up in a state where physical experts 256 to 279 are missing.
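One way to make "missing physical experts" concrete is a small sanity-check helper. The function below is hypothetical (not part of sglang) and operates on a toy map; it simply reports physical expert ids that appear in no row:

```python
# Hypothetical helper (not sglang code): find physical expert ids that
# appear nowhere in a logical-to-all-physical map. A -1 entry marks an
# unused slot, as in the examples discussed above.

def find_missing_physical_experts(mapping, num_physical_experts):
    """Return sorted physical expert ids absent from every row."""
    present = {e for row in mapping for e in row if e != -1}
    return sorted(set(range(num_physical_experts)) - present)

# Toy map with 4 physical experts where expert 3 was dropped (set to -1).
mapping = [[0, 2], [1, -1]]
print(find_missing_physical_experts(mapping, 4))  # [3]
```

Any non-empty result means some physical expert's weights (including its input scale) would never be loaded through the map.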
Another question is,
But does it need the map to locate the 288 -> 256 logical experts and then load the weights (including the input scale) from them?
Yeah. If some experts are missing, maybe it will throw some NaN values.
Since CIs are all green, let's first merge this to unblock some ongoing tasks for GB200.
This reverts commit 99e2580.
Motivation
Part of #12293
This bug was introduced in #10874, which wrongly removes some of the redundant experts from `logical_to_all_physical_map` (e.g., the mapping for logical expert 0 should be [0, 256], but is wrongly set to [0, -1]). This only happens on the first and the last node. On these two nodes the values of `w13_input_scale` become NaN. Later it hits errors like:

After this fix, the issue is resolved.
Reproduction:
GB200, latest main branch
https://gist.github.com/kaixih/32bdc4fec4feabe9305d1acb2e1f96db
Modifications
Fix this by keeping the redundant experts on these nodes.
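As a rough illustration of the merged guard, the standalone sketch below (built from the diff shown earlier in this thread, not the actual sglang function) shows that the `not in` check leaves a row like [0, 256] untouched, so the redundant expert 256 is kept:

```python
# Sketch of the merged guard (not the actual sglang code): only overwrite
# the first slot when nearest_expert is valid and not already mapped,
# so redundant experts like 256 are never expelled.

def assign_nearest(mapped_physical_experts, nearest_expert):
    if (
        nearest_expert != -1
        and nearest_expert not in mapped_physical_experts
    ):
        mapped_physical_experts[0] = nearest_expert
    return mapped_physical_experts

print(assign_nearest([0, 256], 256))  # [0, 256]: 256 already mapped, row kept
print(assign_nearest([0, 256], -1))   # [0, 256]: no nearest expert, unchanged
```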
Accuracy Tests
Benchmarking and Profiling
Checklist