[NVIDIA] Fix local_num_experts for EP #8779
Conversation
Summary of Changes
Hello @wenscarl, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I'm addressing a bug in the handling of expert parallelism (EP) within the SGLang framework. Specifically, the create_weights function was not correctly receiving or utilizing the num_local_experts parameter, leading to issues when initializing weights for MoE layers. This PR ensures that both the total number of experts and the number of local experts are properly passed and assigned during weight creation.
Highlights
- **Bug Fix for Expert Parallelism**: I've resolved an issue where `num_local_experts` was incorrectly handled during weight creation for Mixture-of-Experts (MoE) layers in expert parallelism (EP) setups.
- **Updated `create_weights` Call**: The `create_weights` method in `fused_moe_triton/layer.py` now explicitly passes both `num_experts` and `num_local_experts` to the quantization method.
- **Modified `create_weights` Signature**: The `create_weights` method within `modelopt_quant.py` has been updated to accept the new `num_local_experts` parameter.
- **Correct `num_local_experts` Assignment**: Inside `modelopt_quant.py`, the `layer.num_local_experts` attribute is now correctly assigned the value of the `num_local_experts` parameter, rather than `num_experts`.
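The corrected flow can be sketched as follows. This is a minimal, illustrative Python sketch of the pattern the fix restores, not the actual SGLang code: the class and helper names (`MoELayerSketch`, `create_weights` arguments beyond those named in the PR) are hypothetical, and weight shapes are shown as tuples rather than real tensors to keep the example self-contained.

```python
# Illustrative sketch (NOT the actual SGLang implementation): under expert
# parallelism (EP), each rank hosts only num_local_experts of the num_experts
# total experts. Weight buffers must be sized by the *local* count, while the
# global count is still needed for routing metadata.

class MoELayerSketch:
    """Hypothetical stand-in for an MoE layer module."""
    pass

def create_weights(layer, num_experts, num_local_experts,
                   hidden_size, intermediate_size):
    layer.num_experts = num_experts
    # The bug being fixed: this attribute was previously assigned num_experts.
    layer.num_local_experts = num_local_experts
    # Per-expert weight shapes are sized by the local expert count.
    layer.w13_shape = (num_local_experts, 2 * intermediate_size, hidden_size)
    layer.w2_shape = (num_local_experts, hidden_size, intermediate_size)

layer = MoELayerSketch()
create_weights(layer, num_experts=64, num_local_experts=8,
               hidden_size=4096, intermediate_size=14336)
```

With 64 experts split over 8 EP ranks, each rank allocates weights for only 8 experts; before the fix, `layer.num_local_experts` would have reported 64.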
Code Review
This pull request addresses a bug in expert parallelism by correctly passing both num_experts and num_local_experts during weight creation. The previous implementation incorrectly used num_local_experts for the total number of experts. The logical changes appear correct and align with the intended fix. My review includes a couple of minor suggestions to fix indentation for better code consistency.
The indentation for this line appears to be using a tab character, which is inconsistent with the surrounding code that uses spaces. To maintain a consistent code style, please replace the tab with spaces.
num_local_experts=self.num_local_experts,
Force-pushed 490660b to 97a8052
@@ -752,6 +752,7 @@ def create_weights(
    self,
    layer: torch.nn.Module,
    num_experts: int,
Why not simply change `num_experts` to `num_local_experts`? I think we just need the actual number of experts when allocating weights, right?
Motivation
This PR fixes a bug introduced in #8552. In `create_weights`, both `num_experts` and `num_local_experts` should be passed in for the EP case.
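Both counts are needed because routing operates on global expert ids while storage is per-rank. A hypothetical sketch (the `global_to_local` helper below is illustrative, assuming experts are evenly partitioned across EP ranks; it is not SGLang API):

```python
# Hypothetical helper: routing produces *global* expert ids in
# [0, num_experts), but each EP rank only stores weights for its
# num_local_experts experts, so a global id must be mapped to a
# (rank, local_index) pair before indexing into the weight tensors.
def global_to_local(expert_id, num_experts, ep_size):
    num_local_experts = num_experts // ep_size  # assumes an even split
    rank = expert_id // num_local_experts
    local_id = expert_id % num_local_experts
    return rank, local_id
```

For example, with 64 experts across 8 EP ranks, global expert 21 lives on rank 2 at local slot 5; sizing weights with the global count (the pre-fix behavior) would allocate 8x more storage per rank than needed.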
@kaixih @kushanam
Modifications
Accuracy Test
Benchmark & Profiling
Checklist