[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank #10606
ch-wan merged 34 commits into sgl-project:main
Summary of Changes
Hello @HanHan009527, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the system's resilience and efficiency in distributed environments by integrating Mooncake's fault-aware Expert Parallelism (EP). It establishes a foundational elastic EP module and introduces an intelligent load balancing algorithm capable of adjusting expert distribution in the presence of faulty ranks. Furthermore, the changes enable the model loading process to dynamically adapt to the current set of active ranks, ensuring continuous operation even when some nodes are unavailable.
Highlights
- Mooncake EP Integration: Introduced the core structure for Mooncake's elastic Expert Parallelism (EP) to enable fault-aware distributed computing, allowing forward passes to bypass faulty ranks.
- Elasticity-Aware Load Balancing: Implemented a new Expert Parallelism Load Balancing (EPLB) algorithm that considers and adapts to faulty ranks, ensuring efficient expert distribution even with node failures.
- Dynamic Model Loading and Updates: Modified model loading logic to bypass faulty ranks and dynamically update expert weights based on the current set of active ranks, enhancing system resilience.
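As a rough illustration of what "elasticity-aware" balancing means, the sketch below places experts only on ranks marked healthy in an `active_ranks` mask. It is a minimal greedy heuristic written for this summary, not the algorithm implemented in the PR; the function name `place_experts_on_active_ranks` and its inputs are hypothetical.

```python
from typing import List


def place_experts_on_active_ranks(
    expert_loads: List[float], active_ranks: List[int]
) -> List[int]:
    """Return, for each expert, the healthy rank it is assigned to.

    Hypothetical helper: active_ranks[r] == 1 means rank r is healthy,
    0 means it is faulty and must not receive any expert.
    """
    healthy = [rank for rank, ok in enumerate(active_ranks) if ok]
    if not healthy:
        raise ValueError("no active ranks available")

    rank_load = {rank: 0.0 for rank in healthy}
    placement = [-1] * len(expert_loads)

    # Greedy: place the heaviest experts first, each on the currently
    # least-loaded healthy rank.
    order = sorted(range(len(expert_loads)), key=lambda i: -expert_loads[i])
    for expert in order:
        target = min(healthy, key=lambda rank: rank_load[rank])
        placement[expert] = target
        rank_load[target] += expert_loads[expert]
    return placement


# Rank 1 is faulty, so no expert is placed on it.
placement = place_experts_on_active_ranks([4.0, 3.0, 2.0, 1.0], [1, 0, 1, 1])
```

When a rank fails, rerunning the placement with the updated mask redistributes its experts across the remaining ranks.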
Code Review
This pull request introduces the core structure for elastic expert parallelism (EP) and a new load-balancing algorithm (elasticity_aware) to support fault-tolerant ranks, primarily for integration with the Mooncake backend. The changes are well-structured, adding a new elastic_ep module for state management, a mooncake token dispatcher, and updating various parts of the system to be aware of the new backend and fault-tolerance logic. My review includes suggestions for code cleanup, improving configurability by removing hardcoded values, and refactoring for better maintainability and extensibility.
CC: @zhihui1084
    @classmethod
    def healthy_rank_state(
        cls, *, ep_size: Optional[int], device: Optional[torch.device]
    ) -> torch.Tensor:
        size = ep_size if ep_size is not None else torch.distributed.get_world_size()
        dev = device if device is not None else cls._select_device()

        return torch.ones(size, dtype=torch.int32, device=dev)
Should this dtype be changed to torch.int64 to align with the future usage?
Mooncake EP currently uses int32. BTW, what does "future usage" refer to?
@UNIDY2002 I was just wondering whether it should align with log2phy, which is int64. But I am not sure.
I did a quick check. active_ranks has two usages in rebalance_experts: (1) num_active_ranks = active_ranks.sum().item(); (2) active_ranks_list = active_ranks.tolist(). Neither depends on the tensor's integer width, so I think using int32 for active_ranks should be okay.
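The two usages quoted above can be checked in isolation. The snippet below is a small sketch using only standard PyTorch calls; the mask values are made up for illustration and it does not call rebalance_experts itself.

```python
import torch

# Illustrative mask: rank 1 is marked faulty (values are made up).
active_ranks = torch.tensor([1, 0, 1, 1], dtype=torch.int32)

# Usage (1): integer sums are accumulated in int64, and .item() yields
# a plain Python int regardless of the input width.
num_active_ranks = active_ranks.sum().item()

# Usage (2): .tolist() yields plain Python ints as well.
active_ranks_list = active_ranks.tolist()

sum_dtype = active_ranks.sum().dtype
```

Both results are ordinary Python integers, so downstream code that consumes them never observes the int32 storage type.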
    def healthy_rank_state(
        cls, *, ep_size: Optional[int], device: Optional[torch.device]
    ) -> torch.Tensor:
        size = ep_size if ep_size is not None else torch.distributed.get_world_size()
        dev = device if device is not None else cls._select_device()
nit: maybe give ep_size and device a default value of None, so callers can omit them.
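A sketch of what this suggestion could look like is shown below. The class name `RankStateSketch` and the body of `_select_device` are assumptions made for the example; only the `healthy_rank_state` body mirrors the excerpt under review.

```python
from typing import Optional

import torch
import torch.distributed


class RankStateSketch:  # hypothetical stand-in for the class under review
    @staticmethod
    def _select_device() -> torch.device:
        # Assumed behaviour: prefer CUDA when present, otherwise CPU.
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")

    @classmethod
    def healthy_rank_state(
        cls,
        *,
        ep_size: Optional[int] = None,
        device: Optional[torch.device] = None,
    ) -> torch.Tensor:
        # With None defaults, callers may omit either keyword argument.
        # The world-size fallback needs an initialized process group.
        size = ep_size if ep_size is not None else torch.distributed.get_world_size()
        dev = device if device is not None else cls._select_device()
        return torch.ones(size, dtype=torch.int32, device=dev)


state = RankStateSketch.healthy_rank_state(ep_size=4, device=torch.device("cpu"))
```

The arguments stay keyword-only, so existing call sites that pass both values explicitly keep working unchanged.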
ShangmingCai left a comment
LGTM. Changes are clean.
    @classmethod
    def init(cls, server_args: ServerArgs):
        with cls._lock:
nit: shall we initialize it only once, from a single thread? Then the lock is unnecessary and the code here can be simplified.
Got it, I will remove this lock. This part just followed the customary way of writing a singleton; the code path was single-threaded to begin with.
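For reference, a lock-free singleton initializer of the kind discussed here might look like the sketch below. It assumes `init` is only ever called from one thread; the class name and the `config` argument are hypothetical and stand in for the real class and `ServerArgs`.

```python
from typing import Optional


class ElasticStateSketch:  # hypothetical name, not the class from the PR
    _instance: Optional["ElasticStateSketch"] = None

    def __init__(self, config: dict):
        self.config = config

    @classmethod
    def init(cls, config: dict) -> "ElasticStateSketch":
        # No lock: this relies on init() being called from a single thread.
        if cls._instance is None:
            cls._instance = cls(config)
        return cls._instance

    @classmethod
    def instance(cls) -> "ElasticStateSketch":
        if cls._instance is None:
            raise RuntimeError("init() must be called first")
        return cls._instance


first = ElasticStateSketch.init({"ep_size": 4})
second = ElasticStateSketch.init({"ep_size": 8})  # ignored, already initialized
```

If initialization could ever be reached from several threads, the lock would need to come back.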
…ith faulty rank (sgl-project#10606) Co-authored-by: Xun Sun <UNIDY2002@outlook.com> Co-authored-by: Shangming Cai <csmthu@gmail.com>
Motivation
Based on #10423.
See our next PR #11657 and the full draft #8961 (updated on 10.16) to test the effect of fault redundancy.
The unit tests are modified to facilitate testing on machines with different ibdev names.
Modifications
Accuracy Tests
Test results are available in our full draft version, #8961.
Benchmarking and Profiling
Checklist