[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank #10606

Merged
ch-wan merged 34 commits into sgl-project:main from HanHan009527:mooncake-pr-eplb
Oct 22, 2025
Collaborator

@HanHan009527 HanHan009527 commented Sep 18, 2025

Motivation

  1. To integrate Mooncake's fault awareness, we need to adjust the EPLB algorithm and the model loading logic so that the forward pass can bypass faulty ranks.

Based on #10423.
See our next PR #11657 and the full draft #8961 (updated on 10.16) to test the effect of fault redundancy.

The unit test (UT) is modified to make testing easier on machines with different ibdev names.

Modifications

  1. Add the core structure of the elastic EP module, which bridges forward propagation and scheduling.
  2. Add an EPLB algorithm that supports faulty ranks.
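
The idea behind item 2 can be illustrated with a minimal sketch. This is not the algorithm added in this PR; it is a simplified greedy placement, assuming one copy per expert and a 0/1 `active_ranks` mask, and the function name `place_experts_on_active_ranks` is hypothetical. It only shows the core constraint: experts are assigned exclusively to ranks marked healthy.

```python
import heapq
from typing import List


def place_experts_on_active_ranks(
    expert_loads: List[float], active_ranks: List[int]
) -> List[int]:
    """Return, for each expert, the rank hosting it; faulty ranks (mask 0) are skipped."""
    healthy = [rank for rank, alive in enumerate(active_ranks) if alive]
    if not healthy:
        raise ValueError("no active ranks available")

    # Min-heap of (accumulated load, rank): the lightest healthy rank is served first.
    heap = [(0.0, rank) for rank in healthy]
    heapq.heapify(heap)

    placement = [-1] * len(expert_loads)
    # Place the heaviest experts first (longest-processing-time heuristic).
    for expert in sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e]):
        load, rank = heapq.heappop(heap)
        placement[expert] = rank
        heapq.heappush(heap, (load + expert_loads[expert], rank))
    return placement


# Rank 1 is marked faulty, so no expert may land on it.
placement = place_experts_on_active_ranks([5.0, 3.0, 2.0, 2.0], active_ranks=[1, 0, 1, 1])
```

With rank 1 masked out, the four experts are spread over ranks 0, 2, and 3 only. A production EPLB also handles replicated (redundant) experts and node topology, which this sketch leaves out.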

Accuracy Tests

Test results are available in our full draft, #8961.

Benchmarking and Profiling

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @HanHan009527, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's resilience and efficiency in distributed environments by integrating Mooncake's fault-aware Expert Parallelism (EP). It establishes a foundational elastic EP module and introduces an intelligent load balancing algorithm capable of adjusting expert distribution in the presence of faulty ranks. Furthermore, the changes enable the model loading process to dynamically adapt to the current set of active ranks, ensuring continuous operation even when some nodes are unavailable.

Highlights

  • Mooncake EP Integration: Introduced the core structure for Mooncake's elastic Expert Parallelism (EP) to enable fault-aware distributed computing, allowing forward passes to bypass faulty ranks.
  • Elasticity-Aware Load Balancing: Implemented a new Expert Parallelism Load Balancing (EPLB) algorithm that considers and adapts to faulty ranks, ensuring efficient expert distribution even with node failures.
  • Dynamic Model Loading and Updates: Modified model loading logic to bypass faulty ranks and dynamically update expert weights based on the current set of active ranks, enhancing system resilience.
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the core structure for elastic expert parallelism (EP) and a new load-balancing algorithm (elasticity_aware) to support fault-tolerant ranks, primarily for integration with the Mooncake backend. The changes are well-structured, adding a new elastic_ep module for state management, a mooncake token dispatcher, and updating various parts of the system to be aware of the new backend and fault-tolerance logic. My review includes suggestions for code cleanup, improving configurability by removing hardcoded values, and refactoring for better maintainability and extensibility.

@HanHan009527 HanHan009527 changed the title from "[2/N] Added the core structure of elastic EP and the eplb algorithm with rank loss" to "[2/N] Added the core structure of elastic EP and the eplb algorithm with faulty rank" Sep 18, 2025
@HanHan009527 HanHan009527 force-pushed the mooncake-pr-eplb branch 2 times, most recently from 23ad09f to a52e2ec October 2, 2025 07:44
@HanHan009527 HanHan009527 force-pushed the mooncake-pr-eplb branch 2 times, most recently from 3b0c5fd to f671015 October 9, 2025 12:30
@ShangmingCai
Collaborator

CC: @zhihui1084

@HanHan009527 HanHan009527 marked this pull request as ready for review October 15, 2025 14:03
HanHan009527 and others added 3 commits October 16, 2025 01:00
fix

fix

fix

fix

fix

fix

fix

ut

ut

ut

fix

fit
fi

fi

fix

fix

fix

fix

fix

fix

fix

fix

fix

fit

fix
Comment on lines +70 to +77
    @classmethod
    def healthy_rank_state(
        cls, *, ep_size: Optional[int], device: Optional[torch.device]
    ) -> torch.Tensor:
        size = ep_size if ep_size is not None else torch.distributed.get_world_size()
        dev = device if device is not None else cls._select_device()

        return torch.ones(size, dtype=torch.int32, device=dev)
Collaborator

Should this dtype be changed to torch.int64 to align with the future usage?

Contributor

Mooncake EP currently uses int32. BTW, what does "future usage" refer to?

Collaborator

@UNIDY2002 Just wondering whether it should align with log2phy, which is int64. I am not sure, though.

Contributor

I did a quick check. active_ranks has two usages in rebalance_experts: (1) num_active_ranks = active_ranks.sum().item(); (2) active_ranks_list = active_ranks.tolist(). Neither depends on the integer width, so I think using int32 for active_ranks should be okay.

Comment on lines +71 to +75
    def healthy_rank_state(
        cls, *, ep_size: Optional[int], device: Optional[torch.device]
    ) -> torch.Tensor:
        size = ep_size if ep_size is not None else torch.distributed.get_world_size()
        dev = device if device is not None else cls._select_device()
Collaborator

nit: maybe give ep_size and device a default value: None

Collaborator Author

done
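
For readers following the thread, the suggested signature change might look like the sketch below. It is a stand-in, not the merged code: the real method returns a torch.int32 tensor and falls back to torch.distributed.get_world_size(), whereas this version returns a plain list and uses a hypothetical `DEFAULT_WORLD_SIZE` constant so that it runs without torch. The class name `RankStateSketch` is also hypothetical.

```python
from typing import List, Optional


class RankStateSketch:
    """Hypothetical stand-in for the class that owns healthy_rank_state."""

    # Assumption: replaces torch.distributed.get_world_size() in this sketch.
    DEFAULT_WORLD_SIZE = 4

    @classmethod
    def healthy_rank_state(
        cls, *, ep_size: Optional[int] = None, device: Optional[str] = None
    ) -> List[int]:
        # Both keyword-only arguments now default to None, as the reviewer suggested.
        size = ep_size if ep_size is not None else cls.DEFAULT_WORLD_SIZE
        # `device` only mirrors the reviewed signature; a Python list has no device.
        return [1] * size


all_healthy = RankStateSketch.healthy_rank_state()           # no arguments needed
two_ranks = RankStateSketch.healthy_rank_state(ep_size=2)    # explicit size still works
```

With the defaults in place, callers that want the fallback behavior can omit both arguments instead of passing None explicitly.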

Collaborator

@ShangmingCai ShangmingCai left a comment

LGTM. Changes are clean.

Collaborator

@fzyzcjy fzyzcjy left a comment

LGTM, only a nit.


    @classmethod
    def init(cls, server_args: ServerArgs):
        with cls._lock:
Collaborator

nit: shall we init it only once, in a single thread? Then the code here can be simplified.

Collaborator Author

Got it, I will remove this lock. This part just follows the customary way of writing a singleton; the code was single-threaded to begin with.
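
A lock-free version of such a one-time initializer could be sketched as follows. This is an illustration of the pattern discussed in the thread, not the merged code: the class name `ElasticStateSketch`, the `get` accessor, and the `backend` string argument are hypothetical (the reviewed `init` takes a `ServerArgs`), and the sketch assumes the caller invokes `init` exactly once from a single thread.

```python
from typing import Optional


class ElasticStateSketch:
    """Hypothetical process-wide state holder, initialized once from one thread."""

    _instance: Optional["ElasticStateSketch"] = None

    def __init__(self, backend: str) -> None:
        self.backend = backend

    @classmethod
    def init(cls, backend: str) -> "ElasticStateSketch":
        # No lock: the caller guarantees a single-threaded, one-time call.
        if cls._instance is not None:
            raise RuntimeError("already initialized")
        cls._instance = cls(backend)
        return cls._instance

    @classmethod
    def get(cls) -> "ElasticStateSketch":
        # Fail loudly if the state is read before startup has initialized it.
        if cls._instance is None:
            raise RuntimeError("init() has not been called")
        return cls._instance


state = ElasticStateSketch.init("mooncake")
```

Raising on a second `init` call makes an accidental double initialization visible instead of silently guarding against it with a lock.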

@ShangmingCai ShangmingCai added the ready-to-merge label (The PR is ready to merge after the CI is green.) Oct 20, 2025
@ch-wan ch-wan merged commit 904655c into sgl-project:main Oct 22, 2025
139 of 143 checks passed
xjpang pushed a commit to xjpang/sglang that referenced this pull request Oct 22, 2025
…ith faulty rank (sgl-project#10606)

Co-authored-by: Xun Sun <UNIDY2002@outlook.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Labels

ready-to-merge The PR is ready to merge after the CI is green. run-ci

6 participants