[3/N] Achieve fault tolerance at the DP level #11657
ShangmingCai merged 9 commits into sgl-project:main
Conversation
    _TP: Optional[GroupCoordinator] = None
    _TP_ACTIVE_RANKS: Optional[torch.Tensor] = None
    _TP_ACTIVE_RANKS_CPU: Optional[torch.Tensor] = None
I'm thinking, would it be more reasonable to rename the elastic ep module to `elastic` and group concepts like `_TP_ACTIVE_RANKS` and `_TP_ACTIVE_RANKS_CPU` together as much as possible?
Make it less intrusive to the original logic
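One possible shape for the grouping the reviewer suggests, keeping all elastic active-rank bookkeeping in a single object. This is a hypothetical sketch, not the actual sglang code; plain Python lists stand in for the `torch.Tensor` fields, and `ElasticState` is an invented name:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ElasticState:
    # hypothetical grouping of the module-level _TP / _TP_ACTIVE_RANKS /
    # _TP_ACTIVE_RANKS_CPU globals into one place
    tp_group: Optional[object] = None                          # stands in for GroupCoordinator
    active_ranks: List[int] = field(default_factory=list)      # device-side copy in the real code
    active_ranks_cpu: List[int] = field(default_factory=list)  # CPU-side copy in the real code

    def deactivate(self, rank: int) -> None:
        # keep both copies consistent when a rank is marked failed
        self.active_ranks[rank] = 0
        self.active_ranks_cpu[rank] = 0


state = ElasticState(active_ranks=[1, 1, 1, 1], active_ranks_cpu=[1, 1, 1, 1])
state.deactivate(2)
print(state.active_ranks_cpu)  # → [1, 1, 0, 1]
```

Grouping the two tensors behind one mutator also makes it harder for the GPU and CPU copies to drift apart.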
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
    # Launch data parallel workers
    self.scheduler_procs = []
    self.workers: List[zmq.Socket] = [None] * server_args.dp_size
    self.status: List[int] = [1] * server_args.dp_size
What is the range of possible values for this status?
It would be better to use an enum class here; I think `x == 1` is not easy to understand.
`status` contains only 0 and 1, representing whether the worker is available.
Shall we use `bool`?
Yeah, `bool` sounds better.
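The thread above settles on a plain availability flag; the enum variant that was also floated could look like the following. A minimal sketch only; `WorkerStatus`, `mark_failed`, and `healthy_ranks` are illustrative names, not the actual sglang code:

```python
from enum import Enum
from typing import List


class WorkerStatus(Enum):
    # replaces the bare 0/1 ints so the meaning is explicit at call sites
    DEAD = 0
    ALIVE = 1


dp_size = 4
status: List[WorkerStatus] = [WorkerStatus.ALIVE] * dp_size


def mark_failed(rank: int) -> None:
    # called when a scheduler reports its DP worker as gone
    status[rank] = WorkerStatus.DEAD


def healthy_ranks() -> List[int]:
    # ranks still eligible to receive requests
    return [i for i, s in enumerate(status) if s is WorkerStatus.ALIVE]


mark_failed(2)
print(healthy_ranks())  # → [0, 1, 3]
```

A `List[bool]` achieves the same thing with less ceremony, which is presumably why the thread converged on it.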
    global_info_tensor.view(-1, 6)[tp_active_ranks == 0, :] = torch.tensor(
        [0, 1, 0, 0, 1, ForwardMode.IDLE.value],
The code here seems to have poor readability.
Makes sense. We need to disable failed EP ranks and DP ranks to achieve fault tolerance during large-scale deployment. However, the Scheduler/TokenizerManager/DP changes should be reviewed carefully by @hnyls2002. Please ping him on Slack.
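The snippet under discussion overwrites the info rows of every inactive rank with an idle sentinel row in one masked assignment. The same pattern can be shown in isolation with NumPy; the shapes, names, and the `IDLE_MODE` value here are assumptions for illustration, not the real sglang values:

```python
import numpy as np

IDLE_MODE = 4  # placeholder for ForwardMode.IDLE.value

# one 6-element info row per rank; tp_active_ranks marks live ranks with 1
global_info = np.arange(4 * 6).reshape(4, 6)
tp_active_ranks = np.array([1, 0, 1, 0])

# boolean-mask row assignment: every inactive rank's row becomes the idle row
global_info[tp_active_ranks == 0, :] = np.array([0, 1, 0, 0, 1, IDLE_MODE])

print(global_info[1])  # rows 1 and 3 now hold [0, 1, 0, 0, 1, 4]
```

Naming the sentinel row (e.g. an `IDLE_ROW` constant) would address the readability concern without changing behavior.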
ch-wan left a comment:
LGTM in general. I only have one minor comment.
        "72",
    ]

    def test_gsm8k_fault_1(self):
Move these tests to the base class.
All comments have been resolved. Waiting for the CI.

May I request a rerun of the failed tests? Thanks!

CI tests almost pass (https://github.com/sgl-project/sglang/actions/runs/21139539897?pr=11657), except for one case in stage-c-test-large-4-gpu-b200 and one pending unit-test-backend-4-gpu-gb200.

/rerun-stage unit-test-deepep-4-gpu

✅ Triggered

rerun-stage is broken. We can bypass it since the final push is safe; it only skips one test.

Motivation

The previous work of this PR only implemented fault tolerance at the EP level, but not at the DP level. We achieve this functionality by maintaining worker information through communication between the Scheduler, TokenizerManager, and DataParallelController.

Modifications

- io_struct.py: Define scheduler status information.
- scheduler.py: Send status information to TokenizerManager.
- tokenizer_manager.py: Write a handler function to accept this type of information and send information to DataParallelController.
- data_parallel_controller.py: Write a handler function to accept this type of information and maintain worker information.

Accuracy Tests

- Added a new unit test: test/srt/ep/test_mooncake_ep_small.py

Benchmarking and Profiling

Checklist
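The Scheduler to TokenizerManager to DataParallelController status flow described in the Modifications section can be sketched roughly as follows. Class and message names here are hypothetical stand-ins, not the actual sglang implementation, and the ZMQ transport is omitted:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchedulerStatusMsg:
    # hypothetical message, loosely modeled on the io_struct.py addition:
    # a scheduler reports whether its DP worker is still available
    dp_rank: int
    alive: bool


class DataParallelControllerSketch:
    def __init__(self, dp_size: int):
        # True == worker available; mirrors the PR's per-worker status list
        self.status: List[bool] = [True] * dp_size

    def handle_status(self, msg: SchedulerStatusMsg) -> None:
        # handler for status messages forwarded by the TokenizerManager
        self.status[msg.dp_rank] = msg.alive

    def pick_worker(self) -> int:
        # real dispatch would balance load; simplest policy: first live worker
        return next(i for i, ok in enumerate(self.status) if ok)


ctrl = DataParallelControllerSketch(4)
ctrl.handle_status(SchedulerStatusMsg(dp_rank=0, alive=False))
print(ctrl.pick_worker())  # → 1
```

The key property is that dispatch consults the maintained status list, so requests stop flowing to a DP rank as soon as its failure report arrives.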