[3/N] Achieve fault tolerance at the DP level #11657

Merged
ShangmingCai merged 9 commits into sgl-project:main from HanHan009527:mooncake-pr-eplb-scheduler
Jan 20, 2026

@ympcMark (Contributor) commented Oct 15, 2025

Motivation

The previous PR in this series implemented fault tolerance only at the EP level, not at the DP level. This PR adds DP-level fault tolerance by maintaining worker information through communication between the Scheduler, TokenizerManager, and DataParallelController.

Modifications

io_struct.py: Define scheduler status information

scheduler.py: Send status information to TokenizerManager.

tokenizer_manager.py: Write a handler function to accept this type of information and send information to DataParallelController.

data_parallel_controller.py: Write a handler function to accept this type of information and maintain worker information.
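
The message flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class names `SchedulerStatus`, `TokenizerManagerSketch`, and `DataParallelControllerSketch`, and the fields `dp_rank` and `available`, are hypothetical, and direct method calls stand in for the real inter-process communication.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchedulerStatus:
    """Hypothetical status message; the real io_struct.py definition may differ."""

    dp_rank: int
    available: bool


class DataParallelControllerSketch:
    """Keeps one availability flag per DP worker."""

    def __init__(self, dp_size: int) -> None:
        self.status: List[bool] = [True] * dp_size

    def handle_status(self, msg: SchedulerStatus) -> None:
        # Record the latest availability reported for this worker.
        self.status[msg.dp_rank] = msg.available


class TokenizerManagerSketch:
    """Accepts status messages and forwards them to the controller."""

    def __init__(self, controller: DataParallelControllerSketch) -> None:
        self.controller = controller

    def handle_status(self, msg: SchedulerStatus) -> None:
        self.controller.handle_status(msg)


controller = DataParallelControllerSketch(dp_size=4)
tokenizer_manager = TokenizerManagerSketch(controller)

# A scheduler reports that DP rank 2 is no longer usable.
tokenizer_manager.handle_status(SchedulerStatus(dp_rank=2, available=False))
print(controller.status)  # [True, True, False, True]
```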

Accuracy Tests

Added a new unit test: test/srt/ep/test_mooncake_ep_small.py

Benchmarking and Profiling

Checklist

@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 2 times, most recently from 65a0b1f to 6f302c3 on October 23, 2025 02:35
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 4 times, most recently from 2a08f54 to 09c4da0 on October 23, 2025 04:06
@ympcMark marked this pull request as ready for review on October 23, 2025 04:25
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from 09c4da0 to 567ada0 on October 23, 2025 07:16
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 2 times, most recently from c13f1fc to e7f23db on October 23, 2025 10:08

_TP: Optional[GroupCoordinator] = None
_TP_ACTIVE_RANKS: Optional[torch.Tensor] = None
_TP_ACTIVE_RANKS_CPU: Optional[torch.Tensor] = None
Collaborator

I'm thinking, would it be more reasonable to rename the elastic ep module to elastic and group together concepts like _TP_ACTIVE_RANKS and _TP_ACTIVE_RANKS_CPU as much as possible?

Collaborator

Make it less intrusive to the original logic

Contributor

Sounds reasonable

@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from d03f5e2 to 9959c80 on November 5, 2025 03:33
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 2 times, most recently from de2465d to 2f068be on November 5, 2025 11:48
@UNIDY2002 requested a review from Fridge003 as a code owner on November 7, 2025 07:58
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from ad44587 to 4e13ddd on December 15, 2025 06:55
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from 4e13ddd to e6caee4 on December 24, 2025 03:01
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from bd23c60 to e1c10bc on January 12, 2026 14:32
UNIDY2002 and others added 2 commits January 13, 2026 12:07
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from e1c10bc to 6fcbf39 on January 13, 2026 04:07
# Launch data parallel workers
self.scheduler_procs = []
self.workers: List[zmq.Socket] = [None] * server_args.dp_size
self.status: List[int] = [1] * server_args.dp_size
Collaborator

What is the range of possible values for this status?

Collaborator

It would be better to use an enum class here; I think x == 1 is not easy to understand.

Contributor Author

status contains only 0 and 1, representing whether the worker is available

Contributor Author

Shall we use bool?

Collaborator

yeah, bool sounds better
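
To illustrate why a list of bool reads more clearly than 0/1 integers, here is a small sketch of picking an available worker. It assumes a simple round-robin scan; the helper name `pick_worker` is hypothetical and is not taken from the PR, whose dispatch logic may differ.

```python
from typing import List, Optional


def pick_worker(status: List[bool], start: int) -> Optional[int]:
    """Return the first available worker at or after `start`, wrapping around."""
    n = len(status)
    for offset in range(n):
        rank = (start + offset) % n
        if status[rank]:  # reads naturally with bool; no `== 1` needed
            return rank
    return None  # every worker is marked unavailable


status = [True, False, True, True]
print(pick_worker(status, start=1))  # rank 1 is down, so this prints 2
```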

Comment on lines +75 to +76
global_info_tensor.view(-1, 6)[tp_active_ranks == 0, :] = torch.tensor(
[0, 1, 0, 0, 1, ForwardMode.IDLE.value],
Collaborator

The code here seems to have poor readability.

Contributor

Fixed

@ShangmingCai (Collaborator) left a comment

Makes sense. We need to disable failed EP and DP ranks to achieve fault tolerance in large-scale deployments. However, the Scheduler/TokenizerManager/DP changes should be reviewed carefully by @hnyls2002. Please ping him on Slack.

@UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from 49fcb0e to 8cf153a on January 13, 2026 13:40
@ch-wan (Collaborator) left a comment

LGTM in general. I only have one minor comment.

"72",
]

def test_gsm8k_fault_1(self):
Collaborator

move these tests to the base class

Contributor

Fixed
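
For readers unfamiliar with the pattern the reviewer asked for, one common way to share tests through a base class in unittest looks roughly like this. The class names, the `extra_args` attribute, and the placeholder assertion are all hypothetical; the actual test file in the PR may be organized differently.

```python
import unittest
from typing import List


class FaultToleranceTestBase(unittest.TestCase):
    """Holds the shared test bodies; subclasses only override configuration."""

    # Hypothetical stand-in for per-configuration launch arguments.
    extra_args: List[str] = []

    @classmethod
    def setUpClass(cls) -> None:
        # Skip the base class itself so only concrete configurations run.
        if cls is FaultToleranceTestBase:
            raise unittest.SkipTest("base class holds shared tests only")

    def test_fault_1(self) -> None:
        # Placeholder check; a real test would launch a server and inject a fault.
        self.assertIsInstance(self.extra_args, list)


class TestSmallConfig(FaultToleranceTestBase):
    # Inherits test_fault_1 without redefining it.
    extra_args = ["example-arg"]
```

With this layout, adding a new configuration means adding a subclass that sets its own arguments, and every shared test runs against it automatically.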

@UNIDY2002 (Contributor)

All comments have been resolved. Waiting for the CI.

@UNIDY2002 (Contributor)

May I request a rerun of the failed tests? Thanks!

@ShangmingCai (Collaborator)

/rerun-stage unit-test-deepep-4-gpu

@github-actions (Contributor)

✅ Triggered unit-test-deepep-4-gpu to run independently (skipping dependencies).

@github-actions (Contributor)

🔗 View workflow run

@ShangmingCai (Collaborator)

rerun-stage is broken. We can bypass it since the final push is safe: it only skips one test.

@ShangmingCai merged commit f7a5e42 into sgl-project:main on Jan 20, 2026
30 of 59 checks passed
@UNIDY2002 deleted the mooncake-pr-eplb-scheduler branch on January 21, 2026 02:41
GumpHaruhi pushed a commit to GumpHaruhi/sglang that referenced this pull request Mar 5, 2026