[3/N] Achieve fault tolerance at the DP level #11657
ShangmingCai merged 9 commits into sgl-project:main
Conversation
    _TP: Optional[GroupCoordinator] = None
    _TP_ACTIVE_RANKS: Optional[torch.Tensor] = None
    _TP_ACTIVE_RANKS_CPU: Optional[torch.Tensor] = None
I'm thinking, would it be more reasonable to rename the elastic ep module to `elastic` and group concepts like `_TP_ACTIVE_RANKS` and `_TP_ACTIVE_RANKS_CPU` together as much as possible?
Make it less intrusive to the original logic
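One possible shape for the grouping the reviewer suggests, keeping all elastic active-rank bookkeeping in a single object. This is a hypothetical sketch, not the actual sglang code; plain Python lists stand in for the `torch.Tensor` fields, and `ElasticState` is an invented name:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ElasticState:
    # hypothetical grouping of the module-level _TP / _TP_ACTIVE_RANKS /
    # _TP_ACTIVE_RANKS_CPU globals into one place
    tp_group: Optional[object] = None                          # stands in for GroupCoordinator
    active_ranks: List[int] = field(default_factory=list)      # device-side copy in the real code
    active_ranks_cpu: List[int] = field(default_factory=list)  # CPU-side copy in the real code

    def deactivate(self, rank: int) -> None:
        # keep both copies consistent when a rank is marked failed
        self.active_ranks[rank] = 0
        self.active_ranks_cpu[rank] = 0


state = ElasticState(active_ranks=[1, 1, 1, 1], active_ranks_cpu=[1, 1, 1, 1])
state.deactivate(2)
print(state.active_ranks_cpu)  # → [1, 1, 0, 1]
```

Grouping the two tensors behind one mutator also makes it harder for the GPU and CPU copies to drift apart.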
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
    # Launch data parallel workers
    self.scheduler_procs = []
    self.workers: List[zmq.Socket] = [None] * server_args.dp_size
    self.status: List[int] = [1] * server_args.dp_size
What is the range of possible values for this status?
It would be better to use an enum class here; I think `x == 1` is not easy to understand.
`status` contains only 0 and 1, representing whether the worker is available.
Shall we use `bool`?
Yeah, `bool` sounds better.
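The thread above settles on a plain availability flag; the enum variant that was also floated could look like the following. A minimal sketch only; `WorkerStatus`, `mark_failed`, and `healthy_ranks` are illustrative names, not the actual sglang code:

```python
from enum import Enum
from typing import List


class WorkerStatus(Enum):
    # replaces the bare 0/1 ints so the meaning is explicit at call sites
    DEAD = 0
    ALIVE = 1


dp_size = 4
status: List[WorkerStatus] = [WorkerStatus.ALIVE] * dp_size


def mark_failed(rank: int) -> None:
    # called when a scheduler reports its DP worker as gone
    status[rank] = WorkerStatus.DEAD


def healthy_ranks() -> List[int]:
    # ranks still eligible to receive requests
    return [i for i, s in enumerate(status) if s is WorkerStatus.ALIVE]


mark_failed(2)
print(healthy_ranks())  # → [0, 1, 3]
```

A `List[bool]` achieves the same thing with less ceremony, which is presumably why the thread converged on it.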
    global_info_tensor.view(-1, 6)[tp_active_ranks == 0, :] = torch.tensor(
        [0, 1, 0, 0, 1, ForwardMode.IDLE.value],
The code here seems to have poor readability.
Makes sense. We need to disable failed EP ranks and DP ranks to achieve fault tolerance during large-scale deployment. However, the Scheduler/TokenizerManager/DP changes should be reviewed carefully by @hnyls2002. Please ping him on Slack.
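The snippet under discussion overwrites the info rows of every inactive rank with an idle sentinel row in one masked assignment. The same pattern can be shown in isolation with NumPy; the shapes, names, and the `IDLE_MODE` value here are assumptions for illustration, not the real sglang values:

```python
import numpy as np

IDLE_MODE = 4  # placeholder for ForwardMode.IDLE.value

# one 6-element info row per rank; tp_active_ranks marks live ranks with 1
global_info = np.arange(4 * 6).reshape(4, 6)
tp_active_ranks = np.array([1, 0, 1, 0])

# boolean-mask row assignment: every inactive rank's row becomes the idle row
global_info[tp_active_ranks == 0, :] = np.array([0, 1, 0, 0, 1, IDLE_MODE])

print(global_info[1])  # rows 1 and 3 now hold [0, 1, 0, 0, 1, 4]
```

Naming the sentinel row (e.g. an `IDLE_ROW` constant) would address the readability concern without changing behavior.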
ch-wan left a comment:
LGTM in general. I only have one minor comment.
        "72",
    ]

    def test_gsm8k_fault_1(self):
Move these tests to the base class.
All comments have been resolved. Waiting for the CI.

May I request a rerun of the failed tests? Thanks!

CI tests almost pass (https://github.com/sgl-project/sglang/actions/runs/21139539897?pr=11657), except for one case in stage-c-test-large-4-gpu-b200 and one pending unit-test-backend-4-gpu-gb200.

/rerun-stage unit-test-deepep-4-gpu

✅ Triggered

rerun-stage is broken. We can bypass it since the final push is safe; it only skips one test.

Motivation

The previous work of this PR only implemented fault tolerance at the EP level, but not at the DP level. We achieve this functionality by maintaining worker information through communication between the Scheduler, TokenizerManager, and DataParallelController.

Modifications

- io_struct.py: Define scheduler status information.
- scheduler.py: Send status information to TokenizerManager.
- tokenizer_manager.py: Write a handler function to accept this type of information and send information to DataParallelController.
- data_parallel_controller.py: Write a handler function to accept this type of information and maintain worker information.

Accuracy Tests

- Added a new unit test: test/srt/ep/test_mooncake_ep_small.py

Benchmarking and Profiling

Checklist
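The Scheduler to TokenizerManager to DataParallelController status flow described in the Modifications section can be sketched roughly as follows. Class and message names here are hypothetical stand-ins, not the actual sglang implementation, and the ZMQ transport is omitted:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SchedulerStatusMsg:
    # hypothetical message, loosely modeled on the io_struct.py addition:
    # a scheduler reports whether its DP worker is still available
    dp_rank: int
    alive: bool


class DataParallelControllerSketch:
    def __init__(self, dp_size: int):
        # True == worker available; mirrors the PR's per-worker status list
        self.status: List[bool] = [True] * dp_size

    def handle_status(self, msg: SchedulerStatusMsg) -> None:
        # handler for status messages forwarded by the TokenizerManager
        self.status[msg.dp_rank] = msg.alive

    def pick_worker(self) -> int:
        # real dispatch would balance load; simplest policy: first live worker
        return next(i for i, ok in enumerate(self.status) if ok)


ctrl = DataParallelControllerSketch(4)
ctrl.handle_status(SchedulerStatusMsg(dp_rank=0, alive=False))
print(ctrl.pick_worker())  # → 1
```

The key property is that dispatch consults the maintained status list, so requests stop flowing to a DP rank as soon as its failure report arrives.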