[6/N] (Elastic EP) Recover failed ranks #15771
Draft
UNIDY2002 wants to merge 1 commit into sgl-project:main from
This was referenced Dec 24, 2025
UNIDY2002 added a commit to HanHan009527/sglang that referenced this pull request on Mar 9, 2026
Skip the EPLB rebalance after rank recovery and sync expert weights directly to the recovered ranks. Running EPLB at that point causes asymmetric P2P operations, because old_expert_location_metadata is incorrect for recovered ranks.

The root cause is that ExpertLocationUpdater relies on both old_expert_location_metadata and new_expert_location_metadata. For a recovered rank, the old metadata is stale or incorrect, so the wrong P2P operations are calculated (e.g., rank 6 expects to send to rank 7, but rank 7 doesn't produce a corresponding irecv op).

Instead of modifying expert_location_updater.py, the correct approach is to skip EPLB after 'EPLB due to rank faults' and let the expert weights be synced through normal Mooncake EP operation.

See: sgl-project#15771
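The asymmetry described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: `plan_ops` and `unmatched_ops` are hypothetical helpers invented for illustration, the "lowest-ranked holder sends" rule is an assumption, and none of this reflects the actual logic in expert_location_updater.py. Each rank plans its own sends and receives from its own view of the old placement; if one rank's view is stale, a send can be posted with no matching receive.

```python
# Toy model (hypothetical): every rank derives its P2P ops from its OWN view
# of the old expert placement plus the shared new placement.

def plan_ops(rank, old_view, new):
    """Return the ops `rank` would post, given its view of the old placement."""
    ops = set()
    for dst, experts in new.items():
        # Experts that dst must receive, according to this rank's old view.
        for expert in sorted(experts - old_view.get(dst, set())):
            holders = [r for r, held in sorted(old_view.items()) if expert in held]
            if not holders:
                continue
            src = holders[0]  # assumed rule: lowest-ranked holder is the sender
            if src == rank:
                ops.add(("send", src, dst, expert))
            if dst == rank:
                ops.add(("recv", src, dst, expert))
    return ops


def unmatched_ops(views, new):
    """Return (src, dst, expert) transfers posted on only one side."""
    sends, recvs = set(), set()
    for rank, view in views.items():
        for kind, src, dst, expert in plan_ops(rank, view, new):
            (sends if kind == "send" else recvs).add((src, dst, expert))
    return sorted(sends ^ recvs)


# Rank 7 was just recovered and actually holds nothing, but its own old
# metadata (stale) claims it already holds expert 0.
true_old = {6: {0}, 7: set()}
stale_old = {6: {0}, 7: {0}}
new = {6: {0}, 7: {0}}

mismatch = unmatched_ops({6: true_old, 7: stale_old}, new)
consistent = unmatched_ops({6: true_old, 7: true_old}, new)
```

In this toy setup, rank 6 posts a send of expert 0 to rank 7, while rank 7 posts no receive, so `mismatch` contains that one dangling transfer and `consistent` is empty.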
Motivation
As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, recovering the optimal throughput. This also marks the completion of our planned support for elasticity and fault tolerance, as outlined in the roadmap in #8961.
The core idea is as follows. When a rank failure is detected, the data-parallel controller automatically re-launches the corresponding process. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process becomes ready, it is able to seamlessly rejoin the existing process group. With this design, disruption to ongoing inference is minimized.
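The poll-then-join loop on the healthy ranks could be sketched as follows. This is a minimal illustration, not the code in this PR: `ElasticGroup`, `mark_failed`, `poll_and_join`, and `serve` are hypothetical names, and the readiness check is passed in as a plain callback rather than any real SGLang or Mooncake API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class ElasticGroup:
    """Hypothetical bookkeeping for active ranks and ranks being relaunched."""
    active_ranks: Set[int]
    pending_ranks: Set[int] = field(default_factory=set)

    def mark_failed(self, rank: int) -> None:
        # The failed rank leaves the group; its relaunch is now pending.
        self.active_ranks.discard(rank)
        self.pending_ranks.add(rank)

    def poll_and_join(self, is_ready: Callable[[int], bool]) -> List[int]:
        # Non-blocking check: only ranks that report ready rejoin the group.
        joined = sorted(r for r in self.pending_ranks if is_ready(r))
        for rank in joined:
            self.pending_ranks.remove(rank)
            self.active_ranks.add(rank)
        return joined


def serve(group, is_ready, run_iteration, num_iterations):
    """Each iteration: poll for relaunched ranks, then run inference."""
    for step in range(num_iterations):
        group.poll_and_join(lambda rank: is_ready(rank, step))
        run_iteration(step, sorted(group.active_ranks))


# Usage: rank 2 fails, and its relaunched process becomes ready at step 2.
group = ElasticGroup(active_ranks={0, 1, 2, 3})
group.mark_failed(2)
log = []
serve(
    group,
    is_ready=lambda rank, step: step >= 2,
    run_iteration=lambda step, ranks: log.append((step, ranks)),
    num_iterations=4,
)
```

The healthy ranks never block on the relaunched process: steps 0 and 1 run on ranks 0, 1, and 3, and rank 2 rejoins from step 2 onward.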
flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C
    U[New process] --> E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Initialize distributed state]
    H --> D
Modifications
Add --mooncake-extend-group to server-args, indicating whether the process is relaunched to extend the ongoing group.
Accuracy Tests
test_mooncake_ep_small.py should pass.
Benchmarking and Profiling
Checklist