[6/N] (Elastic EP) Recover failed ranks by UNIDY2002 · Pull Request #15771 · sgl-project/sglang

UNIDY2002 · 2025-12-24T14:54:42Z

Motivation

As a follow-up to #11657, this PR enables SGLang to dynamically add back previously failed processes, recovering the optimal throughput. This also marks the completion of our planned support for elasticity and fault tolerance, as outlined in the roadmap in #8961.

The core idea is as follows. When a rank failure is detected, the data-parallel controller automatically relaunches the corresponding process. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process is ready, it rejoins the existing process group, so ongoing inference is disrupted as little as possible.

flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C

    U[New process] -->E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Initialize distributed state]
    H --> D
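
The left-hand loop of the flowchart can be sketched as a toy simulation. This is only an illustration of the control flow, not the code in this PR: `ToyGroup`, `serve`, and the `is_ready` callback are hypothetical names, and the real implementation polls Mooncake and extends actual process groups.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyGroup:
    """Hypothetical stand-in for a process group; tracks member ranks only."""
    ranks: List[int]
    log: List[str] = field(default_factory=list)

    def extend(self, rank: int) -> None:
        # Add the recovered rank back into the group.
        self.ranks = sorted(self.ranks + [rank])
        self.log.append(f"joined rank {rank}")


def serve(
    group: ToyGroup,
    pending_rank: Optional[int],
    is_ready: Callable[[int], bool],
    num_iterations: int,
) -> ToyGroup:
    """Run inference iterations, polling for the relaunched rank each step."""
    for step in range(num_iterations):
        # Poll once per iteration; inference is never blocked on the new rank.
        if pending_rank is not None and is_ready(step):
            group.extend(pending_rank)
            pending_rank = None
        group.log.append(f"step {step}: inference on ranks {group.ranks}")
    return group


if __name__ == "__main__":
    # Rank 2 failed earlier; assume its replacement becomes ready at step 2.
    group = serve(ToyGroup([0, 1, 3]), 2, lambda step: step >= 2, 4)
    print("\n".join(group.log))
```

The healthy ranks keep iterating whether or not the new process is ready, which is the property the design relies on to keep disruption low.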

Modifications

Add a store-true flag --mooncake-extend-group to server-args, indicating whether the process is relaunched to extend the ongoing group.
In parallel-state, pass the flag to Mooncake during process group initialization.
During inference, poll the status of new processes and extend the ongoing process groups when necessary.
In data-parallel-controller, relaunch processes after a failure is detected.
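
As a rough sketch of the first and last items, a store-true flag can be declared with `argparse` and appended to the arguments of a relaunched process. Only the flag name `--mooncake-extend-group` comes from this PR; `build_parser` and `relaunch_argv` are hypothetical helpers, and the help text is an assumption. The actual server-args and controller code may be structured differently.

```python
import argparse
from typing import List

EXTEND_FLAG = "--mooncake-extend-group"  # flag name taken from this PR


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical minimal parser holding only the new flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        EXTEND_FLAG,
        action="store_true",
        help="Set when the process is relaunched to extend an ongoing group.",
    )
    return parser


def relaunch_argv(original_argv: List[str]) -> List[str]:
    """Return the argument list for a relaunched process (hypothetical helper)."""
    if EXTEND_FLAG in original_argv:
        return list(original_argv)
    return list(original_argv) + [EXTEND_FLAG]


if __name__ == "__main__":
    first_launch = build_parser().parse_args([])
    relaunched = build_parser().parse_args(relaunch_argv([]))
    print(first_launch.mooncake_extend_group, relaunched.mooncake_extend_group)
```

With `action="store_true"`, the attribute defaults to `False` on a normal launch and is `True` only when the flag is present, so the group-initialization code can branch on it.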

Accuracy Tests

test_mooncake_ep_small.py should pass.

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

Skip EPLB rebalance after rank recovery and directly sync expert weights to recovered ranks instead of using EPLB, which causes asymmetric P2P operations due to incorrect old_expert_location_metadata for recovered ranks.

The root cause is that ExpertLocationUpdater relies on both old_expert_location_metadata and new_expert_location_metadata. For a recovered rank, the old metadata is stale or incorrect, leading to wrong P2P operations being calculated (e.g., rank 6 expects to send to rank 7, but rank 7 doesn't produce a corresponding irecv op).

Instead of modifying expert_location_updater.py, the correct approach is to skip EPLB after 'EPLB due to rank faults' and let the expert weights be synced through normal Mooncake EP operation.

See: sgl-project#15771
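
The mismatch described above can be reproduced in a toy model. This is a simplified illustration, not the logic of ExpertLocationUpdater: the expert-to-rank dictionaries and the functions `plan_ops` and `unmatched_sends` are hypothetical, and each rank is assumed to derive its own send/recv plan from its local copy of the old metadata.

```python
from typing import Dict, List, Set, Tuple

Op = Tuple[str, int, int, int]  # (kind, src_rank, dst_rank, expert_id)


def plan_ops(rank: int, old: Dict[int, int], new: Dict[int, int]) -> Set[Op]:
    """Ops that `rank` would issue, based on its own view of the old layout."""
    ops: Set[Op] = set()
    for expert, dst in new.items():
        src = old[expert]
        if src == dst:
            continue  # expert already in place according to this view
        if src == rank:
            ops.add(("send", src, dst, expert))
        if dst == rank:
            ops.add(("recv", src, dst, expert))
    return ops


def unmatched_sends(
    views: Dict[int, Dict[int, int]], new: Dict[int, int]
) -> List[Tuple[int, int, int]]:
    """Sends for which the destination rank planned no matching recv."""
    plans = {rank: plan_ops(rank, old, new) for rank, old in views.items()}
    missing = []
    for rank in sorted(plans):
        for kind, src, dst, expert in sorted(plans[rank]):
            if kind == "send" and ("recv", src, dst, expert) not in plans.get(dst, set()):
                missing.append((src, dst, expert))
    return missing


if __name__ == "__main__":
    new = {0: 6, 1: 7}        # target layout: expert 1 moves to rank 7
    healthy_old = {0: 6, 1: 6}  # what rank 6 believes the old layout was
    stale_old = {0: 6, 1: 7}    # recovered rank 7 holds stale metadata
    print(unmatched_sends({6: healthy_old, 7: stale_old}, new))
    print(unmatched_sends({6: healthy_old, 7: healthy_old}, new))
```

In the stale case, rank 6 plans a send to rank 7 that rank 7 never plans to receive, which mirrors the rank 6 / rank 7 example in the message above. When both ranks share the same old metadata, every send has a matching recv.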

This was referenced Dec 24, 2025

[Elastic EP] First draft of scaling up #14118

Closed

Elastic EP Support (Milestone 1 & 2) #8961

Closed

UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 8ce7735 to f620cab on January 25, 2026 14:44

UNIDY2002 mentioned this pull request Mar 5, 2026

[PG] Share P2PProxy/ConnectionPoller threads across backends. kvcache-ai/Mooncake#1607

Merged

21 tasks

UNIDY2002 force-pushed the mooncake-pr-recovery branch from f620cab to 50203d8 on March 5, 2026 06:37

UNIDY2002 force-pushed the mooncake-pr-recovery branch from 50203d8 to 3cd8a7f on March 10, 2026 03:17

Recover ranks

c1e0869

UNIDY2002 force-pushed the mooncake-pr-recovery branch from 3cd8a7f to c1e0869 on March 17, 2026 06:18

UNIDY2002 wants to merge 1 commit into sgl-project:main from HanHan009527:mooncake-pr-recovery