
[6/N] (Elastic EP) Recover failed ranks #15771

Draft
UNIDY2002 wants to merge 1 commit into sgl-project:main from HanHan009527:mooncake-pr-recovery


@UNIDY2002
Contributor

Motivation

As a follow-up to #11657, this PR enables SGLang to dynamically add previously failed processes back into service, restoring full throughput. It also completes our planned support for elasticity and fault tolerance, as outlined in the roadmap in #8961.

The core idea is as follows. When a rank failure is detected, the data-parallel controller automatically relaunches the corresponding process. Meanwhile, the remaining healthy processes continue serving ongoing inference requests and periodically poll the status of the relaunched process. Once the new process is ready, it rejoins the existing process group. This design minimizes disruption to ongoing inference.

flowchart TD
    X[Healthy processes] --> Y[Failed process found]
    Y --> A
    A[Normal Inference Iteration] --> B{Is new process ready?}
    B -- No --> C[Run inference normally]
    C --> A
    B -- Yes --> D[Join new process into process group]
    D --> C

    U[New process] -->E
    E[Process Relaunched] --> F[Setup Python modules, CUDA, etc.]
    F --> H[Initialize distributed state]
    H --> D
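The polling loop in the flowchart can be sketched as a small simulation. This is only an illustration of the control flow, not the code in this PR: `ToyGroup`, `poll_ready`, `extend`, and `serve` are hypothetical names, and the readiness check is simulated with a countdown rather than a real status query to Mooncake.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGroup:
    """Hypothetical stand-in for a process group; not an SGLang or Mooncake class."""

    active: set
    # rank -> number of iterations left until the relaunched process is ready
    pending: dict = field(default_factory=dict)

    def poll_ready(self):
        """Return pending ranks whose simulated startup has finished."""
        ready = [rank for rank, left in self.pending.items() if left <= 0]
        for rank in self.pending:
            self.pending[rank] -= 1  # simulate startup progress between polls
        return ready

    def extend(self, ranks):
        """Add recovered ranks to the group and stop tracking them as pending."""
        for rank in ranks:
            self.active.add(rank)
            del self.pending[rank]


def serve(group, iterations):
    """Poll for recovered ranks before each iteration; never block on them."""
    sizes = []
    for _ in range(iterations):
        ready = group.poll_ready()
        if ready:
            group.extend(ready)
        # Stands in for one normal inference iteration on the current group.
        sizes.append(len(group.active))
    return sizes


# Rank 3 failed and was relaunched; ranks 0-2 keep serving in the meantime.
group = ToyGroup(active={0, 1, 2}, pending={3: 2})
history = serve(group, 5)
```

The healthy ranks run every iteration regardless of the relaunched rank's state, and the group only grows at an iteration boundary, which is the property the flowchart describes.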

Modifications

  • Add a store-true flag --mooncake-extend-group to server-args, indicating whether the process was relaunched to extend an existing group
  • In parallel-state, pass the flag to Mooncake during process group initialization
  • During inference, poll the status of new processes and extend the existing process groups when necessary
  • In data-parallel-controller, relaunch processes after a failure is detected
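As a rough illustration of the first and last items, a store-true flag and a relaunch command line could look like the sketch below. Only the flag name --mooncake-extend-group comes from this PR; the standalone parser and the `relaunch_args` helper are assumptions for illustration and do not reflect how SGLang's server arguments or controller are actually structured.

```python
import argparse

# Standalone parser for illustration only; SGLang defines its own argument set.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--mooncake-extend-group",
    action="store_true",
    help="Set when the process is relaunched to extend an existing group.",
)


def relaunch_args(original_argv):
    """Hypothetical helper: reuse the original arguments and add the flag once."""
    flag = "--mooncake-extend-group"
    if flag in original_argv:
        return list(original_argv)
    return [*original_argv, flag]


first_launch = parser.parse_args([])
relaunched = parser.parse_args(relaunch_args([]))
```

Because the flag defaults to false, an initial launch is unaffected, and only a relaunched process takes the group-extension path.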

Accuracy Tests

test_mooncake_ep_small.py should pass.

Benchmarking and Profiling

Checklist


@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch 2 times, most recently from 8ce7735 to f620cab Compare January 25, 2026 14:44
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from f620cab to 50203d8 Compare March 5, 2026 06:37
UNIDY2002 added a commit to HanHan009527/sglang that referenced this pull request Mar 9, 2026
Skip EPLB rebalance after rank recovery and directly sync expert weights
to recovered ranks instead of using EPLB which causes asymmetric P2P
operations due to incorrect old_expert_location_metadata for recovered ranks.

The root cause is that ExpertLocationUpdater relies on both
old_expert_location_metadata and new_expert_location_metadata. For a
recovered rank, the old metadata is stale/incorrect, leading to wrong
P2P operations being calculated (e.g., rank 6 expects to send to rank 7,
but rank 7 doesn't produce a corresponding irecv op).

Instead of modifying expert_location_updater.py, the correct approach
is to skip EPLB after 'EPLB due to rank faults' and let the expert
weights be synced through normal Mooncake EP operation.

See: sgl-project#15771
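The asymmetry described in this commit message can be illustrated with a small check that pairs every send with a matching receive. This is a toy sketch under assumed data structures: the `("isend", peer)` / `("irecv", peer)` tuples and the `unmatched_p2p` function are hypothetical and are not taken from expert_location_updater.py.

```python
from collections import Counter


def unmatched_p2p(ops_by_rank):
    """Return (sends without a matching irecv, irecvs without a matching isend).

    ops_by_rank maps a rank to a list of (kind, peer) tuples, where kind is
    "isend" or "irecv". Each pair is keyed as (source_rank, destination_rank).
    """
    sends = Counter()
    recvs = Counter()
    for rank, ops in ops_by_rank.items():
        for kind, peer in ops:
            if kind == "isend":
                sends[(rank, peer)] += 1
            elif kind == "irecv":
                recvs[(peer, rank)] += 1
            else:
                raise ValueError(f"unknown op kind: {kind}")
    # Counter subtraction keeps only positive counts, i.e. the unmatched ops.
    missing_recv = sorted((sends - recvs).elements())
    missing_send = sorted((recvs - sends).elements())
    return missing_recv, missing_send


# Rank 6 plans a send to rank 7, but rank 7 (the recovered rank, planning
# from different metadata) produces no corresponding receive.
stale_plan = {6: [("isend", 7)], 7: []}
missing_recv, missing_send = unmatched_p2p(stale_plan)

# A consistent plan has no unmatched operations.
consistent_plan = {6: [("isend", 7)], 7: [("irecv", 6)]}
ok_recv, ok_send = unmatched_p2p(consistent_plan)
```

An unmatched send like the one in `stale_plan` is the situation the commit avoids by skipping the EPLB rebalance after recovery.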
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from 50203d8 to 3cd8a7f Compare March 10, 2026 03:17
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-recovery branch from 3cd8a7f to c1e0869 Compare March 17, 2026 06:18