feat: introduce async rebalance mode for expert load balancer#8529
TheBasy wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes
Hello @TheBasy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the Expert Load Balancer (EPLB) by introducing an asynchronous rebalance mode. This new mode offloads the rebalance plan computation to a background thread and employs non-blocking communication primitives for distributed synchronization, aiming to reduce the performance impact of rebalancing on the main training loop.
Highlights
- Asynchronous Rebalance Mode: Introduced a new asynchronous rebalance mode for the Expert Load Balancer (EPLB), enabled via the `--eplb-rebalance-async-mode` CLI argument. This mode allows the computationally intensive rebalance plan calculation to occur in a background thread, preventing it from blocking the main training loop.
- Non-blocking Communication Primitives: Added new `send_signal_cpu` and `recv_signal_cpu` methods to the `ProcessGroup` class. These methods use `torch.distributed.isend` and `irecv` for non-blocking, CPU-based integer signal communication, crucial for coordinating the asynchronous rebalance across distributed ranks.
- TP-wide Synchronization Barrier: Implemented a new synchronization mechanism (`tp_sync`) using the non-blocking CPU signals. This ensures that all ranks in the Tensor Parallel (TP) group agree on a specific iteration (`_begin_transfer`) before the rebalance plan is applied, guaranteeing an atomic plan swap across the entire group.
- Yield-based Generator for Training Loop Integration: The rebalance process is now integrated into the main training loop via a yield-based generator. This design allows the rebalance computation and synchronization steps to interleave with training iterations, keeping the loop non-blocking while the rebalance is prepared in the background.
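As an illustration of the yield-based design described above, here is a minimal toy sketch. The names `rebalance_generator` and `slow_plan` are hypothetical stand-ins, not the PR's actual code; the point is only how a generator lets a background computation interleave with loop iterations:

```python
import threading
import time

def rebalance_generator(compute_fn):
    """Drive a background plan computation as a generator.

    Each next() call corresponds to one loop iteration: the generator
    yields None while the plan is still being computed, then yields the
    finished plan once the background thread is done.
    """
    result = {}
    done = threading.Event()

    def worker():
        result["plan"] = compute_fn()  # expensive plan computation
        done.set()

    # Start the expensive computation off the main thread.
    threading.Thread(target=worker, daemon=True).start()

    # Each yield returns control to the main loop.
    while not done.is_set():
        yield None
    yield result["plan"]

def slow_plan():
    time.sleep(0.05)  # stands in for rebalance plan computation
    return "new_expert_placement"

# Main loop: interleave iterations with the rebalance generator.
gen = rebalance_generator(slow_plan)
plan = None
steps = 0
for plan in gen:         # one generator step per loop iteration
    steps += 1
    time.sleep(0.001)    # a real loop would run a decode step here
    if plan is not None:
        break
print(plan)  # new_expert_placement
```

The main loop never blocks on the computation; it simply keeps pulling from the generator until a plan appears.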
Code Review
This pull request introduces an asynchronous rebalancing mode for the expert load balancer to improve performance by reducing blocking time in the main loop. The changes involve adding new asynchronous signaling functions, using background threads for computation and synchronization, and updating the server arguments.
The implementation is well-structured, but there are a few areas for improvement:
- There are a couple of incorrect type hints in the new signaling functions.
- A hardcoded value for the number of GPUs per node could cause issues on different hardware setups.
- There is some dead code (an unused variable).
- The waiting logic for asynchronous operations can be made more efficient.
The line `torch.cuda.set_device(local_rank % 8)` uses a hardcoded value `8` for the number of GPUs per node. This assumption may not hold for all hardware configurations and could lead to a `CUDA error: invalid device ordinal` if a node has a different number of GPUs. It's better to determine the number of devices dynamically using `torch.cuda.device_count()`.
```diff
- torch.cuda.set_device(local_rank%8)
+ torch.cuda.set_device(local_rank % torch.cuda.device_count())
```
The function is type-hinted to return None, but it actually returns a tuple (w, size_tensor). The type hint should be updated to reflect the actual return value for better type safety and code clarity.
```diff
- def send_signal_cpu(self, dst_rank_global: int, value: int, group: Optional[ProcessGroup] = None,) -> None:
+ def send_signal_cpu(self, dst_rank_global: int, value: int, group: Optional[ProcessGroup] = None,) -> Tuple[torch.distributed.Work, torch.Tensor]:
```
The function is type-hinted to return Any, but it returns a tuple (w, size_tensor). For better code clarity and type safety, the type hint should be more specific.
```diff
- def recv_signal_cpu(self, src_rank_global: int, group: Optional[ProcessGroup] = None, ) -> Any:
+ def recv_signal_cpu(self, src_rank_global: int, group: Optional[ProcessGroup] = None, ) -> Tuple[torch.distributed.Work, torch.Tensor]:
```
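The non-blocking signal pattern these methods implement can be illustrated with a thread-and-queue stand-in. This is a toy emulation of `isend`/`irecv` semantics (a posted operation returns a work handle immediately; completion is observed via `wait()`/`is_completed()`), not the real torch API:

```python
import queue
import threading

class Work:
    """Minimal handle mimicking torch.distributed.Work."""
    def __init__(self):
        self._done = threading.Event()
        self.value = None
    def is_completed(self):
        return self._done.is_set()
    def wait(self):
        self._done.wait()

channel = queue.Queue()  # stands in for the gloo CPU channel

def isend(value):
    w = Work()
    def run():
        channel.put(value)       # "transmit" the integer signal
        w._done.set()
    threading.Thread(target=run, daemon=True).start()
    return w                     # returns immediately, like isend

def irecv():
    w = Work()
    def run():
        w.value = channel.get()  # blocks in background, not the caller
        w._done.set()
    threading.Thread(target=run, daemon=True).start()
    return w

recv_work = irecv()   # post the receive first (non-blocking)
send_work = isend(42) # send the signal (non-blocking)
for w in (send_work, recv_work):
    w.wait()
print(recv_work.value)  # 42
```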
The `while True` loop with a nested `for w in works: w.wait()` waits for the asynchronous operations sequentially. This negates the benefit of using non-blocking `isend`/`irecv` and is inefficient.
A better approach is to wait for all operations to complete in parallel by checking their completion status in a loop. Also, the `while True` loop in the original implementation is redundant, as it would only ever execute once.
```python
while not all(w.is_completed() for w in works):
    time.sleep(0.001)
self._tp_sync_ongoing = False
```
We ran the bench_serving test with a batch size of 4096 and a request rate of 4 req/s. Under this load, switching from synchronous to asynchronous EPLB yields a small but consistent throughput gain (~0.3% more tokens and requests per second) while also cutting latency: mean end-to-end time drops 2-3%, P99 TTFT and max inter-token latency fall 9% and 32% respectively, and average in-flight concurrency shrinks by ~2%. These improvements confirm that moving expert-parameter transfers to a background thread effectively overlaps communication with computation, giving higher hardware utilization and noticeably better tail latency for large-batch, high-concurrency inference workloads.
fzyzcjy left a comment
Oh, I realize the code is not ready, so please ignore the code cleanup suggestions below for now. As long as the code is clean and neat when it is ready, that is good!
Force-pushed from d2173a4 to 3fd46b3
/gemini review
Code Review
This pull request introduces an asynchronous mode for expert rebalancing in EPLB, which is a great feature for improving performance by overlapping computation and communication with the main model execution loop. The use of background threads and a dedicated CPU process group for synchronization is well-thought-out.
My review focuses on ensuring the correctness of the new asynchronous logic. I've identified a couple of race conditions related to thread synchronization flags that could lead to incorrect behavior. I've also suggested some minor renamings for clarity.
Overall, the changes are a significant improvement. Addressing the identified issues will make the implementation more robust.
As mentioned in the comment on `_compute_expert_metadata`, to fix the race condition with `self._compute_ongoing`, it should be set to `True` here before starting the background thread.
```diff
+ self._compute_ongoing = True
  self._compute_thread = threading.Thread(
      target=self._compute_expert_metadata,
      args=(logical_count_sum,),
      daemon=True
  )
  self._compute_thread.start()
```
There is a race condition in how self._compute_ongoing is managed. It's set to True here in the background thread, but the main thread in compute() starts checking it in a loop immediately after starting this thread. The main thread might read the old False value before this line is executed, causing it to skip waiting for the computation to finish.
To fix this, self._compute_ongoing should be set to True in the compute() method before this thread is started. This line should be removed. I'll add another comment in compute() with the corresponding change.
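A small runnable sketch of the set-before-start pattern (the `Rebalancer` class and its members are hypothetical stand-ins for the PR's code): because the flag is raised on the main thread before the worker exists, the spin-wait can never observe a stale `False` and skip waiting.

```python
import threading
import time

class Rebalancer:
    def __init__(self):
        self._compute_ongoing = False
        self._result = None

    def _compute(self):
        time.sleep(0.01)                 # stands in for plan computation
        self._result = "plan"
        self._compute_ongoing = False    # cleared only by the worker

    def compute(self):
        # Set the flag on the main thread BEFORE starting the worker,
        # so the spin-wait below cannot race with the worker setting it.
        self._compute_ongoing = True
        threading.Thread(target=self._compute, daemon=True).start()
        while self._compute_ongoing:     # main-thread wait loop
            time.sleep(0.001)
        return self._result

r = Rebalancer()
print(r.compute())  # plan
```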
As mentioned in the comment on `_wait_compute_sig`, to fix the race condition with `self._tp_sync_ongoing`, it should be set to `True` here before starting the background thread.
```diff
+ self._tp_sync_ongoing = True
  self._tp_sync_thread = threading.Thread(
      target=self._wait_compute_sig,
      args=(send_works, recv_works,),
      daemon=True
  )
  self._tp_sync_thread.start()
```
Similar to _compute_ongoing, there is a race condition in how self._tp_sync_ongoing is managed. It's set to True here in the background thread, but the main thread in tp_sync() starts checking it in a loop after starting this thread. The main thread might read the old False value before this line is executed, causing it to skip waiting for the TP sync to finish.
To fix this, self._tp_sync_ongoing should be set to True in the tp_sync() method before this thread is started. This line should be removed. I'll add another comment in tp_sync() with the corresponding change.
/gemini review
Code Review
This pull request introduces an asynchronous rebalancing mode for the expert load balancer, which is a great feature for improving performance by overlapping computation and communication. The implementation uses background threads and generators to keep the main loop non-blocking, which is a solid approach.
My main concerns are around the robustness of the background threading. Specifically, exceptions in the background threads are not propagated and will cause the main loop to hang. I've added critical comments with suggestions to fix this using try...finally blocks.
I've also included a few medium-severity comments to improve code clarity and maintainability, such as removing redundant yield statements, clarifying a magic number with a comment, and renaming a function to better reflect its purpose.
Overall, this is a well-structured feature, and addressing the exception handling will make it much more robust.
If an exception occurs within ExpertLocationMetadata.init_by_eplb, the self._compute_ongoing = False line will not be executed. This will cause the main generator loop in compute() to hang indefinitely in the while self._compute_ongoing: spin-wait loop.
To prevent this, you should wrap the logic in a try...finally block to ensure self._compute_ongoing is always set to False, even if an error occurs.
For even greater robustness, consider capturing the exception and propagating it to the main thread to be handled, rather than just preventing the hang. A simple way to do this is to store the exception in an instance variable (e.g., self._compute_exception) and check for it in the main loop after the thread completes.
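The propagate-via-instance-variable idea can be sketched as follows. The `Worker` class and its members are hypothetical, not the PR's code; the shape is `try/except/finally` in the background thread plus a re-raise on the main thread:

```python
import threading

class Worker:
    def __init__(self):
        self._ongoing = False
        self._exception = None
        self._result = None

    def _run(self, fail):
        try:
            if fail:
                raise RuntimeError("rebalance failed")
            self._result = "ok"
        except Exception as e:
            self._exception = e       # stash for the main thread
        finally:
            self._ongoing = False     # always unblock the waiter

    def run(self, fail=False):
        self._ongoing = True
        t = threading.Thread(target=self._run, args=(fail,), daemon=True)
        t.start()
        t.join()                      # a real loop would spin-wait instead
        if self._exception is not None:
            raise self._exception     # surface the error where it is handled
        return self._result

w = Worker()
assert w.run() == "ok"
try:
    w.run(fail=True)
except RuntimeError as e:
    print(e)  # rebalance failed
```

Either way the waiter is always released; with the stashed exception, the failure is additionally visible instead of being silently swallowed by the daemon thread.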
Suggested change:

```python
def _compute_expert_metadata(self, logical_count):
    try:
        local_rank = self._model_runner.tp_rank
        num_gpu_per_node = self._num_gpu_per_node
        torch.cuda.set_device(local_rank % num_gpu_per_node)
        expert_location_metadata = ExpertLocationMetadata.init_by_eplb(
            self._server_args, self._model_runner.model_config, logical_count
        )
        self._rebalance_result = expert_location_metadata
    finally:
        self._compute_ongoing = False
```
If any of the w.wait() calls raise an exception (e.g., due to a timeout or a distributed communication error), the self._tp_sync_ongoing = False line will be skipped. This will cause the main generator loop in tp_sync() to hang in the while self._tp_sync_ongoing: spin-wait loop.
To ensure the application remains responsive, you should wrap the waiting logic in a try...finally block. This guarantees that _tp_sync_ongoing is set to False regardless of whether the operations succeed or fail.
Suggested change:

```python
def _wait_compute_sig(self, send_works, recv_works):
    try:
        works = send_works + recv_works
        for w in works:
            w.wait()
    finally:
        self._tp_sync_ongoing = False
```
This yield statement appears to be redundant. The while self._compute_ongoing: loop on the following lines will yield on its first iteration if the condition is met. If the condition is not met (i.e., the computation is already complete), the loop is skipped, and yielding here is unnecessary. Removing this line would make the code slightly cleaner without changing the logic.
The + 2 here is a magic number. For better maintainability and readability, it would be helpful to add a comment explaining why a buffer of 2 steps is needed. This will help future developers understand the synchronization logic.
For example, a comment could clarify that this buffer ensures all ranks have processed the synchronization signal before proceeding with the transfer.
```python
# Add a buffer of 2 steps to ensure all ranks see the new step before acting on it.
self._begin_transfer_step = max(max_tensor.item(), self._send_signal_step) + 2
```
The function name _wait_compute_sig is a bit misleading, as this function waits for communication signals (isend/irecv works) from the tensor parallel group to complete, not for a computation signal. A more descriptive name like _wait_for_tp_sync_signals would better reflect its purpose and improve code clarity.
Remember to update the call site in tp_sync (line 145) if you rename this function.
```diff
- def _wait_compute_sig(self, send_works,recv_works):
+ def _wait_for_tp_sync_signals(self, send_works,recv_works):
```
Thanks for the heads-up!
- Add CLI arg `--enable-eplb-rebalance-async`
- Background thread: broadcast `logical_count` → compute `ExpertLocationMetadata` → store in `_rebalance_result`
- TP barrier: use gloo `cpu_group` signals (`send_single_signal` / `recv_single_signal`) to ensure all ranks swap the plan atomically
- Yield-based generator keeps the decoding loop non-blocking; transfer starts after TP-wide agreement via `_begin_transfer`
- Sync mode (`async=False`) unchanged: blocking single-thread rebalance
The comment below refers to this section of the change:

```python
        self._rebalance_result = expert_location_metadata
        self._compute_ongoing = False

    def tp_sync(self):
```
hmm, not sure whether this is optimal or not (I do not have time to dig deeper), but FYI there is a `PollBasedBarrier` that may be somehow related
Thanks for the note!
In our tp_sync function, we're aiming to implement a non-blocking all-reduce max operation to synchronize the step_counter across TP ranks. The goal is to ensure that all TP ranks proceed to the data transfer phase only after the maximum step_counter among them is reached, while allowing each rank to continue decoding independently during the synchronization window.
If TP ranks enter the transfer phase at different decoding steps, the earlier ones would have to transfer one layer and then wait for others to finish their current decode — this creates global blocking and hurts throughput.
We did try using torch.distributed.all_reduce directly, but it blocks the main inference thread due to the collective communication being synchronous, which disrupts the continuous decoding flow. That’s why we’re exploring alternative designs to achieve asynchronous coordination without stalling inference.
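The intended coordination (each rank contributes its step counter, all ranks agree on the maximum plus a buffer, and no rank's decode loop blocks in the meantime) can be illustrated with a plain-threading toy. This uses `threading.Barrier` as a stand-in for the CPU signal exchange, not the actual `torch.distributed` code:

```python
import threading

NUM_RANKS = 4
step_counters = [3, 7, 5, 6]          # current decode step per rank
agreed = [None] * NUM_RANKS
barrier = threading.Barrier(NUM_RANKS)
lock = threading.Lock()
shared_max = [0]

def background_max_sync(rank):
    # Non-blocking from the decode loop's perspective: this runs in a
    # background thread while the rank keeps "decoding".
    with lock:
        shared_max[0] = max(shared_max[0], step_counters[rank])
    barrier.wait()                    # all ranks have contributed
    agreed[rank] = shared_max[0] + 2  # +2 buffer before transfer begins

threads = [threading.Thread(target=background_max_sync, args=(r,))
           for r in range(NUM_RANKS)]
for t in threads:
    t.start()
# Each rank's main loop would keep decoding here instead of blocking.
for t in threads:
    t.join()
print(agreed)  # [9, 9, 9, 9]
```

All ranks converge on the same transfer step (max step 7, plus the buffer of 2) without any rank's main loop ever entering a collective call.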
Which MoE model did you test? Could you share your model server command? That would help with reproducing the test.
I tested DeepSeek-R1. You only need to add one line, `--enable-eplb-rebalance-async`, to the original launch arguments to enable asynchronous computation during rebalancing. It is also recommended to set `--eplb-rebalance-layers-per-chunk 1`, which allows layer-wise data migration during the rebalancing phase, further reducing the user-perceived interruption latency. For example, my original launch arguments for the decode stage are:

```shell
python3 -m sglang.launch_server \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend mooncake \
    --attention-backend flashmla \
    --trust-remote-code \
    --enable-dp-attention \
    --mem-fraction-static 0.9 \
    --context-length 65535 \
    --decode-log-interval 50 \
    --enable-cache-report \
    --moe-dense-tp-size 1 \
    --enable-dp-lm-head \
    --enable-deepep-moe \
    --deepep-mode low_latency \
    --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
    --model-path /home/models/deepseek-ai__DeepSeek-R1 \
    --host 0.0.0.0 \
    --port ${DEFAULT_PORT} \
    --dist-init-addr ${DECODE_MASTER_IP}:${MASTER_PORT} \
    --tp-size $((8 * NUM_DECODE)) \
    --dp-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE)) \
    --chunked-prefill-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * CHUNKED_PREFILL_SIZE_PER_DP_RANK)) \
    --max-running-requests $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * DECODE_MAX_RUNNING_REQUEST_PER_DP_RANK)) \
    --nnodes ${NUM_DECODE} \
    --node-rank $node_rank \
    --enable-dp-attention \
    --enable-expert-distribution-metrics \
    --enable-eplb \
    --eplb-rebalance-num-iterations 1000 \
    --cuda-graph-max-bs ${DECODE_MAX_RUNNING_REQUEST_PER_DP_RANK} \
    > ${WORKDIR}/decode.log 2>&1 &
```

After enabling async rebalance mode, the launch arguments become:

```shell
python3 -m sglang.launch_server \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend mooncake \
    --attention-backend flashmla \
    --trust-remote-code \
    --enable-dp-attention \
    --mem-fraction-static 0.9 \
    --context-length 65535 \
    --decode-log-interval 50 \
    --enable-cache-report \
    --moe-dense-tp-size 1 \
    --enable-dp-lm-head \
    --enable-deepep-moe \
    --deepep-mode low_latency \
    --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
    --model-path /home/models/deepseek-ai__DeepSeek-R1 \
    --host 0.0.0.0 \
    --port ${DEFAULT_PORT} \
    --dist-init-addr ${DECODE_MASTER_IP}:${MASTER_PORT} \
    --tp-size $((8 * NUM_DECODE)) \
    --dp-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE)) \
    --chunked-prefill-size $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * CHUNKED_PREFILL_SIZE_PER_DP_RANK)) \
    --max-running-requests $((DP_SIZE_PER_DECODE_NODE * NUM_DECODE * DECODE_MAX_RUNNING_REQUEST_PER_DP_RANK)) \
    --nnodes ${NUM_DECODE} \
    --node-rank $node_rank \
    --enable-dp-attention \
    --enable-expert-distribution-metrics \
    --enable-eplb \
    --eplb-rebalance-layers-per-chunk 1 \
    --enable-eplb-rebalance-async \
    --eplb-rebalance-num-iterations 1000 \
    --cuda-graph-max-bs ${DECODE_MAX_RUNNING_REQUEST_PER_DP_RANK} \
    > ${WORKDIR}/decode.log 2>&1 &
```
Do you have related benchmark results on the decode instance only?

@TheBasy Could you merge the latest master to resolve the conflicts?

You can check this PR for the merge: antgroup#3. It includes the latest changes from master and resolves the conflicts.
- CLI arg: `--eplb-rebalance-async-mode`
- Background thread: broadcast `logical_count` → compute `ExpertLocationMetadata` → store in `_rebalance_result`
- TP barrier: `cpu_group` signals (`send_signal_cpu` / `recv_signal_cpu`) ensure all ranks swap the plan atomically
- Transfer starts after TP-wide agreement via `_begin_transfer`
- Sync mode (`async=False`) unchanged: blocking single-thread rebalance

Motivation
Modifications
Accuracy Test
Benchmark & Profiling
Checklist