
[2/3] Optimize Slime Update Weights: Avoid GPU-to-CPU Device Sync when updating expert weights #8753

Merged

ch-wan merged 2 commits into main from avoid_device_sync_for_load_ep_weights on Aug 6, 2025

Conversation

@hebiao064 (Collaborator) commented Aug 4, 2025

Motivation

The graph below shows an unnecessary device-to-host sync:

[Screenshot: profiler trace showing the GPU-to-CPU device sync]

This PR implements optimization 2 in the series:

[Screenshot: overview of the planned optimizations]
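
For context on the cost: any Python-side read of a CUDA tensor's values (indexing followed by `.item()` or `.tolist()`) blocks the host until the device catches up, then copies data back. A minimal illustration of the per-call sync versus a one-time cached copy (illustrative only, not the actual sglang code):

```python
import torch

m = torch.randint(0, 8, (64, 16), device="cuda")

# Per-call pattern: every .tolist() on a CUDA tensor stalls the host until
# the GPU is idle and copies the row back -- one device sync per lookup.
for layer_id in range(64):
    _ = m[layer_id].tolist()

# Amortized pattern: pay for a single device-to-host copy up front; all
# later lookups read host memory and never touch the GPU again.
m_cpu = m.cpu()
for layer_id in range(64):
    _ = m_cpu[layer_id].tolist()
```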

Modifications
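
In short (per the review summary below): cache a CPU copy, `_logical_to_all_physical_map_cpu`, of the `logical_to_all_physical_map` tensor so that `logical_to_all_physical` reads host memory instead of pulling data off the GPU on every call. A sketch follows the review summary below.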

Accuracy Test

Benchmark & Profiling

Checklist

@hebiao064 requested a review from fzyzcjy as a code owner on August 4, 2025 at 06:44

@gemini-code-assist bot left a comment

Summary of Changes

Hello @hebiao064, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented a performance optimization aimed at reducing the overhead associated with updating expert weights. This change specifically targets the elimination of frequent and expensive GPU-to-CPU device synchronization, which was identified as a bottleneck. By caching necessary data on the CPU, I've streamlined the process of accessing expert location information, leading to more efficient weight updates.

Highlights

  • Performance Optimization for Expert Weight Updates: I've introduced a caching mechanism within the logical_to_all_physical method. By creating a CPU copy (_logical_to_all_physical_map_cpu) of the logical_to_all_physical_map tensor, we can avoid repeated and costly GPU-to-CPU data transfers, which are particularly expensive during expert weight updates.
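
A minimal sketch of the caching pattern described above. The method and attribute names follow the summary; the class name, tensor shape, and -1 padding are assumptions for illustration and may differ from the actual sglang code:

```python
import torch

class ExpertLocationMetadata:
    def __init__(self, logical_to_all_physical_map: torch.Tensor):
        # GPU-resident map, assumed shape
        # (num_layers, num_logical_experts, max_physical_copies), padded with -1.
        self.logical_to_all_physical_map = logical_to_all_physical_map
        # One-time device-to-host copy; later Python-side lookups read this
        # tensor and therefore no longer force a GPU-to-CPU sync per call.
        self._logical_to_all_physical_map_cpu = logical_to_all_physical_map.cpu()

    def logical_to_all_physical(self, layer_id: int, logical_expert_id: int) -> list[int]:
        # .tolist() on the cached CPU tensor is a pure host-memory read.
        return [
            physical_expert_id
            for physical_expert_id in self._logical_to_all_physical_map_cpu[
                layer_id, logical_expert_id
            ].tolist()
            if physical_expert_id != -1
        ]
```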

@gemini-code-assist bot left a comment

Code Review

This pull request introduces an optimization to avoid GPU-to-CPU synchronization in logical_to_all_physical by caching a CPU copy of the logical_to_all_physical_map tensor. While the optimization is a good idea, the current implementation introduces a critical bug where the cached data can become stale. The update method modifies the GPU tensor in-place but does not invalidate the new CPU cache, which will lead to incorrect expert routing. I've added a comment with details on the issue and how to resolve it.
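
Continuing the `ExpertLocationMetadata` sketch from the summary above, the shape of the fix the review points at is to refresh (or invalidate) the CPU cache whenever the GPU tensor is mutated in place; the `update` signature here is an assumption:

```python
    def update(self, new_map: torch.Tensor) -> None:
        # In-place copy keeps the GPU tensor's storage (and any views) valid.
        self.logical_to_all_physical_map.copy_(new_map)
        # Refresh the CPU cache so logical_to_all_physical never serves a
        # stale mapping; alternatively set it to None and rebuild lazily.
        self._logical_to_all_physical_map_cpu = self.logical_to_all_physical_map.cpu()
```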

@hebiao064 added the RLHF and ready-to-merge labels on Aug 4, 2025
@ch-wan merged commit cbbb738 into main on Aug 6, 2025 (166 of 178 checks passed)
@gemini-code-assist bot commented:

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ch-wan deleted the avoid_device_sync_for_load_ep_weights branch on August 6, 2025 at 05:09
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025

Labels

RLHF · ready-to-merge (The PR is ready to merge after the CI is green.)

3 participants