
Use reduce scatter for DP #8539

Merged
ch-wan merged 5 commits into sgl-project:main from trevor-m:reduce-scatter on Aug 6, 2025
Conversation

Collaborator

@trevor-m trevor-m commented Jul 29, 2025

Motivation

Similar to how #8280 enables all-gather for DP when using padding, this PR enables reduce-scatter instead of all-reduce following MoE/MLP layers in DeepSeek.

Modifications

If DP with padding is used, the all-reduce after MoE and MLP layers is skipped. The layer communicator then performs a reduce-scatter on the hidden states instead of a plain scatter, which produces the same result as the all-reduce followed by a scatter while moving roughly half as much data.

Currently this is implemented for DeepSeek, but it can easily be extended later to other models that use LayerCommunicator.
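The equivalence this PR relies on can be sketched in a few lines. This is an illustrative single-process NumPy simulation, not the SGLang implementation: with DP padding, every rank holds a same-shaped partial hidden-state tensor, so the all-reduce + scatter pair after MoE/MLP can be fused into one reduce-scatter.

```python
import numpy as np

world_size, tokens_per_rank, hidden = 4, 2, 8
rng = np.random.default_rng(0)
# partials[r] = the partial hidden states produced on rank r
# (padding makes every rank's tensor the same shape)
partials = [rng.standard_normal((world_size * tokens_per_rank, hidden))
            for _ in range(world_size)]

def all_reduce_then_scatter(parts, rank):
    # baseline: reduce the full tensor everywhere, then keep one slice
    full = np.sum(parts, axis=0)
    return full[rank * tokens_per_rank:(rank + 1) * tokens_per_rank]

def reduce_scatter(parts, rank):
    # fused: each rank reduces only the slice it will keep
    sl = slice(rank * tokens_per_rank, (rank + 1) * tokens_per_rank)
    return np.sum([p[sl] for p in parts], axis=0)

for r in range(world_size):
    assert np.allclose(all_reduce_then_scatter(partials, r),
                       reduce_scatter(partials, r))
```

In a real deployment the same identity holds for NCCL's collectives, where reduce-scatter sends about half the bytes of a ring all-reduce.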

Accuracy Test

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=30000
Accuracy: 0.959
Invalid: 0.000
Latency: 20.852 s
Output throughput: 6857.902 token/s

Benchmark & Profiling

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 1024 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 1024

with PR:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  70.14
Total input tokens:                      1048576
Total generated tokens:                  1048576
Total generated tokens (retokenized):    1046022
Request throughput (req/s):              14.60
Input token throughput (tok/s):          14949.45
Output token throughput (tok/s):         14949.45
Total token throughput (tok/s):          29898.91
Concurrency:                             1020.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   69927.88
Median E2E Latency (ms):                 69911.65
---------------Time to First Token----------------
Mean TTFT (ms):                          9383.06
Median TTFT (ms):                        9239.48
P99 TTFT (ms):                           17455.72
---------------Inter-Token Latency----------------
Mean ITL (ms):                           59.18
Median ITL (ms):                         50.93
P95 ITL (ms):                            62.90
P99 ITL (ms):                            69.13
Max ITL (ms):                            15106.24
==================================================

without PR:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  72.25
Total input tokens:                      1048576
Total generated tokens:                  1048576
Total generated tokens (retokenized):    1045943
Request throughput (req/s):              14.17
Input token throughput (tok/s):          14513.15
Output token throughput (tok/s):         14513.15
Total token throughput (tok/s):          29026.30
Concurrency:                             1020.99
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   72037.95
Median E2E Latency (ms):                 72012.01
---------------Time to First Token----------------
Mean TTFT (ms):                          9897.81
Median TTFT (ms):                        9698.01
P99 TTFT (ms):                           18522.87
---------------Inter-Token Latency----------------
Mean ITL (ms):                           60.74
Median ITL (ms):                         52.01
P95 ITL (ms):                            69.65
P99 ITL (ms):                            76.24
Max ITL (ms):                            16203.54
==================================================

Checklist

@trevor-m
Collaborator Author

@ch-wan Could you please take a look?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the communication strategy for Deepseek models when running with Data Parallelism (DP) and padding enabled. By replacing the standard all-reduce operation with reduce-scatter after Mixture-of-Experts (MoE) and Multi-Layer Perceptron (MLP) layers, the changes aim to improve performance and efficiency in these specific distributed training scenarios.

Highlights

  • Communication Optimization: Introduced dp_reduce_scatter_tensor to enable more efficient communication for Data Parallel (DP) operations, specifically for padded inputs.
  • Conditional Communication Strategy: Implemented logic within the layer communicator to conditionally use reduce-scatter instead of scatter on hidden states when DP with padding is active, optimizing data movement.
  • MLP/MoE All-Reduce Skip: Modified the Deepseek model's MLP and MoE layers to skip the all-reduce operation after their forward pass when reduce-scatter is being utilized, preventing redundant communication.
  • Parameter Renaming: Renamed the can_fuse_mlp_allreduce parameter to skip_all_reduce in linear layers for broader applicability and clarity regarding when all-reduce should be bypassed.
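The renamed flag's semantics can be illustrated with a toy row-parallel linear layer. All names here (`rank_forward`, `skip_all_reduce`) are hypothetical stand-ins for illustration, not SGLang's actual API: deferring the reduction and summing partials later gives the same result as finishing each matmul with an all-reduce.

```python
import numpy as np

world_size = 2
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 6))   # activations, sharded on the hidden dim
w = rng.standard_normal((6, 5))   # row-parallel weight, sharded on the input dim
x_shards = np.split(x, world_size, axis=1)
w_shards = np.split(w, world_size, axis=0)

def rank_forward(rank, skip_all_reduce):
    partial = x_shards[rank] @ w_shards[rank]
    if skip_all_reduce:
        return partial  # left partial; the communicator reduce-scatters later
    # stand-in for the usual all-reduce across ranks
    return sum(x_shards[r] @ w_shards[r] for r in range(world_size))

# summing the deferred partials recovers the full matmul result
deferred = sum(rank_forward(r, skip_all_reduce=True) for r in range(world_size))
assert np.allclose(deferred, x @ w)
```

The broader name fits because the caller may skip the all-reduce for any downstream reduction strategy, not only the fused-MLP case the old name implied.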

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optimization to use reduce_scatter instead of all_reduce for data parallelism (DP) when padding is enabled, which is a good performance enhancement. The changes are logical and well-contained. My main feedback focuses on improving code clarity and maintainability by reducing coupling between components and clarifying function names.

@trevor-m trevor-m force-pushed the reduce-scatter branch 3 times, most recently from 135e2a7 to e919208 on July 29, 2025 23:53
@kaixih
Collaborator

kaixih commented Jul 30, 2025

LGTM.

Collaborator

@nvcastet nvcastet left a comment


LGTM

@zhyncs zhyncs self-assigned this Aug 1, 2025
Collaborator

@ch-wan ch-wan left a comment


Nice work. I have approved the PR. Could you clean up merge conflicts?

@ch-wan
Collaborator

ch-wan commented Aug 2, 2025

Oh, I have another question. This communicator is also used in Llama and Qwen. I believe your optimization can be applied to these models as well.

@ch-wan ch-wan self-requested a review August 2, 2025 19:08
@trevor-m
Collaborator Author

trevor-m commented Aug 4, 2025

@ch-wan Thank you, I have fixed the conflicts. Yes, I believe this can be applied to those models too; can that be done in a follow-up PR?

@nvcastet
Collaborator

nvcastet commented Aug 5, 2025

@merrymercy Do you mind reviewing this PR since you are a code owner?

@ch-wan ch-wan merged commit c0e8429 into sgl-project:main Aug 6, 2025
55 of 65 checks passed
This was referenced Aug 7, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
