
Use reduce scatter for DP #8539

Merged
ch-wan merged 5 commits into sgl-project:main from trevor-m:reduce-scatter on Aug 6, 2025
Conversation

Collaborator

@trevor-m trevor-m commented Jul 29, 2025

Motivation

Similar to how #8280 enables all-gather for DP when using padding, this PR enables reduce-scatter instead of all-reduce following MoE/MLP layers in DeepSeek.

Modifications

If DP with padding is used, the all-reduce after MoE and MLP layers is skipped. The layer communicator then performs a reduce-scatter on the hidden states instead of a plain scatter, which produces the same result as the all-reduce followed by a scatter while moving roughly half as much data.

Currently this is implemented for DeepSeek, but it can easily be extended later to other models that use LayerCommunicator.
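The equivalence this PR relies on can be sketched in a few lines. This is an illustrative single-process NumPy simulation, not the SGLang implementation: with DP padding, every rank holds a same-shaped partial hidden-state tensor, so the all-reduce + scatter pair after MoE/MLP can be fused into one reduce-scatter.

```python
import numpy as np

world_size, tokens_per_rank, hidden = 4, 2, 8
rng = np.random.default_rng(0)
# partials[r] = the partial hidden states produced on rank r
# (padding makes every rank's tensor the same shape)
partials = [rng.standard_normal((world_size * tokens_per_rank, hidden))
            for _ in range(world_size)]

def all_reduce_then_scatter(parts, rank):
    # baseline: reduce the full tensor everywhere, then keep one slice
    full = np.sum(parts, axis=0)
    return full[rank * tokens_per_rank:(rank + 1) * tokens_per_rank]

def reduce_scatter(parts, rank):
    # fused: each rank reduces only the slice it will keep
    sl = slice(rank * tokens_per_rank, (rank + 1) * tokens_per_rank)
    return np.sum([p[sl] for p in parts], axis=0)

for r in range(world_size):
    assert np.allclose(all_reduce_then_scatter(partials, r),
                       reduce_scatter(partials, r))
```

In a real deployment the same identity holds for NCCL's collectives, where reduce-scatter sends about half the bytes of a ring all-reduce.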

Accuracy Test

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port=30000
Accuracy: 0.959
Invalid: 0.000
Latency: 20.852 s
Output throughput: 6857.902 token/s

Benchmark & Profiling

python3 -m sglang.launch_server --model-path nvidia/DeepSeek-R1-0528-FP4 --trust-remote-code --quantization modelopt_fp4 --tp 8 --enable-flashinfer-cutlass-moe --enable-ep-moe --ep-size 8 --dp 8 --enable-dp-attention
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompt 1024 --random-input 1024 --random-output 1024 --random-range-ratio 1 --max-concurrency 1024

with PR:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  70.14
Total input tokens:                      1048576
Total generated tokens:                  1048576
Total generated tokens (retokenized):    1046022
Request throughput (req/s):              14.60
Input token throughput (tok/s):          14949.45
Output token throughput (tok/s):         14949.45
Total token throughput (tok/s):          29898.91
Concurrency:                             1020.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   69927.88
Median E2E Latency (ms):                 69911.65
---------------Time to First Token----------------
Mean TTFT (ms):                          9383.06
Median TTFT (ms):                        9239.48
P99 TTFT (ms):                           17455.72
---------------Inter-Token Latency----------------
Mean ITL (ms):                           59.18
Median ITL (ms):                         50.93
P95 ITL (ms):                            62.90
P99 ITL (ms):                            69.13
Max ITL (ms):                            15106.24
==================================================

without PR:

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1024
Successful requests:                     1024
Benchmark duration (s):                  72.25
Total input tokens:                      1048576
Total generated tokens:                  1048576
Total generated tokens (retokenized):    1045943
Request throughput (req/s):              14.17
Input token throughput (tok/s):          14513.15
Output token throughput (tok/s):         14513.15
Total token throughput (tok/s):          29026.30
Concurrency:                             1020.99
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   72037.95
Median E2E Latency (ms):                 72012.01
---------------Time to First Token----------------
Mean TTFT (ms):                          9897.81
Median TTFT (ms):                        9698.01
P99 TTFT (ms):                           18522.87
---------------Inter-Token Latency----------------
Mean ITL (ms):                           60.74
Median ITL (ms):                         52.01
P95 ITL (ms):                            69.65
P99 ITL (ms):                            76.24
Max ITL (ms):                            16203.54
==================================================

Checklist

@trevor-m
Collaborator Author

@ch-wan Could you please take a look?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the communication strategy for Deepseek models when running with Data Parallelism (DP) and padding enabled. By replacing the standard all-reduce operation with reduce-scatter after Mixture-of-Experts (MoE) and Multi-Layer Perceptron (MLP) layers, the changes aim to improve performance and efficiency in these specific distributed training scenarios.

Highlights

  • Communication Optimization: Introduced dp_reduce_scatter_tensor to enable more efficient communication for Data Parallel (DP) operations, specifically for padded inputs.
  • Conditional Communication Strategy: Implemented logic within the layer communicator to conditionally use reduce-scatter instead of scatter on hidden states when DP with padding is active, optimizing data movement.
  • MLP/MoE All-Reduce Skip: Modified the Deepseek model's MLP and MoE layers to skip the all-reduce operation after their forward pass when reduce-scatter is being utilized, preventing redundant communication.
  • Parameter Renaming: Renamed the can_fuse_mlp_allreduce parameter to skip_all_reduce in linear layers for broader applicability and clarity regarding when all-reduce should be bypassed.
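The renamed flag's semantics can be illustrated with a toy row-parallel linear layer. All names here (`rank_forward`, `skip_all_reduce`) are hypothetical stand-ins for illustration, not SGLang's actual API: deferring the reduction and summing partials later gives the same result as finishing each matmul with an all-reduce.

```python
import numpy as np

world_size = 2
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 6))   # activations, sharded on the hidden dim
w = rng.standard_normal((6, 5))   # row-parallel weight, sharded on the input dim
x_shards = np.split(x, world_size, axis=1)
w_shards = np.split(w, world_size, axis=0)

def rank_forward(rank, skip_all_reduce):
    partial = x_shards[rank] @ w_shards[rank]
    if skip_all_reduce:
        return partial  # left partial; the communicator reduce-scatters later
    # stand-in for the usual all-reduce across ranks
    return sum(x_shards[r] @ w_shards[r] for r in range(world_size))

# summing the deferred partials recovers the full matmul result
deferred = sum(rank_forward(r, skip_all_reduce=True) for r in range(world_size))
assert np.allclose(deferred, x @ w)
```

The broader name fits because the caller may skip the all-reduce for any downstream reduction strategy, not only the fused-MLP case the old name implied.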

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optimization to use reduce_scatter instead of all_reduce for data parallelism (DP) when padding is enabled, which is a good performance enhancement. The changes are logical and well-contained. My main feedback focuses on improving code clarity and maintainability by reducing coupling between components and clarifying function names.

@trevor-m trevor-m force-pushed the reduce-scatter branch 3 times, most recently from 135e2a7 to e919208 on July 29, 2025 23:53
@kaixih
Collaborator

kaixih commented Jul 30, 2025

LGTM.

Collaborator

@nvcastet nvcastet left a comment


LGTM

@zhyncs zhyncs self-assigned this Aug 1, 2025
Collaborator

@ch-wan ch-wan left a comment


Nice work. I have approved the PR. Could you clean up merge conflicts?

@ch-wan
Collaborator

ch-wan commented Aug 2, 2025

Oh, I have another question. This communicator is also used in Llama and Qwen. I believe your optimization can be applied to these models as well.

@ch-wan ch-wan self-requested a review August 2, 2025 19:08
@trevor-m
Collaborator Author

trevor-m commented Aug 4, 2025

@ch-wan Thank you, I have fixed the conflicts. Yes, I believe this can be applied to those models too; can that be done in a follow-up PR?

@nvcastet
Collaborator

nvcastet commented Aug 5, 2025

@merrymercy Do you mind reviewing this PR since you are a code owner?

@ch-wan ch-wan merged commit c0e8429 into sgl-project:main Aug 6, 2025
55 of 65 checks passed
This was referenced Aug 7, 2025
narutolhy pushed a commit to narutolhy/sglang that referenced this pull request Aug 17, 2025
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
