Support reduce-scatter when dp=tp #5988
Conversation
Hi @ch-wan, is this ready to merge or should I do any modifications?

Hi @ch-wan, any updates?

@EstherBear Sorry for the late response. I'm going to review it tonight.
```python
if self.tp_size > 1:
    final_hidden_states = tensor_model_parallel_all_reduce(final_hidden_states)
if self.tp_size == self.dp_size:
    tensor_list = list(final_hidden_states.tensor_split(split_indices_cpu))
```
How about using `Tensor.split()` rather than `Tensor.tensor_split()`? This would reuse the info in `forward_batch.global_num_tokens_gpu` and make the code more concise.
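A small sketch of the reviewer's point (assumes PyTorch; `global_num_tokens` here is a made-up stand-in for the per-rank token counts in `forward_batch.global_num_tokens_gpu`): `Tensor.tensor_split()` takes cut *indices*, which must first be derived by a cumulative sum, while `Tensor.split()` consumes the per-rank *sizes* directly.

```python
import torch

hidden = torch.arange(12.0).reshape(6, 2)  # 6 tokens, hidden dim 2
global_num_tokens = [1, 2, 3]              # hypothetical tokens per DP rank

# tensor_split() wants cumulative cut indices, so sizes must be converted:
indices = torch.cumsum(torch.tensor(global_num_tokens), dim=0)[:-1]
chunks_a = hidden.tensor_split(indices.tolist())

# split() takes the sizes as-is, so the counts could be reused directly:
chunks_b = hidden.split(global_num_tokens)

assert all(torch.equal(a, b) for a, b in zip(chunks_a, chunks_b))
```

Both calls produce the same chunks; the second just skips the index bookkeeping.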
```python
return get_attention_tp_group().all_gather(input_, tensor_list=output_list)
```

```python
def dp_reduce_scatter(
```
This function performs reduce-scatter across the global TP group. We may name it `tensor_model_parallel_reduce_scatter` and place it in a different file.
I think just `tp_reduce_scatter` is sufficient.
```python
if self.tp_size == self.dp_size:
    tensor_list = list(final_hidden_states.tensor_split(split_indices_cpu))
    final_hidden_states = tensor_list[self.dp_rank]
    dp_reduce_scatter(final_hidden_states, tensor_list)
```
`tensor_model_parallel_reduce_scatter`
```python
    ],
    hidden_states,
)
dp_copy(hidden_states, tmp_hidden_states, forward_batch)
```
If my understanding is correct, this line does the same thing as `dp_scatter`.
@EstherBear Thank you for your contribution. I have finished my review and left some comments. It appears that the PR does not pass some CIs (e.g., https://github.com/sgl-project/sglang/actions/runs/14876617018/job/41775058885?pr=5988). Could you please fix them? Also, you are highly encouraged to share your benchmark results, so we can clearly see the performance gain from your PR. For example, you can launch this to compare the efficiency:
Done in #8539 |
Motivation
This PR modifies the DeepSeek model to use reduce-scatter in place of separate reduce and scatter operations when data-parallel attention is enabled and dp == tp. By doing so, it exposes opportunities to use optimized reduce-scatter kernels for performance gains in the future.
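The equivalence the PR relies on can be sketched in a single process (assumes PyTorch; the per-rank tensors below are made up and stand in for each rank's partial MoE output): summing all contributions and then taking the local slice gives the same result as a reduce-scatter, which fuses the two steps and never materializes the full reduced tensor on any rank.

```python
import torch

world_size = 4
torch.manual_seed(0)
per_rank = [torch.randn(8, 2) for _ in range(world_size)]  # one partial result per rank

# all-reduce (sum) then scatter: every rank first builds the full sum,
# then keeps only its own 2-row slice.
full_sum = torch.stack(per_rank).sum(dim=0)
scattered = list(full_sum.split(8 // world_size))

# reduce-scatter fuses the two steps: each rank directly receives the sum
# of everyone's contribution to its slice only.
def simulated_reduce_scatter(tensors, rank, world_size):
    chunk = tensors[0].shape[0] // world_size
    return sum(t[rank * chunk:(rank + 1) * chunk] for t in tensors)

for r in range(world_size):
    assert torch.allclose(scattered[r], simulated_reduce_scatter(per_rank, r, world_size))
```

In a real deployment the fused step would be a single collective (e.g. `torch.distributed.reduce_scatter`), which is where the kernel-level savings come from.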
Modifications
Modified `CudaGraphRunner` and `ForwardBatch` to include the necessary metadata for the reduce-scatter operations.

Checklist