
Opt tp: tp attn support tp reduce scattered input #10568

Merged
ch-wan merged 6 commits into sgl-project:main from antgroup:xyf/qkv_opt on Nov 15, 2025

Conversation


@xu-yfei xu-yfei commented Sep 17, 2025

Motivation

In H20 (96GB) TP8 prefill, optimize the original combined operation:
embed/mlp all-reduce + RMSNorm + fused_qkv_a_proj_with_mqa
into:
embed/mlp reduce-scatter + RMSNorm + fused_qkv_a_proj_with_mqa + all-gather

Use the switch --enable-attn-tp-input-scattered to enable this feature.


This optimization primarily brings the following improvements:

  1. Computation and memory reduction: the amount of data that RMSNorm and fused_qkv_a_proj_with_mqa need to process is reduced to 1/8 of the original.
  2. Communication pattern optimization: the all-reduce is decomposed into reduce-scatter + all-gather. During the all-gather phase, after fused_qkv_a_proj_with_mqa, the last dimension shrinks from 7168 to 1536 + 512 + 64 = 2112, significantly reducing the communication data volume.
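As a rough illustration of the second point (this is not code from the PR; it assumes textbook ring-collective costs, where an all-reduce moves about 2*(N-1)/N of the tensor per rank while reduce-scatter and all-gather each move about (N-1)/N of theirs), the per-token traffic can be estimated as:

```python
N = 8                    # TP world size on H20
HIDDEN = 7168            # hidden size entering RMSNorm
QKV_A = 1536 + 512 + 64  # fused_qkv_a_proj_with_mqa output dim per token

def ring_cost(elems: int, world: int) -> float:
    """Per-rank elements moved by one ring collective over `elems` elements."""
    return (world - 1) / world * elems

# Before: a single all-reduce over the full hidden states.
before = 2 * ring_cost(HIDDEN, N)
# After: reduce-scatter over hidden states + all-gather over the projected dim.
after = ring_cost(HIDDEN, N) + ring_cost(QKV_A, N)

print(f"per-token elements moved: before={before:.0f}, after={after:.0f}")
print(f"after/before ratio: {after / before:.3f}")
```

Under these assumptions the decomposition moves roughly 0.647x the data of the original all-reduce, before even counting the 1/8 compute reduction.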

Based on the performance sampling data for 16K chunked prefill, the effects after optimization are as follows:

  • The latency of fused_qkv_a_proj_with_mqa decreased from 205.1 ms to 26.14 ms.

  • The total latency of communication decreased from 267.1 ms to 249.63 ms.

  • The total latency of RMSNorm decreased from 82.303 ms to 43.398 ms.

After: (profiler timeline screenshot)

Before: (profiler timeline screenshot)

Modifications

Accuracy Tests

#gsm8k
Accuracy: 0.955
Invalid: 0.000
Latency: 271.702 s
Output throughput: 467.895 token/s
#mmlu
subject: abstract_algebra, #q:100, acc: 0.760
subject: anatomy, #q:135, acc: 0.844
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.890
subject: clinical_knowledge, #q:265, acc: 0.928
subject: college_biology, #q:144, acc: 0.965
subject: college_chemistry, #q:100, acc: 0.620
subject: college_computer_science, #q:100, acc: 0.830
subject: college_mathematics, #q:100, acc: 0.760
subject: college_medicine, #q:173, acc: 0.873
subject: college_physics, #q:102, acc: 0.804
subject: computer_security, #q:100, acc: 0.890
subject: conceptual_physics, #q:235, acc: 0.936
subject: econometrics, #q:114, acc: 0.763
subject: electrical_engineering, #q:145, acc: 0.869
subject: elementary_mathematics, #q:378, acc: 0.939
subject: formal_logic, #q:126, acc: 0.802
subject: global_facts, #q:100, acc: 0.670
subject: high_school_biology, #q:310, acc: 0.952
subject: high_school_chemistry, #q:203, acc: 0.857
subject: high_school_computer_science, #q:100, acc: 0.940
subject: high_school_european_history, #q:165, acc: 0.885
subject: high_school_geography, #q:198, acc: 0.965
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.926
subject: high_school_mathematics, #q:270, acc: 0.748
subject: high_school_microeconomics, #q:238, acc: 0.962
subject: high_school_physics, #q:151, acc: 0.834
subject: high_school_psychology, #q:545, acc: 0.971
subject: high_school_statistics, #q:216, acc: 0.861
subject: high_school_us_history, #q:204, acc: 0.961
subject: high_school_world_history, #q:237, acc: 0.949
subject: human_aging, #q:223, acc: 0.852
subject: human_sexuality, #q:131, acc: 0.939
subject: international_law, #q:121, acc: 0.942
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.933
subject: machine_learning, #q:112, acc: 0.786
subject: management, #q:103, acc: 0.932
subject: marketing, #q:234, acc: 0.944
subject: medical_genetics, #q:100, acc: 0.940
subject: miscellaneous, #q:783, acc: 0.951

Benchmarking and Profiling

```shell
export SGL_ENABLE_JIT_DEEPGEMM=1
export TORCHINDUCTOR_CACHE_DIR=/home/admin/inductor_root_cache
export SGLANG_TORCH_PROFILER_DIR=/home/admin/torch_profiler
export SGL_CHUNKED_PREFIX_CACHE_USE_TUNED=1
model_path=/home/deepseek-ai__DeepSeek-R1

python3 -m sglang.launch_server --model-path $model_path \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --enable-cache-report --quantization fp8 --log-level info --max-running-requests 32 \
    --mem-fraction-static 0.92 --chunked-prefill-size 16384 --context-length 65535 --chat-template /home/r1.jinja \
    --attention-backend fa3 \
    --disable-radix-cache \
    --tp-size 8 --enable-metrics --cuda-graph-max-bs 32

input_len=1000  # 2000, 4000
python3 -m sglang.bench_serving --backend sglang --dataset-name random \
    --random-input ${input_len} --random-output 1 --request-rate 1000 \
    --num-prompt 500 --random-range-ratio 1 --max-concurrency 16 --port 8000 \
    --dataset-path /home/ShareGPT_V3_unfiltered_cleaned_split.json
```

Request throughput (req/s):

| Input Length | Before PR | After PR |
|---|---|---|
| 1000 | 12.82 | 14.22 |
| 2000 | 6.52 | 7.33 |
| 4000 | 2.49 | 2.72 |
| 4096 | 2.41 | 2.63 |

Checklist

Summary by CodeRabbit

  • New Features

    • Added support for tensor-parallel scattered attention, enabling more efficient multi-GPU attention workflows.
    • Introduced a fused QKV projection path with automatic gathering when required.
    • Added an option to skip reduction in vocab-parallel embeddings for advanced workflows.
  • Performance

    • Reduced communication overhead in attention and embedding stages, improving scalability on tensor-parallel setups.
  • Bug Fixes

    • Prevented potential hangs in fused paths when processing empty micro-batches.
  • Refactor

    • Exposed additional runtime context to better coordinate tensor-parallel operations.

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @xu-yfei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization for the prefill phase of large language models, specifically targeting H20 (96GB) TP8 configurations. By refactoring the communication pattern around attention and MLP layers from a single all-reduce to a reduce-scatter followed by an all-gather, the changes drastically reduce the amount of data processed by intermediate layers and minimize communication overhead. This leads to notable improvements in overall latency and computational efficiency.

Highlights

  • Optimized Communication Pattern: The core optimization replaces a single all reduce operation with a reduce scatter followed by an all gather for embed/mlp and fused_qkv_a_proj_with_mqa in H20 (96GB) TP8 prefill. This reduces the data processed by RMSNorm and fused_qkv_a_proj_with_mqa to 1/8th of the original and significantly decreases communication data volume.
  • Performance Improvements: Benchmarking shows substantial latency reductions: fused_qkv_a_proj_with_mqa decreased from 205.1 ms to 26.14 ms, total communication from 267.1 ms to 249.63 ms, and RMSNorm from 82.303 ms to 43.398 ms for an input length of 4000.
  • Flexible Attention Input Handling: Introduced attn_input_tp_scattered flags and logic to dynamically determine whether attention inputs should be scattered across tensor parallel ranks, allowing for more efficient processing in specific configurations.
  • New Communication Primitives: Added a tp_all_gather function to facilitate the new communication pattern and a _tp_reduce_scatter_or_all_reduce method within the LayerCommunicator to manage the conditional application of reduce-scatter or all-reduce operations.
  • DeepseekV2 Model Integration: The DeepseekV2 model's attention and decoder layers have been updated to leverage this new communication strategy, with helper functions to enable the optimization under specific conditions (e.g., q_lora_rank and forward mode).
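The equivalence this communication pattern relies on can be checked with a tiny single-process simulation. Plain Python lists stand in for torch.distributed collectives here, and the helper names are illustrative, not sglang's:

```python
TP = 4  # simulated tensor-parallel world size

def tensor_split_sizes(n: int, k: int) -> list[int]:
    # torch.tensor_split-style sizing: the first n % k shards get one extra
    # element, so uneven lengths need no padding.
    base, extra = divmod(n, k)
    return [base + 1] * extra + [base] * (k - extra)

def all_reduce(per_rank):
    # Every rank ends with the elementwise sum across ranks.
    return [sum(vals) for vals in zip(*per_rank)]

def reduce_scatter(per_rank):
    # Each rank keeps only its shard of the summed result.
    total = all_reduce(per_rank)
    sizes = tensor_split_sizes(len(total), len(per_rank))
    shards, start = [], 0
    for s in sizes:
        shards.append(total[start:start + s])
        start += s
    return shards

def all_gather(shards):
    # Concatenate every rank's shard back into the full tensor.
    return [x for shard in shards for x in shard]

# Length 10 is deliberately not divisible by TP=4.
per_rank = [[r + t for t in range(10)] for r in range(TP)]
assert all_gather(reduce_scatter(per_rank)) == all_reduce(per_rank)
print("reduce_scatter + all_gather matches all_reduce")
```

The assert holds for any shard sizes, which is why the decomposition is numerically equivalent to the original all-reduce; the win comes from doing RMSNorm and the projection on 1/TP of the tokens between the two collectives.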

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a commendable optimization for tensor parallelism by replacing all_reduce with a reduce_scatter + all_gather pattern. This change, primarily affecting communicator.py and deepseek_v2.py, aims to reduce communication overhead and memory usage. While the optimization strategy is sound, my review identified a critical correctness issue in the new communication logic within communicator.py concerning the handling of scattered residuals. This flaw is likely to produce incorrect model outputs and needs to be addressed. I have also provided suggestions for a type hint correction and a refactoring opportunity in deepseek_v2.py to improve code maintainability.

@xu-yfei xu-yfei force-pushed the xyf/qkv_opt branch 3 times, most recently from 4ed160a to c357dd0 Compare September 25, 2025 07:28
@xu-yfei xu-yfei changed the title Opt fused_qkv_a_proj_with_mqa: tp attn support tp reduce scattered input Opt tp: tp attn support tp reduce scattered input Sep 25, 2025
@whybeyoung

Can you resolve the conflicts?


xu-yfei commented Sep 29, 2025

Can you resolve the conflicts?

done


miter6 commented Sep 29, 2025

Did you test uneven input ids?
There will be a crash when the number of ids is not a power of 2.


xu-yfei commented Sep 29, 2025

Did you test uneven input ids? There will be a crash when the number of ids is not a power of 2.

Both even and odd lengths are very common, and I have verified that they work correctly. What is your current scenario? TP8? What is the specific error?


miter6 commented Sep 29, 2025

The input sequence is [1023, 7168] with TP 8.


miter6 commented Sep 29, 2025

I have implemented a similar feature locally.
We must pad the sequence to fit the tp-size.


xu-yfei commented Sep 29, 2025

I have implemented a similar feature locally. We must pad the sequence to fit the tp-size.

```python
>>> a = torch.randn((1023, 7168), dtype=torch.bfloat16, device="cuda")
>>> b = a.tensor_split(8)
>>> b[0].shape
torch.Size([128, 7168])
>>> b[-1].shape
torch.Size([127, 7168])
```
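For reference, the shard sizes tensor_split chooses for this case can be reproduced with a few lines of plain Python (a sketch of the sizing rule only, no torch or GPU required):

```python
def tensor_split_sizes(n: int, k: int) -> list[int]:
    # torch.tensor_split-style sizing: the first n % k shards get one extra
    # row, so uneven sequence lengths need no padding.
    base, extra = divmod(n, k)
    return [base + 1] * extra + [base] * (k - extra)

sizes = tensor_split_sizes(1023, 8)
print(sizes)  # [128, 128, 128, 128, 128, 128, 128, 127]
```

This matches the REPL output above: seven shards of 128 rows and one of 127, summing to 1023 without padding.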


xu-yfei commented Oct 9, 2025

@merrymercy @yizhang2077 Could you please help review this PR?


xu-yfei commented Nov 10, 2025

@ch-wan I've updated a version according to the review comments. Could you please review it?

@xu-yfei xu-yfei force-pushed the xyf/qkv_opt branch 3 times, most recently from 15ba980 to 2febc76 Compare November 12, 2025 07:19
@ch-wan ch-wan left a comment


It's much better than the original version. Thank you for your continuous efforts.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 15, 2025
@ch-wan ch-wan merged commit d91b16e into sgl-project:main Nov 15, 2025
53 of 66 checks passed

Labels

deepseek documentation Improvements or additions to documentation run-ci


6 participants