
DeepSeek V4: support DP attention when DP size differs from TP size #24952

Open
zhangxiaolei123456 wants to merge 10 commits into
sgl-project:mainfrom
bytedance-iaas:main_deepseek_dp_attention

@zhangxiaolei123456 zhangxiaolei123456 commented May 11, 2026

Motivation

The DeepSeek V4 branch PR is #23933.

Modifications

Accuracy Tests

SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 2  --enable-dp-attention --cuda-graph-max-bs 8 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 0 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4

SGLANG_SHARED_EXPERT_TP1=1 SGLANG_ENABLE_THINKING=1 SGLANG_DSV4_FP4_EXPERTS=1 SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 GLOO_SOCKET_IFNAME=eth0 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 sglang serve --trust-remote-code --model-path /data00/models/DeepSeek-V4-Pro --tp 16 --dp-size 2 --enable-dp-attention --cuda-graph-max-bs 8 --max-running-requests 16 --enable-metrics --host 0.0.0.0 --port 8080 --mem-fraction-static 0.9 --moe-runner-backend marlin --dist-init-addr 192.168.3.198:30300 --nnodes 2 --node-rank 1 --tool-call-parser deepseekv4 --reasoning-parser deepseek-v4
### MMLU
python3 bench_sglang.py --parallel 128 --backend srt --host http://127.0.0.1 --port 8080 --data_dir /data00/mmlu
100%|████████████████████████████████████████████| 14042/14042 [23:40<00:00,  9.88it/s]
subject: abstract_algebra, #q:100, acc: 0.830
subject: anatomy, #q:135, acc: 0.904
subject: astronomy, #q:152, acc: 0.934
subject: business_ethics, #q:100, acc: 0.870
subject: clinical_knowledge, #q:265, acc: 0.928
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.710
subject: college_computer_science, #q:100, acc: 0.930
subject: college_mathematics, #q:100, acc: 0.840
subject: college_medicine, #q:173, acc: 0.890
subject: college_physics, #q:102, acc: 0.971
subject: computer_security, #q:100, acc: 0.870
subject: conceptual_physics, #q:235, acc: 0.953
subject: econometrics, #q:114, acc: 0.851
subject: electrical_engineering, #q:145, acc: 0.910
subject: elementary_mathematics, #q:378, acc: 0.963
subject: formal_logic, #q:126, acc: 0.802
subject: global_facts, #q:100, acc: 0.780
subject: high_school_biology, #q:310, acc: 0.965
subject: high_school_chemistry, #q:203, acc: 0.897
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.903
subject: high_school_geography, #q:198, acc: 0.955
subject: high_school_government_and_politics, #q:193, acc: 0.995
subject: high_school_macroeconomics, #q:390, acc: 0.944
subject: high_school_mathematics, #q:270, acc: 0.844
subject: high_school_microeconomics, #q:238, acc: 0.975
subject: high_school_physics, #q:151, acc: 0.921
subject: high_school_psychology, #q:545, acc: 0.972
subject: high_school_statistics, #q:216, acc: 0.917
subject: high_school_us_history, #q:204, acc: 0.936
subject: high_school_world_history, #q:237, acc: 0.958
subject: human_aging, #q:223, acc: 0.874
subject: human_sexuality, #q:131, acc: 0.893
subject: international_law, #q:121, acc: 0.967
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.920
subject: machine_learning, #q:112, acc: 0.902
subject: management, #q:103, acc: 0.981
subject: marketing, #q:234, acc: 0.966
subject: medical_genetics, #q:100, acc: 0.970
subject: miscellaneous, #q:783, acc: 0.963
subject: moral_disputes, #q:346, acc: 0.879
subject: moral_scenarios, #q:895, acc: 0.848
subject: nutrition, #q:306, acc: 0.928
subject: philosophy, #q:311, acc: 0.923
subject: prehistory, #q:324, acc: 0.954
subject: professional_accounting, #q:282, acc: 0.890
subject: professional_law, #q:1534, acc: 0.745
subject: professional_medicine, #q:272, acc: 0.941
subject: professional_psychology, #q:612, acc: 0.922
subject: public_relations, #q:110, acc: 0.809
subject: security_studies, #q:245, acc: 0.890
subject: sociology, #q:201, acc: 0.950
subject: us_foreign_policy, #q:100, acc: 0.940
subject: virology, #q:166, acc: 0.578
subject: world_religions, #q:171, acc: 0.924
Total latency: 1420.867
Average accuracy: 0.896
### GSM8K
python3 bench_sglang.py --host http://localhost  --port 8080 --data-path /data00 --num-questions 5000 --parallel 100
100%|██████████████████████████████████████████████| 1319/1319 [07:59<00:00,  2.75it/s]
Accuracy: 0.952
Invalid: 0.000
Latency: 479.254 s
Output throughput: 252.254 token/s

@Fridge003

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refines the tensor parallelism (TP) logic in the DeepseekV4 model, specifically adjusting how results are reduced and gathered across TP groups. It modifies the reduce_results condition in RowParallelLinear, introduces explicit all-reduce calls for partial TP scenarios, and replaces dp_gather_partial with dp_gather_replicate to correctly handle replicated activations. Review feedback suggests that removing a safety assertion for zero-sized tensors could lead to hangs and recommends simplifying a redundant boolean expression in the reduce_results assignment.
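To make the partial-TP case concrete, here is a minimal sketch of the group sizes implied by the launch command in the PR description (`--tp 16 --dp-size 2`). It assumes that, with DP attention enabled, the attention TP size is the TP world size divided by the DP size; the thread does not state this explicitly, and the helper name `attn_tp_layout` is hypothetical.

```python
def attn_tp_layout(tp_size: int, dp_size: int) -> dict:
    """Hypothetical helper: derive attention-TP group facts from launch flags.

    Assumption (not stated in the PR): with DP attention enabled, each DP
    replica runs attention over tp_size // dp_size ranks.
    """
    if tp_size % dp_size != 0:
        raise ValueError("tp_size must be divisible by dp_size")
    attn_tp_size = tp_size // dp_size
    return {
        "attn_tp_size": attn_tp_size,
        # Condition from the PR diff: reduce inside the linear layer only when
        # the attention TP group spans the whole TP group.
        "reduce_in_linear": attn_tp_size == tp_size and attn_tp_size > 1,
        # Partial TP: the attention group is a strict, non-trivial subset.
        "partial_tp": 1 < attn_tp_size < tp_size,
    }


layout = attn_tp_layout(tp_size=16, dp_size=2)
print(layout)
```

Under that assumption, the tested configuration falls in the partial-TP case, where the linear layer no longer reduces its own output and the reduction has to be issued explicitly.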

Comment on lines 516 to 517
    if not get_attn_tp_context().input_scattered and x.shape[0] == 0:
        assert (
            not self.wo_b.reduce_results
        ), "short-circuiting allreduce will lead to hangs"
        return x

high

The removal of the assertion that checks self.wo_b.reduce_results could lead to hangs if not handled carefully. The original assertion assert (not self.wo_b.reduce_results) was a safeguard against short-circuiting an all-reduce operation when x.shape[0] == 0.

While the logic for input_scattered seems to cover the DP case, it's safer to retain a check to prevent potential hangs in non-DP scenarios where token distribution might be uneven across ranks, or if all ranks have zero tokens but reduce_results is true. A more robust approach would be to handle the zero-sized tensor case within the RowParallelLinear layer itself, but as a direct fix, consider reintroducing a check or ensuring that x.shape[0] == 0 implies all ranks have zero tokens when reduce_results is true.
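The hang described above can be illustrated without GPUs. The following is a toy simulation, not SGLang code: threads stand in for ranks and a `threading.Barrier` stands in for the all-reduce, since both need every participant to arrive. The function name `run_ranks` and the timeout are illustrative assumptions.

```python
import threading


def run_ranks(tokens_per_rank, short_circuit_empty, timeout=0.5):
    """Simulate one collective call per rank and report what each rank saw."""
    world = len(tokens_per_rank)
    barrier = threading.Barrier(world)  # stand-in for an all-reduce
    results = [None] * world

    def rank_fn(rank):
        # Early return on an empty batch, mirroring `return x` in the snippet.
        if short_circuit_empty and tokens_per_rank[rank] == 0:
            results[rank] = "skipped"
            return
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "done"
        except threading.BrokenBarrierError:
            # A real collective would block indefinitely; the timeout only
            # exists so this demo terminates.
            results[rank] = "hung"

    threads = [threading.Thread(target=rank_fn, args=(r,)) for r in range(world)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


uneven = run_ranks([4, 0], short_circuit_empty=True)
all_empty = run_ranks([0, 0], short_circuit_empty=True)
no_skip = run_ranks([4, 0], short_circuit_empty=False)
print(uneven, all_empty, no_skip)
```

In this model the early return is harmless only when every rank takes it together; with an uneven token distribution, the rank that still enters the collective waits for a peer that never arrives.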

Comment thread python/sglang/srt/models/deepseek_v4.py Outdated
  bias=False,
  quant_config=quant_config,
- reduce_results=attn_tp_size > 1,
+ reduce_results=attn_tp_size == get_tensor_model_parallel_world_size() and attn_tp_size > 1,

medium

The trailing clause "and attn_tp_size > 1" in this condition is redundant.

  • If get_tensor_model_parallel_world_size() is 1, attn_tp_size is also 1, making attn_tp_size > 1 false. The whole expression is false.
  • If get_tensor_model_parallel_world_size() is greater than 1, and attn_tp_size == get_tensor_model_parallel_world_size(), then attn_tp_size > 1 is implicitly true.

The RowParallelLinear layer already checks if tp_size > 1 before performing an all-reduce, so setting reduce_results=True when tp_size=1 has no effect. Simplifying this condition improves clarity.
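The argument above can be checked by brute force. This is a small sketch under the review's stated premise that the layer only all-reduces when its TP size is greater than 1; `effective_all_reduce` and `compare` are hypothetical names that model that guard, not SGLang functions, and the sketch assumes the attention TP size divides the TP world size.

```python
def effective_all_reduce(reduce_results: bool, tp_size: int) -> bool:
    """Model of the guard described in the review: reduce only if tp_size > 1."""
    return reduce_results and tp_size > 1


def compare(world_size: int, attn_tp_size: int):
    """Return the effective behavior of the original and simplified conditions."""
    original = attn_tp_size == world_size and attn_tp_size > 1
    simplified = attn_tp_size == world_size
    return (
        effective_all_reduce(original, attn_tp_size),
        effective_all_reduce(simplified, attn_tp_size),
    )


mismatches = []
for world_size in (1, 2, 4, 8, 16):
    divisors = [s for s in range(1, world_size + 1) if world_size % s == 0]
    for attn_tp_size in divisors:
        a, b = compare(world_size, attn_tp_size)
        if a != b:
            mismatches.append((world_size, attn_tp_size))
print(mismatches)
```

The two flag values do differ as plain booleans when both sizes are 1, but the modeled guard makes that difference unobservable, which is the point of the suggestion.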

Suggested change
- reduce_results=attn_tp_size == get_tensor_model_parallel_world_size() and attn_tp_size > 1,
+ reduce_results=attn_tp_size == get_tensor_model_parallel_world_size(),

@zhangxiaolei123456

/tag-and-rerun-ci
