Deepseek v4 support dp attention is not TP size#24952
Deepseek v4 support dp attention is not TP size#24952zhangxiaolei123456 wants to merge 10 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refines the tensor parallelism (TP) logic in the DeepseekV4 model, specifically adjusting how results are reduced and gathered across TP groups. It modifies the reduce_results condition in RowParallelLinear, introduces explicit all-reduce calls for partial TP scenarios, and replaces dp_gather_partial with dp_gather_replicate to correctly handle replicated activations. Review feedback suggests that removing a safety assertion for zero-sized tensors could lead to hangs and recommends simplifying a redundant boolean expression in the reduce_results assignment.
| if not get_attn_tp_context().input_scattered and x.shape[0] == 0: | ||
| assert ( | ||
| not self.wo_b.reduce_results | ||
| ), "short-circuiting allreduce will lead to hangs" | ||
| return x |
There was a problem hiding this comment.
The removal of the assertion that checks self.wo_b.reduce_results could lead to hangs if not handled carefully. The original assertion assert (not self.wo_b.reduce_results) was a safeguard against short-circuiting an all-reduce operation when x.shape[0] == 0.
While the logic for input_scattered seems to cover the DP case, it's safer to retain a check to prevent potential hangs in non-DP scenarios where token distribution might be uneven across ranks, or if all ranks have zero tokens but reduce_results is true. A more robust approach would be to handle the zero-sized tensor case within the RowParallelLinear layer itself, but as a direct fix, consider reintroducing a check or ensuring that x.shape[0] == 0 implies all ranks have zero tokens when reduce_results is true.
| bias=False, | ||
| quant_config=quant_config, | ||
| reduce_results=attn_tp_size > 1, | ||
| reduce_results=attn_tp_size == get_tensor_model_parallel_world_size() and attn_tp_size > 1, |
There was a problem hiding this comment.
The condition and attn_tp_size > 1 is redundant here.
- If
get_tensor_model_parallel_world_size()is 1,attn_tp_sizeis also 1, makingattn_tp_size > 1false. The whole expression is false. - If
get_tensor_model_parallel_world_size()is greater than 1, andattn_tp_size == get_tensor_model_parallel_world_size(), thenattn_tp_size > 1is implicitly true.
The RowParallelLinear layer already checks if tp_size > 1 before performing an all-reduce, so setting reduce_results=True when tp_size=1 has no effect. Simplifying this condition improves clarity.
| reduce_results=attn_tp_size == get_tensor_model_parallel_world_size() and attn_tp_size > 1, | |
| reduce_results=attn_tp_size == get_tensor_model_parallel_world_size(), |
|
/tag-and-rerun-ci |
Motivation
DeepSeekV4 branch PR is here: #23933
Modifications
Accuracy Tests
@Fridge003
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci,/tag-run-ci-label,/rerun-failed-ci