Support reduce-scatter when dp=tp #5988
Conversation
Hi @ch-wan, is this ready to merge or should I do any modifications?

Hi @ch-wan, any updates?

@EstherBear Sorry for the late response. I'm going to review it tonight.
```python
if self.tp_size > 1:
    final_hidden_states = tensor_model_parallel_all_reduce(final_hidden_states)
if self.tp_size == self.dp_size:
    tensor_list = list(final_hidden_states.tensor_split(split_indices_cpu))
```
How about using `Tensor.split()` rather than `Tensor.tensor_split()`? This would reuse the info in `forward_batch.global_num_tokens_gpu` and make the code more concise.
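A small sketch of the reviewer's point (assumes PyTorch; `global_num_tokens` here is a made-up stand-in for the per-rank token counts in `forward_batch.global_num_tokens_gpu`): `Tensor.tensor_split()` takes cut *indices*, which must first be derived by a cumulative sum, while `Tensor.split()` consumes the per-rank *sizes* directly.

```python
import torch

hidden = torch.arange(12.0).reshape(6, 2)  # 6 tokens, hidden dim 2
global_num_tokens = [1, 2, 3]              # hypothetical tokens per DP rank

# tensor_split() wants cumulative cut indices, so sizes must be converted:
indices = torch.cumsum(torch.tensor(global_num_tokens), dim=0)[:-1]
chunks_a = hidden.tensor_split(indices.tolist())

# split() takes the sizes as-is, so the counts could be reused directly:
chunks_b = hidden.split(global_num_tokens)

assert all(torch.equal(a, b) for a, b in zip(chunks_a, chunks_b))
```

Both calls produce the same chunks; the second just skips the index bookkeeping.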
```python
return get_attention_tp_group().all_gather(input_, tensor_list=output_list)
```

```python
def dp_reduce_scatter(
```
This function performs reduce-scatter across the global TP group. We may name it `tensor_model_parallel_reduce_scatter` and place it in a different file.
I think just `tp_reduce_scatter` is sufficient.
```python
if self.tp_size == self.dp_size:
    tensor_list = list(final_hidden_states.tensor_split(split_indices_cpu))
    final_hidden_states = tensor_list[self.dp_rank]
    dp_reduce_scatter(final_hidden_states, tensor_list)
```
`tensor_model_parallel_reduce_scatter`
```python
    ],
    hidden_states,
)
dp_copy(hidden_states, tmp_hidden_states, forward_batch)
```
If my understanding is correct, this line does the same thing as `dp_scatter`.
@EstherBear Thank you for your contribution. I have finished my review and left some comments. It appears that the PR does not pass some CIs (e.g., https://github.com/sgl-project/sglang/actions/runs/14876617018/job/41775058885?pr=5988). Could you please fix them? Also, you are highly encouraged to share your benchmark results, so we can clearly see the performance gain from your PR. For example, you can launch this to compare the efficiency:
Done in #8539 |
Motivation
This PR modifies the DeepSeek model to use reduce-scatter in place of separate reduce and scatter operations when data-parallel attention is enabled and dp == tp. By doing so, it exposes opportunities to use optimized reduce-scatter kernels for performance gains in the future.
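The equivalence the PR relies on can be sketched in a single process (assumes PyTorch; the per-rank tensors below are made up and stand in for each rank's partial MoE output): summing all contributions and then taking the local slice gives the same result as a reduce-scatter, which fuses the two steps and never materializes the full reduced tensor on any rank.

```python
import torch

world_size = 4
torch.manual_seed(0)
per_rank = [torch.randn(8, 2) for _ in range(world_size)]  # one partial result per rank

# all-reduce (sum) then scatter: every rank first builds the full sum,
# then keeps only its own 2-row slice.
full_sum = torch.stack(per_rank).sum(dim=0)
scattered = list(full_sum.split(8 // world_size))

# reduce-scatter fuses the two steps: each rank directly receives the sum
# of everyone's contribution to its slice only.
def simulated_reduce_scatter(tensors, rank, world_size):
    chunk = tensors[0].shape[0] // world_size
    return sum(t[rank * chunk:(rank + 1) * chunk] for t in tensors)

for r in range(world_size):
    assert torch.allclose(scattered[r], simulated_reduce_scatter(per_rank, r, world_size))
```

In a real deployment the fused step would be a single collective (e.g. `torch.distributed.reduce_scatter`), which is where the kernel-level savings come from.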
Modifications
Modified `CudaGraphRunner` and `ForwardBatch` to include the necessary metadata for the reduce-scatter operations.

Checklist