Feat/dp attn 0514 #490
5ec497c into bytedance-iaas:main_deepseek_dp_attention
Code Review
This pull request updates the distributed communication logic in communicator.py and deepseek_v4.py to handle DP buffer group selection more dynamically. Specifically, it introduces a check to determine whether to use the tensor parallel group or the attention tensor parallel group based on the relationship between the TP world size and the attention DP size. Feedback was provided to ensure that the group selection logic in the DeepSeek-V4 model remains consistent with the implementation in the layer communicator to avoid potential issues with symmetric memory allocation.
    hidden_states, global_hidden_states = (
        get_local_dp_buffer(get_attention_tp_group()),
        hidden_states,
    )
For consistency with the logic in LayerCommunicator and to ensure correct symmetric memory allocation when the tensor parallel size equals the attention data parallel size, the group for the local DP buffer should be selected based on whether tp_size == dp_size. This pattern is followed in communicator.py for all dp_scatter operations.
Suggested change

    - hidden_states, global_hidden_states = (
    -     get_local_dp_buffer(get_attention_tp_group()),
    -     hidden_states,
    - )
    + if get_tensor_model_parallel_world_size() == get_attention_dp_size():
    +     group = get_tp_group()
    + else:
    +     group = get_attention_tp_group()
    + hidden_states, global_hidden_states = (
    +     get_local_dp_buffer(group),
    +     hidden_states,
    + )
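The branch in the suggested change can be isolated into a small helper so that the model file and the layer communicator share one decision point. The following is only a minimal sketch under stated assumptions: `FakeGroup` and `select_dp_buffer_group` are hypothetical names that do not exist in the repository, and the group sizes in the example (attention TP size taken as TP size divided by DP size) are illustrative, not confirmed by this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeGroup:
    """Hypothetical stand-in for a process-group handle (not a real repository type)."""
    name: str
    world_size: int


def select_dp_buffer_group(tp_world_size, attn_dp_size, tp_group, attn_tp_group):
    """Pick the group to hand to the local DP buffer allocation.

    Mirrors the suggested change: use the full TP group when the TP world
    size equals the attention DP size, otherwise the attention TP group.
    """
    if tp_world_size == attn_dp_size:
        return tp_group
    return attn_tp_group


tp = FakeGroup("tp", 8)

# Assumed sizes for illustration only: attention TP size = tp_size // dp_size.
equal_case = select_dp_buffer_group(8, 8, tp, FakeGroup("attn_tp", 1))
split_case = select_dp_buffer_group(8, 2, tp, FakeGroup("attn_tp", 4))

print(equal_case.name)
print(split_case.name)
```

Routing both call sites through one such helper would keep the two files from drifting apart, which is the consistency concern raised in the comment.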