[0.9.1] Add LMhead TP communication groups. #1956
ganyi1996ppo merged 10 commits into vllm-project:v0.9.1-dev from
Conversation
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: angazenn <zengyanjia@huawei.com>
```python
if not with_prefill:
    padded_num_indices = num_tokens
else:
    padded_num_indices = max_num_reqs
```
Will padding here increase latency when the DP load is seriously uneven?
Yes, there might be performance degradation. However, in some cases (see `_get_forward_metadata_across_dp`) the all_reduce communication used for gathering metadata is skipped, so using the true `num_tokens_across_dp` would incur another all_reduce in those cases. Maybe we can find a better solution for this.
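As a rough sketch of the trade-off described above (names are taken from the diff; this is not the actual implementation):

```python
def pick_padded_num_indices(num_tokens: int, max_num_reqs: int,
                            with_prefill: bool) -> int:
    # Decode-only steps keep the true token count; prefill steps pad up to
    # max_num_reqs so every DP rank works on the same shape and the
    # metadata-gathering all_reduce can be skipped.
    return num_tokens if not with_prefill else max_num_reqs
```

The cost is extra padded work when the DP load is uneven; the benefit is one fewer all_reduce per step.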
```python
backend,
group_name="mc2")
```

```python
all_ranks = torch.arange(world_size).reshape(-1, lm_head_tp_size)
```
TODO: please create this parallel group only when running DeepSeek
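For illustration, the rank grouping built by `torch.arange(world_size).reshape(-1, lm_head_tp_size)` can be reproduced in plain Python (the helper name below is hypothetical):

```python
def lmhead_tp_groups(world_size: int, lm_head_tp_size: int) -> list[list[int]]:
    # Each row of the reshaped arange is one LMHead TP communication group,
    # e.g. world_size=8, lm_head_tp_size=4 -> [[0, 1, 2, 3], [4, 5, 6, 7]].
    assert world_size % lm_head_tp_size == 0
    return [list(range(start, start + lm_head_tp_size))
            for start in range(0, world_size, lm_head_tp_size)]
```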
Signed-off-by: zengyanjia <z00883269@china.huawei.com>
```python
False)  # Whether to enable DeepSeek models' prefill optimizations
self.enable_cpu_binding = additional_config.get(  # Whether to enable CPU binding
    "enable_cpu_binding", False)
self.lmhead_tp_size = additional_config.get("lmhead_tp_size", -1)
```
It would be better for the default value to be 1.
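For context, the fallback behavior described in the PR text (the default -1 resolves to `tensor_parallel_size`, and TP > 1 forces the lmhead group back to `tensor_parallel_size`) can be sketched as follows; the helper name is hypothetical, not the actual code:

```python
def resolve_lmhead_tp_size(lmhead_tp_size: int,
                           tensor_parallel_size: int) -> int:
    # -1 (the default) falls back to tensor_parallel_size; with TP > 1 the
    # lmhead group is also forced back to tensor_parallel_size so the normal
    # TP+DP path is used.
    if lmhead_tp_size == -1 or tensor_parallel_size > 1:
        return tensor_parallel_size
    return lmhead_tp_size
```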
What this PR does / why we need it?
In pure DP scenarios (such as DP32), LMHead computation takes 1~2 ms. In this PR we customize the parallelism of the LMHead, enabling a separate TP group for it. The computation flow is as follows:
This can save 0.5~1 ms for DeepSeek with 28 BS on a single die, with MTP.
In addition, this PR also fixes a bug introduced by LMHead quantization: the op
`npu_quant_matmul` only accepts dim < 65536, while `vocab_size` is > 65536 when using TP 1. We can set the lmhead TP size > 1 to avoid this bug. Main version of this PR: #2309.
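To see why lmhead TP > 1 side-steps the dim limit, note that each rank's LMHead shard only covers `vocab_size / lmhead_tp_size` columns. A small sketch (the 65536 limit is from the PR text; the ~129k vocabulary size is an illustrative assumption):

```python
NPU_QUANT_MATMUL_MAX_DIM = 65536  # limit stated in the PR description

def vocab_shard_size(vocab_size: int, lmhead_tp_size: int) -> int:
    # With lmhead TP, each rank's quantized matmul only sees its own
    # slice of the vocabulary dimension.
    return vocab_size // lmhead_tp_size

# e.g. an assumed ~129k vocabulary exceeds the limit at TP 1 but fits at TP 2:
assert vocab_shard_size(129280, 1) > NPU_QUANT_MATMUL_MAX_DIM
assert vocab_shard_size(129280, 2) < NPU_QUANT_MATMUL_MAX_DIM
```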
Does this PR introduce any user-facing change?
Yes. We introduced another configurable option, `lmhead_tp_size`, in `ascend_config`. For example:

The default value is -1, and `lmhead_tp_size` is automatically set to `tensor_parallel_size` in this case. Besides, it is suggested to use this option when running full DP, to avoid the additional communication introduced by TP. Therefore, if TP > 1, the parallel size of the `lmhead` group will also be changed to `tensor_parallel_size`, so as to fall back to the normal TP+DP case.

How was this patch tested?