[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios #2167
wangxiyuan merged 8 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
```python
else:
    self.register_parameter("bias", None)

def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
```
This function seems to be identical with that of RowParallelLinear, why do we need to rewrite it here?
In the original weight_loader:

```python
tp_rank = get_tensor_model_parallel_rank()
tp_size = get_tensor_model_parallel_world_size()
```

We need to replace these with the custom comm group:

```python
tp_rank = self.tp_rank
tp_size = self.tp_size
```

It seems that the latest vLLM does not have this problem.
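A minimal sketch of the point above, with illustrative names (this is not the actual vllm-ascend class): a layer bound to a custom communication group must shard loaded weights by its own `tp_rank`/`tp_size` rather than the global TP rank returned by `get_tensor_model_parallel_rank()`.

```python
# Illustrative sketch only: a row-parallel layer that shards weights using
# the rank/size of a *custom* comm group stored on the layer itself.
import torch
from torch.nn import Parameter


class CustomGroupRowParallelLinear:
    def __init__(self, input_size: int, output_size: int,
                 tp_rank: int, tp_size: int):
        # Rank/size come from the custom comm group (e.g. the oproj TP
        # group), not from the global tensor-parallel group.
        self.tp_rank = tp_rank
        self.tp_size = tp_size
        assert input_size % tp_size == 0
        self.input_size_per_partition = input_size // tp_size
        self.weight = Parameter(
            torch.empty(output_size, self.input_size_per_partition))

    def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
        # Slice the full checkpoint weight along the input dimension using
        # this layer's own rank, instead of the global TP rank.
        shard = self.input_size_per_partition
        start = self.tp_rank * shard
        param.data.copy_(loaded_weight[:, start:start + shard])
```

With `tp_size=4`, rank 1 takes columns `[2:4)` of an 8-column weight; the global TP rank never enters the computation.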
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 71c3e49 to b925b4b
```python
else:
    tp_rank = get_tensor_model_parallel_rank()
else:
    tp_rank = 0
```
The original code here is:

```python
if isinstance(layer, RowParallelLinear):
    tp_rank = get_tensor_model_parallel_rank()
    return self.quant_method.apply(layer, x, bias, tp_rank)
return self.quant_method.apply(layer, x, bias)
```

In the default case, tp_rank is simply not passed, which amounts to tp_rank = 0.
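The dispatch described above can be sketched as follows (stub classes, illustrative names only): only `RowParallelLinear` forwards the real TP rank, every other layer type falls through to the default, which behaves like rank 0.

```python
# Illustrative stubs, not the real vLLM classes.
class RowParallelLinear: ...
class ColumnParallelLinear: ...


def get_tensor_model_parallel_rank() -> int:
    # Pretend this process is TP rank 3 for demonstration.
    return 3


def resolve_tp_rank(layer) -> int:
    # Mirror of the quoted logic: only row-parallel layers pass their
    # TP rank on to quant_method.apply(); all others default to 0.
    if isinstance(layer, RowParallelLinear):
        return get_tensor_model_parallel_rank()
    return 0
```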
```
[mypy-numpy.*]
ignore_missing_imports = True
```
Why do we need to update these configurations? If this is a bug in the repo, I suggest creating a separate PR to fix it.

Alright, I will remove this; it was just for some local CI checks.
```python
if oproj_tp_enable():
    self.o_proj = RowParallelLinear(self.num_heads * self.v_head_dim,
                                    self.hidden_size,
                                    bias=False,
                                    quant_config=quant_config,
                                    prefix=f"{prefix}.o_proj")
elif (config.n_routed_experts is not None
      and self.debug_layer_idx >= config.first_k_dense_replace
      and self.debug_layer_idx % config.moe_layer_freq == 0
      and (ascend_config.torchair_graph_config.enable_multistream_moe
           or self.enable_shared_expert_dp)):
    self.o_proj = TorchairDeepseekV2RowParallelLinearReplaceAllreduce(
        self.num_heads * self.v_head_dim,
        self.hidden_size,
```
Is it still not possible to eliminate these if-else branches even with CustomOp? @wangxiyuan @Yikun
```python
if prefix.find("down_proj") != -1 and mlp_tp_enable():
    comm_group = get_mlp_tp_group()
    self.forward_type = "mlp_tp"
elif prefix.find("o_proj") != -1 and oproj_tp_enable():
    comm_group = get_otp_group()
    self.forward_type = "oproj_tp"
else:
    self.tp_size = get_tensor_model_parallel_world_size()
    self.tp_rank = get_tensor_model_parallel_rank()
    self.enable_mlp_optimze = False
    comm_group = get_tp_group()
    self.forward_type = "normal"
self.comm_group = comm_group
```
Is adding more if-else conditions the way to extend support for new models?
Force-pushed from 0e2fce6 to b1582b4
```python
                                  input_, num_partitions=self.tp_size)
input_parallel = splitted_input[self.tp_rank].contiguous()
assert self.quant_method is not None
# Choose different forward function according to the type of TP group
```
Using a dict here may be more extensible. The same logic applies as mentioned above.

I tried to modify it and found it not very intuitive, but I changed it to use super().forward() instead.
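For reference, the table-driven dispatch the reviewer suggested could look roughly like this (a sketch with hypothetical flag parameters, not the PR's actual code): adding a new TP variant then means adding one table entry instead of another `elif` branch.

```python
# Sketch of dict/table-based forward_type selection. The enable flags are
# passed in as plain booleans here for illustration; the real code queries
# mlp_tp_enable() / oproj_tp_enable().
def select_forward_type(prefix: str,
                        mlp_tp_enabled: bool,
                        oproj_tp_enabled: bool) -> str:
    # Each entry: (substring to match in the layer prefix, enable flag,
    # resulting forward_type).
    dispatch = [
        ("down_proj", mlp_tp_enabled, "mlp_tp"),
        ("o_proj", oproj_tp_enabled, "oproj_tp"),
    ]
    for needle, enabled, forward_type in dispatch:
        if needle in prefix and enabled:
            return forward_type
    return "normal"
```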
```python
                      name="SiluAndMul")
CustomOp.register_oot(_decorated_op_cls=AscendRotaryEmbedding,
                      name="RotaryEmbedding")
CustomOp.register_oot(_decorated_op_cls=AscendColumnParallelLinear,
```
If this component is enabled by default, modifications to the original vLLM repository will require ongoing maintenance and updates. What is the long-term maintenance strategy for this?

I think there won't be many changes here; we just need to focus on maintaining the __init__ method in follow-ups.
Force-pushed from 4e860b4 to 0c35616
According to the comment #2678 (comment), please remove the patch_linear as well.

For the e2e LoRA error, please add patches like the others do here: https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/patch/worker/patch_common/patch_lora_embedding.py
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
@wangxiyuan This PR is ready, and it also fixes the bug related to LinearBase.
```
@@ -0,0 +1,15 @@
import vllm
```
Looks like these 3 files can be merged into one.
What this PR does / why we need it?

This PR introduces tensor model parallelism for the o_proj matrix to reduce memory consumption. It only supports graph mode in the pure-DP scenario.

In a DeepSeek R1 W8A8 PD-disaggregated Decode instance using pure DP with oproj_tensor_parallel_size = 8, we observed a 1 ms TPOT increase and saved 5.8 GB of NPU memory per rank. We got the best performance with oproj_tensor_parallel_size = 4, with no TPOT increase.
performance data:

Does this PR introduce any user-facing change?

This PR introduces one new config in additional_config.example:

--additional_config={"oproj_tensor_parallel_size": 8}

How was this patch tested?
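The new option rides in as JSON through the additional-config mechanism. A minimal sketch of how an engine might parse it (the key name follows the PR description; the default value shown is an assumption):

```python
# Sketch: parsing the oproj TP size out of the additional_config JSON.
import json

raw = '{"oproj_tensor_parallel_size": 8}'
cfg = json.loads(raw)

# Assumed default of 1 (i.e. oproj TP disabled) when the key is absent.
otp_size = cfg.get("oproj_tensor_parallel_size", 1)
oproj_tp_enabled = otp_size > 1
```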