[feat]: oproj tensor parallelism in pure DP and graph-mode scenarios #2167
wangxiyuan merged 8 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
```python
else:
    self.register_parameter("bias", None)

def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
```
This function seems to be identical with that of RowParallelLinear, why do we need to rewrite it here?
In the original weight_loader:

```python
tp_rank = get_tensor_model_parallel_rank()
tp_size = get_tensor_model_parallel_world_size()
```

We need to replace these with the custom comm group:

```python
tp_rank = self.tp_rank
tp_size = self.tp_size
```

It seems that the latest vLLM does not have this problem.
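A minimal sketch of the point above, with illustrative names (this is not the actual vllm-ascend class): a layer bound to a custom communication group must shard loaded weights by its own `tp_rank`/`tp_size` rather than the global TP rank returned by `get_tensor_model_parallel_rank()`.

```python
# Illustrative sketch only: a row-parallel layer that shards weights using
# the rank/size of a *custom* comm group stored on the layer itself.
import torch
from torch.nn import Parameter


class CustomGroupRowParallelLinear:
    def __init__(self, input_size: int, output_size: int,
                 tp_rank: int, tp_size: int):
        # Rank/size come from the custom comm group (e.g. the oproj TP
        # group), not from the global tensor-parallel group.
        self.tp_rank = tp_rank
        self.tp_size = tp_size
        assert input_size % tp_size == 0
        self.input_size_per_partition = input_size // tp_size
        self.weight = Parameter(
            torch.empty(output_size, self.input_size_per_partition))

    def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor):
        # Slice the full checkpoint weight along the input dimension using
        # this layer's own rank, instead of the global TP rank.
        shard = self.input_size_per_partition
        start = self.tp_rank * shard
        param.data.copy_(loaded_weight[:, start:start + shard])
```

With `tp_size=4`, rank 1 takes columns `[2:4)` of an 8-column weight; the global TP rank never enters the computation.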
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 71c3e49 to b925b4b
```python
else:
    tp_rank = get_tensor_model_parallel_rank()
else:
    tp_rank = 0
```
The original code here is:

```python
if isinstance(layer, RowParallelLinear):
    tp_rank = get_tensor_model_parallel_rank()
    return self.quant_method.apply(layer, x, bias, tp_rank)
return self.quant_method.apply(layer, x, bias)
```

In the default case, tp_rank is simply not passed, which amounts to tp_rank = 0.
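The dispatch described above can be sketched as follows (stub classes, illustrative names only): only `RowParallelLinear` forwards the real TP rank, every other layer type falls through to the default, which behaves like rank 0.

```python
# Illustrative stubs, not the real vLLM classes.
class RowParallelLinear: ...
class ColumnParallelLinear: ...


def get_tensor_model_parallel_rank() -> int:
    # Pretend this process is TP rank 3 for demonstration.
    return 3


def resolve_tp_rank(layer) -> int:
    # Mirror of the quoted logic: only row-parallel layers pass their
    # TP rank on to quant_method.apply(); all others default to 0.
    if isinstance(layer, RowParallelLinear):
        return get_tensor_model_parallel_rank()
    return 0
```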
```
[mypy-numpy.*]
ignore_missing_imports = True
```
Why do we need to update these configurations? If this is a bug in the repo, I suggest creating a separate PR to fix it.

Alright, I will remove this; it was just for some local CI checks.
```python
if oproj_tp_enable():
    self.o_proj = RowParallelLinear(self.num_heads * self.v_head_dim,
                                    self.hidden_size,
                                    bias=False,
                                    quant_config=quant_config,
                                    prefix=f"{prefix}.o_proj")
elif (config.n_routed_experts is not None
      and self.debug_layer_idx >= config.first_k_dense_replace
      and self.debug_layer_idx % config.moe_layer_freq == 0
      and (ascend_config.torchair_graph_config.enable_multistream_moe
           or self.enable_shared_expert_dp)):
    self.o_proj = TorchairDeepseekV2RowParallelLinearReplaceAllreduce(
        self.num_heads * self.v_head_dim,
        self.hidden_size,
```
Is it still not possible to eliminate these if-else branches even with CustomOp? @wangxiyuan @Yikun
```python
if prefix.find("down_proj") != -1 and mlp_tp_enable():
    comm_group = get_mlp_tp_group()
    self.forward_type = "mlp_tp"
elif prefix.find("o_proj") != -1 and oproj_tp_enable():
    comm_group = get_otp_group()
    self.forward_type = "oproj_tp"
else:
    self.tp_size = get_tensor_model_parallel_world_size()
    self.tp_rank = get_tensor_model_parallel_rank()
    self.enable_mlp_optimze = False
    comm_group = get_tp_group()
    self.forward_type = "normal"
self.comm_group = comm_group
```
Is adding more if-else conditions the way to extend support for new models?
Force-pushed from 0e2fce6 to b1582b4
```python
                                  input_, num_partitions=self.tp_size)
input_parallel = splitted_input[self.tp_rank].contiguous()
assert self.quant_method is not None
# Choose different forward function according to the type of TP group
```
Using a dict here may be more extensible. The same logic applies as mentioned above.

I tried to modify it and found it not very intuitive, but I changed it to use super().forward() instead.
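For reference, the table-driven dispatch the reviewer suggested could look roughly like this (a sketch with hypothetical flag parameters, not the PR's actual code): adding a new TP variant then means adding one table entry instead of another `elif` branch.

```python
# Sketch of dict/table-based forward_type selection. The enable flags are
# passed in as plain booleans here for illustration; the real code queries
# mlp_tp_enable() / oproj_tp_enable().
def select_forward_type(prefix: str,
                        mlp_tp_enabled: bool,
                        oproj_tp_enabled: bool) -> str:
    # Each entry: (substring to match in the layer prefix, enable flag,
    # resulting forward_type).
    dispatch = [
        ("down_proj", mlp_tp_enabled, "mlp_tp"),
        ("o_proj", oproj_tp_enabled, "oproj_tp"),
    ]
    for needle, enabled, forward_type in dispatch:
        if needle in prefix and enabled:
            return forward_type
    return "normal"
```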
```python
                      name="SiluAndMul")
CustomOp.register_oot(_decorated_op_cls=AscendRotaryEmbedding,
                      name="RotaryEmbedding")
CustomOp.register_oot(_decorated_op_cls=AscendColumnParallelLinear,
```
If this component is enabled by default, modifications to the original vLLM repository will require ongoing maintenance and updates. What is the long-term maintenance strategy for this?

I think there won't be many changes here; we just need to focus on maintaining the __init__ method in follow-ups.
Force-pushed from 4e860b4 to 0c35616
According to the comment #2678 (comment), please remove the patch_linear as well.

For the e2e LoRA error, please add patches like the others do here: https://github.com/vllm-project/vllm-ascend/blob/main/vllm_ascend/patch/worker/patch_common/patch_lora_embedding.py
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
@wangxiyuan This PR is ready, and it also fixes the bug related to LinearBase.
```
@@ -0,0 +1,15 @@
import vllm
```
Looks like these 3 files can be merged into one.
What this PR does / why we need it?

This PR introduces tensor model parallelism for the o_proj matrix to reduce memory consumption. It only supports graph mode in the pure-DP scenario.

In a DeepSeek R1 W8A8 PD-disaggregated Decode instance using pure DP with oproj_tensor_parallel_size = 8, we observed a 1 ms TPOT increase and saved 5.8 GB of NPU memory per rank. We got the best performance with oproj_tensor_parallel_size = 4, with no TPOT increase.
performance data:

Does this PR introduce any user-facing change?

This PR introduces one new config in additional_config.example:

--additional_config={"oproj_tensor_parallel_size": 8}

How was this patch tested?
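The new option rides in as JSON through the additional-config mechanism. A minimal sketch of how an engine might parse it (the key name follows the PR description; the default value shown is an assumption):

```python
# Sketch: parsing the oproj TP size out of the additional_config JSON.
import json

raw = '{"oproj_tensor_parallel_size": 8}'
cfg = json.loads(raw)

# Assumed default of 1 (i.e. oproj TP disabled) when the key is absent.
otp_size = cfg.get("oproj_tensor_parallel_size", 1)
oproj_tp_enabled = otp_size > 1
```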