[Feat] flashcomm_v2 optim solution#3232
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces the FlashComm2 optimization for tensor parallelism on Ascend NPUs, aiming to improve performance by optimizing communication patterns. The changes span configuration, parallel state management, and operator implementations. My review has identified a few issues: a critical bug in the parallel group initialization that can lead to a crash, a related potential resource leak in the group destruction logic, and incorrect formatting of error messages in the configuration validation. These issues should be addressed to ensure correctness and robustness.
```python
_FLASHCOMM2_OTP = None
_FLASHCOMM2_ODP = get_tp_group()
# ...
if flashcomm2_otp_size > 1:
```
The process group creation for FlashComm2 is guarded by `if flashcomm2_otp_size > 1:`. This causes `_FLASHCOMM2_OTP` to be `None` when `flashcomm2_oproj_tensor_parallel_size` is 1. However, `Flashcomm2OProjRowParallelOp` is still used in this case, and it attempts to access methods on the `_FLASHCOMM2_OTP` group, which will lead to a crash. The logic within this `if` block appears to correctly handle the size == 1 case by creating groups of size 1. The conditional guard should be removed, and its content unindented, to fix this critical bug.
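A minimal sketch of why removing the guard also covers the trivial case (`build_otp_rank_groups` is a hypothetical helper for illustration; the real code builds device groups through vLLM's parallel-state utilities):

```python
# Hypothetical sketch of the suggested fix: build the FlashComm2 OTP rank
# groups unconditionally, so flashcomm2_oproj_tensor_parallel_size == 1
# still yields valid singleton groups instead of leaving _FLASHCOMM2_OTP
# as None.
def build_otp_rank_groups(world_size: int, otp_size: int) -> list[list[int]]:
    """Partition ranks [0, world_size) into contiguous groups of otp_size."""
    if world_size % otp_size != 0:
        raise ValueError("world_size must be divisible by otp_size")
    return [
        list(range(start, start + otp_size))
        for start in range(0, world_size, otp_size)
    ]
```

With `otp_size == 1` this produces one singleton group per rank, so the op can always call group methods safely.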
In `Flashcomm2OProjRowParallelOp`, which uses `_FLASHCOMM2_OTP`, a check has been added to determine whether `flashcomm2_oproj_tensor_parallel_size` is 1, so no error occurs. Incidentally, setting the group to `None` avoids creating a redundant communication group when `flashcomm2_oproj_tensor_parallel_size` is 1, reducing buffer consumption.
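The alternative described here can be illustrated with a hypothetical stand-in for the op's communication step (names and the `all_to_all` method are assumptions for illustration, not the actual vLLM Ascend API):

```python
# Hypothetical illustration: the o_proj op checks the configured size and
# bypasses collective communication entirely when
# flashcomm2_oproj_tensor_parallel_size == 1, so _FLASHCOMM2_OTP may
# legitimately remain None in that mode.
def oproj_maybe_all_to_all(x, otp_group, otp_size: int):
    if otp_size == 1:
        # No group was created for the trivial case; return the input as-is.
        assert otp_group is None
        return x
    # Otherwise the group must exist and is used for the collective.
    return otp_group.all_to_all(x)
```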
vllm_ascend/ascend_config.py
Outdated
```python
raise AssertionError(
    "flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size}) cannot exceed global tensor parallel size ({global_tp_size})"
)
if global_tp_size % self.flashcomm2_oproj_tensor_parallel_size != 0:
    raise AssertionError(
        "Global tensor parallel size ({global_tp_size}) must be divisible by flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size})"
    )
```
The error message strings are not f-strings, so the variables inside the curly braces will not be interpolated. This will result in confusing and unhelpful error messages for users.
```diff
-raise AssertionError(
-    "flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size}) cannot exceed global tensor parallel size ({global_tp_size})"
-)
-if global_tp_size % self.flashcomm2_oproj_tensor_parallel_size != 0:
-    raise AssertionError(
-        "Global tensor parallel size ({global_tp_size}) must be divisible by flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size})"
-    )
+raise AssertionError(
+    f"flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size}) cannot exceed global tensor parallel size ({global_tp_size})"
+)
+if global_tp_size % self.flashcomm2_oproj_tensor_parallel_size != 0:
+    raise AssertionError(
+        f"Global tensor parallel size ({global_tp_size}) must be divisible by flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size})"
+    )
```
```python
_OTP = None
# ...
global _FLASHCOMM2_OTP
if _FLASHCOMM2_OTP and get_ascend_config().flashcomm2_oproj_tensor_parallel_size != 1:
```
The condition get_ascend_config().flashcomm2_oproj_tensor_parallel_size != 1 will prevent the _FLASHCOMM2_OTP group from being destroyed when its size is 1. If the initialization logic is fixed to create a group for size 1 (as suggested in another comment), this will cause a resource leak. The group should be destroyed if it was created, regardless of its size.
```diff
-if _FLASHCOMM2_OTP and get_ascend_config().flashcomm2_oproj_tensor_parallel_size != 1:
+if _FLASHCOMM2_OTP:
```
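The intended teardown symmetry can be sketched as follows (a hypothetical stand-in for illustration; the real code goes through vLLM's group-coordinator destruction path):

```python
# Hypothetical sketch: destroy the FlashComm2 OTP group whenever one was
# created, independent of its configured size, then drop the reference so
# a later re-initialization starts from a clean state.
def destroy_flashcomm2_otp(state: dict) -> None:
    group = state.get("otp_group")
    if group is not None:
        group.destroyed = True      # stand-in for the real destroy call
        state["otp_group"] = None   # release the reference
```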
`_FLASHCOMM2_OTP` is `None` when `get_ascend_config().flashcomm2_oproj_tensor_parallel_size == 1`.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
2. Rename the environment variable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE
3. Normalize the enabling logic for sp/fc2
4. Add TODO: normalize the communication domain

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
@wangxiyuan Please check whether this PR can be merged.
I suggest raising an issue to organize the roles and relationships among the flashcomm, flashcomm2, and enable_sp features, and the comm-domain reuse of lm_head_tp.
Within this week, we will raise an issue to clarify the connections between these features.
### What this PR does / why we need it?
Supports generalized FlashComm2 optimization, which reduces communication overhead, decreases RmsNorm computation, and saves one AllGather step by replacing Allreduce operations in the Attention module with pre-AlltoAll and post-AllGather operations (used in combination with FlashComm1). This feature is enabled during the Prefill phase and is recommended to be used together with FlashComm1, delivering broad performance improvements, especially in long sequence scenarios with large tensor parallelism (TP) configurations. Benchmark tests show that under TP16DP1 configuration, it can improve the prefill performance of the DeepSeek model by 8% on top of FlashComm1.
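As a generic illustration of the kind of collective decomposition involved (a NumPy simulation under simplified assumptions, not the Ascend kernels): summing partial attention outputs with an AllReduce is mathematically equivalent to an AlltoAll that redistributes sequence slices of each rank's partial result, a local reduction, and a later AllGather that restores the full sequence.

```python
import numpy as np

def all_reduce(shards):
    # Baseline: every rank ends up with the full summed tensor.
    total = np.sum(shards, axis=0)
    return [total.copy() for _ in shards]

def alltoall_then_allgather(shards):
    # Simplified FlashComm2-style decomposition: AlltoAll swaps sequence
    # slices of the partial results so rank r receives slice r from every
    # peer, each rank reduces its slice locally, and a deferred AllGather
    # restores the full sequence on all ranks.
    tp = len(shards)
    sliced = [np.array_split(s, tp) for s in shards]  # per-rank slices
    reduced = [
        np.sum([sliced[src][r] for src in range(tp)], axis=0)
        for r in range(tp)
    ]
    gathered = np.concatenate(reduced)                # AllGather
    return [gathered.copy() for _ in shards]

rng = np.random.default_rng(0)
partials = [rng.standard_normal(8) for _ in range(4)]  # 4 TP ranks
```

In the decomposed form, the work between the AlltoAll and the AllGather operates on sequence-sharded activations, which is what allows the RmsNorm computation to shrink and one AllGather to be saved when combined with FlashComm1.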
### Does this PR introduce any user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@83f478b