Skip to content

[Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1#5936

Merged
wangxiyuan merged 1 commit intovllm-project:mainfrom
coder-fny:fc2-remove-index
Jan 19, 2026
Merged

[Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1#5936
wangxiyuan merged 1 commit intovllm-project:mainfrom
coder-fny:fc2-remove-index

Conversation

@Levi-JQ
Copy link
Copy Markdown
Contributor

@Levi-JQ Levi-JQ commented Jan 15, 2026

What this PR does / why we need it?

When enable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1, we need index operation to reorganize the batch, because that we need ensure the correct batch-id for each rank after the reduce-scatter op in VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1. But we do not need it when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1, which dose not need reduce-scatter.
image

Does this PR introduce any user-facing change?

How was this patch tested?

@github-actions
Copy link
Copy Markdown
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to improve performance by removing an indexing operation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE is 1. While the intent is good, the current implementation introduces a logical error. I have provided a review comment with a suggested fix that correctly implements the optimization while maintaining the original logic.

Comment on lines +342 to +344
if self.otp_size != 1:
chunked = chunked[self.group_indices]
send_buf = chunked.flatten(1, 2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This change introduces a logic error when self.otp_size == 1. The goal is to avoid the indexing operation for performance, but the new implementation is incorrect.

When self.otp_size == 1, the original code chunked[self.group_indices].flatten(1, 2) is equivalent to chunked because chunked[self.group_indices] is equivalent to chunked.unsqueeze(1), and chunked.unsqueeze(1).flatten(1, 2) results in chunked itself (a 3D tensor).

The new code computes send_buf = chunked.flatten(1, 2), which changes send_buf from a 3D tensor to a 2D tensor and adds an unnecessary flatten operation. This is both incorrect and inefficient, contrary to the PR's goal.

The correct implementation should assign chunked directly to send_buf when self.otp_size == 1.

            send_buf = chunked[self.group_indices].flatten(1, 2) if self.otp_size != 1 else chunked

@Levi-JQ Levi-JQ force-pushed the fc2-remove-index branch 2 times, most recently from 3397b1d to 6dda84a Compare January 15, 2026 12:52
@ApsarasX ApsarasX added ready read for review ready-for-test start test by label for PR labels Jan 16, 2026
Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
@wangxiyuan wangxiyuan merged commit f0d4119 into vllm-project:main Jan 19, 2026
20 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 19, 2026
…to FIA_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (110 commits)
  [Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
  [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
  [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
  [Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
  [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
  [CI]fix for lint CI (vllm-project#5982)
  [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
  [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
  [EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
  [Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
  [doc]Table split  (vllm-project#5929)
  [Doc] Upgrade outdated ut doc (vllm-project#5937)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
  Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
  [Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
  [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
  [Feature] Support fine-grained shared expert overlap (vllm-project#5482)
  [Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
  [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
  ...
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 21, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (637 commits)
  [Performance] Remove index opetation when VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1 (vllm-project#5936)
  [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (vllm-project#5960)
  [Feature] Adapt DispathGmmCombineDecode opertor to align with weight scale dtype of small operators. [RFC: issue 5476] (vllm-project#5755)
  [Refactor] Move AttentionSpec initialization to Attention module (vllm-project#5834)
  [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (vllm-project#5897)
  [CI]fix for lint CI (vllm-project#5982)
  [Fusion] [Graph]Add Matmul Allreduce Rmsnorm fusion Pass (vllm-project#5034)
  [Refactor] Migrate profiler config from env vars to explicit ProfilerConfig (vllm-project#5928)
  [EPLB][Bugfix] Dispatch Allgather use log2phy if enable eplb (vllm-project#5933)
  [EPLB][Nightly][Bugfix] Get expert from moe layer only (vllm-project#5908)
  [Bugfix][MM] Fix multi-modal inference OOM issues by setting `expandable_segments:True` (vllm-project#5855)
  [doc]Table split  (vllm-project#5929)
  [Doc] Upgrade outdated ut doc (vllm-project#5937)
  [Lint]Style: Convert `vllm-ascend/` to ruff format(Batch vllm-project#2) (vllm-project#5977)
  Eagle3 mm support, enablement on qwen3vl (vllm-project#4848)
  [Doc] Remove Chinese characters from the icons in the doc. (vllm-project#5959)
  [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (vllm-project#5968)
  [Feature] Support fine-grained shared expert overlap (vllm-project#5482)
  [Bugfix] fix cpu offload hang with tp=1 (vllm-project#5963)
  [Feature]: Support 310P device run qwen2.5/3 dense and qwen2.5vl models (vllm-project#5776)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
…LLEL_SIZE=1 (vllm-project#5936)

### What this PR does / why we need it?
When enable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1, we need index
operation to reorganize the batch, because that we need ensure the
correct batch-id for each rank after the reduce-scatter op in
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1. But we do not need it when
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1, which dose not need
reduce-scatter.

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…LLEL_SIZE=1 (vllm-project#5936)

### What this PR does / why we need it?
When enable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1, we need index
operation to reorganize the batch, because that we need ensure the
correct batch-id for each rank after the reduce-scatter op in
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1. But we do not need it when
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1, which dose not need
reduce-scatter.

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…LLEL_SIZE=1 (vllm-project#5936)

### What this PR does / why we need it?
When enable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1, we need index
operation to reorganize the batch, because that we need ensure the
correct batch-id for each rank after the reduce-scatter op in
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1. But we do not need it when
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1, which dose not need
reduce-scatter.

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…LLEL_SIZE=1 (vllm-project#5936)

### What this PR does / why we need it?
When enable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1, we need index
operation to reorganize the batch, because that we need ensure the
correct batch-id for each rank after the reduce-scatter op in
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1. But we do not need it when
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1, which dose not need
reduce-scatter.

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…LLEL_SIZE=1 (vllm-project#5936)

### What this PR does / why we need it?
When enable VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1, we need index
operation to reorganize the batch, because that we need ensure the
correct batch-id for each rank after the reduce-scatter op in
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE>1. But we do not need it when
VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1, which dose not need
reduce-scatter.

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:ops ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants