[Bugfix] Fix PP+PCP and PP+flashcomm1 bugs #5416
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request contains two critical bug fixes for scenarios involving pipeline parallelism. The first fix ensures that an all_gather operation for Prefill Context Parallelism is only executed on the last pipeline rank, preventing runtime errors on other ranks. The second fix corrects the handling of intermediate tensors during dummy runs when sequence parallelism is also enabled, addressing issues with tensor size calculation, buffer allocation, and slicing to prevent memory errors. Both changes are correct and improve the robustness of the model runner.
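To make the first fix concrete, here is a minimal sketch of the gating it describes, assuming vLLM's `get_pp_group()` helper; the `pcp_group` handle and the function name are illustrative, not the PR's actual code:

```python
# Hedged sketch of the PP+PCP fix: only the last pipeline rank holds final
# hidden_states, so the prefill-context-parallel all_gather must be skipped
# everywhere else. Helper and argument names are illustrative.
from vllm.distributed import get_pp_group

def maybe_gather_pcp_hidden_states(hidden_states, pcp_group):
    # On non-last PP ranks the forward pass yields IntermediateTensors,
    # not a final hidden_states tensor, so gathering there fails at runtime.
    if not get_pp_group().is_last_rank:
        return hidden_states
    # Only the last PP rank has real hidden_states to gather across
    # prefill-context-parallel ranks (token dim assumed to be dim 0).
    return pcp_group.all_gather(hidden_states, dim=0)
```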
```diff
+        intermediate_tokens = num_tokens_padded
         if enable_sp():
             tp_size = get_tensor_model_parallel_world_size()
-            actual_tokens = num_tokens // tp_size
+            intermediate_tokens = (num_tokens_padded + tp_size - 1) // tp_size
         if self.intermediate_tensors is None:
+            max_actual_tokens = self.max_num_tokens
+            if enable_sp():
+                max_actual_tokens = (self.max_num_tokens + tp_size - 1) // tp_size
             self.intermediate_tensors = (
                 self.model.make_empty_intermediate_tensors(
-                    batch_size=actual_tokens,
+                    batch_size=max_actual_tokens,
                     dtype=self.dtype,
                     device=self.device))
         intermediate_tensors = IntermediateTensors({
             k:
-            v[:num_tokens_padded]
+            v[:intermediate_tokens]
             for k, v in self.intermediate_tensors.items()
         })
```
This change correctly handles `intermediate_tensors` during a dummy run when Pipeline Parallelism (PP) and Sequence Parallelism (SP, via flashcomm1) are enabled. There are several important fixes here:

- Correct sharded size calculation: the size of intermediate tensors is now computed with ceiling division, `(num_tokens_padded + tp_size - 1) // tp_size`, which is the proper way to size a sharded tensor. The previous floor division was incorrect.
- Robust buffer allocation: `self.intermediate_tensors` is now allocated once with the maximum possible size (`self.max_num_tokens`), making it reusable across different dummy runs. Previously it was allocated from the current run's token count, which could be insufficient for subsequent runs.
- Correct tensor slicing: the intermediate tensors are now sliced with the sharded size (`intermediate_tokens`), preventing the out-of-bounds errors that could occur with the un-sharded `num_tokens_padded`.

These changes are critical for preventing OOM errors and ensuring correctness in memory estimation during dummy runs.
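As a quick sanity check of the ceiling-division point, a standalone example (not code from the PR):

```python
# Floor division under-allocates the per-rank shard whenever tp_size does
# not divide the padded token count evenly; ceiling division always fits.
tp_size = 4
num_tokens_padded = 10

floor_size = num_tokens_padded // tp_size                  # 2 -> too small
ceil_size = (num_tokens_padded + tp_size - 1) // tp_size   # 3 -> sufficient

assert floor_size * tp_size < num_tokens_padded   # 8 < 10: tokens lost
assert ceil_size * tp_size >= num_tokens_padded   # 12 >= 10: all tokens fit
```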
@lidenghui1110 PTAL
LGTM.
We considered this point before, but #4705 changed the logic of …
Good job, we missed the case where the slice space may be insufficient before.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
@wangxiyuan can this PR be merged now?

@jianzs @weijinqian0 feel free to merge this.
What this PR does / why we need it?
- Fixed the computation of the final hidden_states when pipeline parallelism and prefill context parallelism are enabled at the same time: hidden_states are required, and have the right tensor type, only on the last PP rank, so the PCP all_gather now runs only there.
- Fixed the shape of intermediate_tensors in the dummy_run when pipeline parallelism and flashcomm1 are enabled. The token dimension of intermediate_tensors must be divided by tp_size; otherwise the MoE layers raise errors (see the sketch after this list).
- Fixed the shape of self.intermediate_tensors so the buffer always offers sufficient slice space.
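As a rough illustration of the second point, the following standalone snippet shows the shape mismatch; all names and sizes here are assumptions for illustration, not the PR's code:

```python
# Why MoE breaks when dummy-run intermediate tensors skip the tp_size
# division under flashcomm1 (SP): the model runs on token shards, so an
# unsharded buffer cannot line up with the per-rank shapes MoE expects.
import torch

tp_size, num_tokens_padded, hidden = 4, 16, 32

# flashcomm1 shards the token dimension across TP ranks, so each rank
# actually processes (num_tokens_padded / tp_size, hidden):
per_rank = torch.empty(num_tokens_padded // tp_size, hidden)  # (4, 32)

# An unsharded dummy intermediate tensor keeps the full token count:
unsharded = torch.empty(num_tokens_padded, hidden)            # (16, 32)

# The MoE communication pattern expects per-rank shards that reassemble
# to the padded count; feeding it the unsharded tensor mismatches shapes.
assert per_rank.shape[0] * tp_size == num_tokens_padded
assert unsharded.shape[0] != per_rank.shape[0]
```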
vLLM version: release/v0.13.0
vLLM main: vllm-project/vllm@81786c8