
[Bugfix] Fix PP+PCP and PP+flashcomm1 bugs #5416

Merged
jianzs merged 5 commits into vllm-project:main from gjc0824:pp_bugfix on Jan 26, 2026

Conversation

@gjc0824
Contributor

@gjc0824 gjc0824 commented Dec 27, 2025

What this PR does / why we need it?

  • Fixed the computation of the final hidden_states when pipeline parallelism and prefill context parallelism are enabled at the same time. hidden_states are only required on the last PP rank, and only there do they have the correct tensor type (see the sketch after this list).

  • Fixed the shape of intermediate_tensors in the dummy_run when pipeline parallelism and flashcomm1 are enabled. The intermediate_tensors must be sharded by tp_size; otherwise the MoE layers raise errors.

  • Fixed the shape of self.intermediate_tensors so there is sufficient slice space.

  • vLLM version: release/v0.13.0

  • vLLM main: vllm-project/vllm@81786c8
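
A minimal sketch of the first fix's control flow. The function and parameter names below are illustrative stand-ins, not the actual vllm-ascend model-runner API:

```python
from typing import Any, Callable

def finalize_hidden_states(output: Any, is_last_pp_rank: bool,
                           pcp_world_size: int,
                           pcp_all_gather: Callable[..., Any]) -> Any:
    # On non-last PP ranks the model returns IntermediateTensors rather than a
    # hidden_states tensor, so the prefill-context-parallel all-gather must be
    # skipped and the output forwarded to the next pipeline stage as-is.
    if not is_last_pp_rank:
        return output
    if pcp_world_size > 1:
        # Only the last PP rank holds real hidden_states shards; gather them
        # back into the full tensor along the token dimension.
        output = pcp_all_gather(output, dim=0)
    return output
```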

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request contains two critical bug fixes for scenarios involving pipeline parallelism. The first fix ensures that an all_gather operation for Prefill Context Parallelism is only executed on the last pipeline rank, preventing runtime errors on other ranks. The second fix corrects the handling of intermediate tensors during dummy runs when sequence parallelism is also enabled, addressing issues with tensor size calculation, buffer allocation, and slicing to prevent memory errors. Both changes are correct and improve the robustness of the model runner.

Comment thread vllm_ascend/worker/model_runner_v1.py
Comment on lines +2108 to 2125
    intermediate_tokens = num_tokens_padded
    if enable_sp():
        tp_size = get_tensor_model_parallel_world_size()
        actual_tokens = num_tokens // tp_size
        intermediate_tokens = (num_tokens_padded + tp_size - 1) // tp_size
    if self.intermediate_tensors is None:
        max_actual_tokens = self.max_num_tokens
        if enable_sp():
            max_actual_tokens = (self.max_num_tokens + tp_size - 1) // tp_size
        self.intermediate_tensors = (
            self.model.make_empty_intermediate_tensors(
-               batch_size=actual_tokens,
+               batch_size=max_actual_tokens,
                dtype=self.dtype,
                device=self.device))
    intermediate_tensors = IntermediateTensors({
        k:
-       v[:num_tokens_padded]
+       v[:intermediate_tokens]
        for k, v in self.intermediate_tensors.items()
    })
Contributor


critical

This change correctly handles intermediate_tensors during a dummy run when Pipeline Parallelism (PP) and Sequence Parallelism (SP, via flashcomm1) are enabled. There are several important fixes here:

  1. Correct sharded size calculation: The size of intermediate tensors is now correctly calculated using ceiling division (num_tokens_padded + tp_size - 1) // tp_size, which is the proper way to determine the size of a sharded tensor. The previous floor division was incorrect.
  2. Robust buffer allocation: self.intermediate_tensors is now allocated once with the maximum possible size (self.max_num_tokens), making it robust and reusable across different dummy runs. Previously, it was allocated based on the current run's token count, which could be insufficient for subsequent runs.
  3. Correct tensor slicing: The slicing of the intermediate tensors now correctly uses the sharded size (intermediate_tokens), preventing potential out-of-bounds errors that could occur when using the un-sharded num_tokens_padded.

These changes are critical for preventing OOM errors and ensuring correctness in memory estimation during dummy runs.
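
To make the sharded-size arithmetic concrete, here is a small standalone example of the ceiling-division rule from point 1 above; sharded_tokens is a hypothetical helper, not the model-runner code itself:

```python
def sharded_tokens(num_tokens_padded: int, tp_size: int) -> int:
    # Ceiling division: each TP rank must hold enough rows for its shard even
    # when num_tokens_padded is not an exact multiple of tp_size.
    return (num_tokens_padded + tp_size - 1) // tp_size

# With 10 padded tokens and tp_size=4, floor division (10 // 4 == 2) would
# under-allocate; ceiling division reserves 3 rows per rank.
assert sharded_tokens(10, 4) == 3
assert sharded_tokens(12, 4) == 3
```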

@jianzs
Collaborator

jianzs commented Dec 28, 2025

@lidenghui1110 PTAL

@lidenghui1110
Contributor

lidenghui1110 commented Dec 29, 2025

> @lidenghui1110 PTAL

LGTM.

> Fixed the shape of intermediate_tensors in the dummy_run when enabling pipeline parallel and flashcomm1. The intermediate_tensors should be divided by tp_size. Otherwise, the moe will raise issues.

We considered this point before, but #4705 changed the logic of num_tokens_padded. We need to add more UT to cover pipeline parallelism cases.

> Fixed the shape of self.intermediate_tensors for sufficient slice space

Good job; we previously missed the case where the slice space could be insufficient.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
@wangxiyuan added the ready (read for review) and ready-for-test (start test by label for PR) labels on Jan 5, 2026
@github-actions
Contributor

github-actions Bot commented Jan 8, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: Jingchun Gao <gaojingchun1@huawei.com>
@lidenghui1110
Contributor

@wangxiyuan can this PR be merged now?

@wangxiyuan
Collaborator

@jianzs @weijinqian0 feel free to merge this.

@jianzs jianzs merged commit b390e0e into vllm-project:main Jan 26, 2026
16 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Jan 28, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (86 commits)
  [refactor] refactor excute_model and _dymmy_run method  (vllm-project#6043)
  [Refactor] profiler config optimze (vllm-project#6141)
  [Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (vllm-project#6006)
  [UT]: refactoring 310p ops ut (vllm-project#6296)
  [Refact.]: refactoring 310p-kv cache allocator, align with main branch (vllm-project#6270)
  [Misc] Removes unnecessary graph size re-initialization (vllm-project#6280)
  [Main2Main] Upgrade vllm commit to 0123 (vllm-project#6169)
  [BugFix] Fix wheel package build workflow (vllm-project#6276)
  [CI][BugFix] Qwen3-Next nightly test fix. (vllm-project#6247)
  [Doc] quick fix for vllm-ascend version (vllm-project#6278)
  [Community] Nominate whx-sjtu as maintainer (vllm-project#6268)
  [Lint] Fix mypy issue to make CI happy (vllm-project#6272)
  BugFix:  Fix moe_load accumulation error in ACL graph mode (vllm-project#6182)
  [Patch] Remove the patch of ECExampleConnector (vllm-project#5976)
  [Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (vllm-project#5416)
  [Feat] proxy delay to remove instances (vllm-project#5934)
  [CI] Add workfolw_dispatch for nightly image build (vllm-project#6269)
  [bugfix][npugraph_ex]fix static kernel uninstall issue (vllm-project#6128)
  [Doc] 310P Documents update (vllm-project#6246)
  [Feature] Mooncake connector get remote ptp size (vllm-project#5822)
  ...
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
jiangyunfan1 pushed a commit to jiangyunfan1/vllm-ascend that referenced this pull request Apr 9, 2026
SlightwindSec added a commit to SlightwindSec/vllm-ascend that referenced this pull request Apr 14, 2026
1. Fix intermediate tensor shape in dummy_run when SP+PP enabled
   (from b390e0e, PR vllm-project#5416):
   - Divide intermediate tensor tokens by tp_size when enable_sp()
   - Allocate self.intermediate_tensors with max_tokens / tp_size
   - Prevents extra all-gather on non-first PP ranks causing OOM

2. Refactor MC2 communication group for PP compatibility
   (adapted from 71df17f, PR vllm-project#7291):
   - Reshape all_ranks to 4D: (ExternalDP, DP, PP, TP)
   - MC2 group_ranks now uses EP-like layout via transpose(1,2)
     so ranks within the same PP stage are grouped together
   - Update P_TP alltoall group construction for PP dimension

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
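
The 4-D rank layout in point 2 of the commit message above can be illustrated with a small self-contained example; the sizes and variable names here are invented for illustration and are not taken from the actual code:

```python
import torch

external_dp, dp, pp, tp = 1, 2, 2, 2  # toy sizes for illustration
all_ranks = torch.arange(external_dp * dp * pp * tp).reshape(external_dp, dp, pp, tp)

# Swapping the DP and PP axes (transpose(1, 2)) before flattening groups ranks
# of the same PP stage together, giving an EP-like layout of dp * tp ranks per group.
mc2_group_ranks = all_ranks.transpose(1, 2).reshape(-1, dp * tp).tolist()
print(mc2_group_ranks)  # [[0, 1, 4, 5], [2, 3, 6, 7]] with these toy sizes
```
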
@gjc0824 gjc0824 deleted the pp_bugfix branch April 18, 2026 05:27

Labels

ready (read for review), ready-for-test (start test by label for PR)
