[Bugfix] Fix PP+PCP and PP+flashcomm1 bugs #5416
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request contains two critical bug fixes for scenarios involving pipeline parallelism. The first fix ensures that an all_gather operation for Prefill Context Parallelism is only executed on the last pipeline rank, preventing runtime errors on other ranks. The second fix corrects the handling of intermediate tensors during dummy runs when sequence parallelism is also enabled, addressing issues with tensor size calculation, buffer allocation, and slicing to prevent memory errors. Both changes are correct and improve the robustness of the model runner.
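To make the first fix concrete, here is a minimal sketch of the gating it describes, assuming vLLM's `get_pp_group()` helper; the `pcp_group` handle and the function name are illustrative, not the PR's actual code:

```python
# Hedged sketch of the PP+PCP fix: only the last pipeline rank holds final
# hidden_states, so the prefill-context-parallel all_gather must be skipped
# everywhere else. Helper and argument names are illustrative.
from vllm.distributed import get_pp_group

def maybe_gather_pcp_hidden_states(hidden_states, pcp_group):
    # On non-last PP ranks the forward pass yields IntermediateTensors,
    # not a final hidden_states tensor, so gathering there fails at runtime.
    if not get_pp_group().is_last_rank:
        return hidden_states
    # Only the last PP rank has real hidden_states to gather across
    # prefill-context-parallel ranks (token dim assumed to be dim 0).
    return pcp_group.all_gather(hidden_states, dim=0)
```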
```diff
+        intermediate_tokens = num_tokens_padded
         if enable_sp():
             tp_size = get_tensor_model_parallel_world_size()
-            actual_tokens = num_tokens // tp_size
+            intermediate_tokens = (num_tokens_padded + tp_size - 1) // tp_size
         if self.intermediate_tensors is None:
+            max_actual_tokens = self.max_num_tokens
+            if enable_sp():
+                max_actual_tokens = (self.max_num_tokens + tp_size - 1) // tp_size
             self.intermediate_tensors = (
                 self.model.make_empty_intermediate_tensors(
-                    batch_size=actual_tokens,
+                    batch_size=max_actual_tokens,
                     dtype=self.dtype,
                     device=self.device))
         intermediate_tensors = IntermediateTensors({
             k:
-            v[:num_tokens_padded]
+            v[:intermediate_tokens]
             for k, v in self.intermediate_tensors.items()
         })
```
This change correctly handles `intermediate_tensors` during a dummy run when Pipeline Parallelism (PP) and Sequence Parallelism (SP, via flashcomm1) are enabled. There are several important fixes here:

- Correct sharded size calculation: the size of intermediate tensors is now computed with ceiling division, `(num_tokens_padded + tp_size - 1) // tp_size`, which is the proper way to size a sharded tensor. The previous floor division was incorrect.
- Robust buffer allocation: `self.intermediate_tensors` is now allocated once with the maximum possible size (`self.max_num_tokens`), making it reusable across different dummy runs. Previously it was allocated from the current run's token count, which could be insufficient for subsequent runs.
- Correct tensor slicing: the intermediate tensors are now sliced with the sharded size (`intermediate_tokens`), preventing the out-of-bounds errors that could occur with the un-sharded `num_tokens_padded`.

These changes are critical for preventing OOM errors and ensuring correctness in memory estimation during dummy runs.
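As a quick sanity check of the ceiling-division point, a standalone example (not code from the PR):

```python
# Floor division under-allocates the per-rank shard whenever tp_size does
# not divide the padded token count evenly; ceiling division always fits.
tp_size = 4
num_tokens_padded = 10

floor_size = num_tokens_padded // tp_size                  # 2 -> too small
ceil_size = (num_tokens_padded + tp_size - 1) // tp_size   # 3 -> sufficient

assert floor_size * tp_size < num_tokens_padded   # 8 < 10: tokens lost
assert ceil_size * tp_size >= num_tokens_padded   # 12 >= 10: all tokens fit
```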
@lidenghui1110 PTAL
LGTM.
We considered this point before, but #4705 changed the logic of …
Good job, we missed the case where the slice space may be insufficient before.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
@wangxiyuan can this PR be merged now?

@jianzs @weijinqian0 feel free to merge this.
What this PR does / why we need it?
- Fixed the computation of the final hidden_states when pipeline parallelism and prefill context parallelism are enabled at the same time: hidden_states are required, and have the right tensor type, only on the last PP rank, so the PCP all_gather now runs only there.
- Fixed the shape of intermediate_tensors in the dummy_run when pipeline parallelism and flashcomm1 are enabled. The token dimension of intermediate_tensors must be divided by tp_size; otherwise the MoE layers raise errors (see the sketch after this list).
- Fixed the shape of self.intermediate_tensors so the buffer always offers sufficient slice space.
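As a rough illustration of the second point, the following standalone snippet shows the shape mismatch; all names and sizes here are assumptions for illustration, not the PR's code:

```python
# Why MoE breaks when dummy-run intermediate tensors skip the tp_size
# division under flashcomm1 (SP): the model runs on token shards, so an
# unsharded buffer cannot line up with the per-rank shapes MoE expects.
import torch

tp_size, num_tokens_padded, hidden = 4, 16, 32

# flashcomm1 shards the token dimension across TP ranks, so each rank
# actually processes (num_tokens_padded / tp_size, hidden):
per_rank = torch.empty(num_tokens_padded // tp_size, hidden)  # (4, 32)

# An unsharded dummy intermediate tensor keeps the full token count:
unsharded = torch.empty(num_tokens_padded, hidden)            # (16, 32)

# The MoE communication pattern expects per-rank shards that reassemble
# to the padded count; feeding it the unsharded tensor mismatches shapes.
assert per_rank.shape[0] * tp_size == num_tokens_padded
assert unsharded.shape[0] != per_rank.shape[0]
```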
vLLM version: release/v0.13.0
vLLM main: vllm-project/vllm@81786c8