[Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector by Abatom · Pull Request #23403 · vllm-project/vllm

Abatom · 2025-08-22T04:30:36Z

Purpose

A user reported, and I also observed, that after #20906 ([V1][P/D]Enhance Performance and code readability for P2pNcclConnector), repeated identical requests produce abnormal outputs on the subsequent requests. See issue #22965 ([Bug]: [xPyD]Abnormal results when using v1 P2pNcclConnector as KV cache transport: repeated requests for the same input produce abnormal outputs).
This PR reverts the change from #20906 described as "The KVCache sender offloads the KVCache extraction and reshape operations to a dedicated sending thread".

Test Plan

Test Result

(Optional) Documentation Update

Signed-off-by: Abatom <abzhonghua@gmail.com>

gemini-code-assist

Code Review

This pull request addresses a bug where repeated identical requests could lead to abnormal outputs. The fix involves reverting a change that offloaded KVCache extraction to a background thread. By moving the extraction logic back into the main thread within P2pNcclConnector.save_kv_layer, it resolves a potential race condition where the KV cache could be modified before being sent, ensuring data integrity. The changes in P2pNcclEngine are consistent with this revert.

I've identified a potential critical issue where attn_metadata could be None, leading to a crash. Please see my detailed comment.
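
The race described in the review summary can be illustrated with a minimal, standard-library-only sketch. It is not the vLLM implementation: a plain list stands in for the paged KV cache, and `extract`, `send_deferred`, `send_eager`, and `run_demo` are hypothetical names chosen for illustration. The assumption is that cache slots may be overwritten by a later request before the sending thread gets to run.

```python
import queue
import threading


def extract(cache, slot_mapping):
    """Copy the selected slots out of the shared cache (stand-in for KV extraction)."""
    return [cache[i] for i in slot_mapping]


def send_deferred(cache, slot_mapping, send_queue):
    # Deferred variant: only references are queued, so the extraction
    # runs later, inside the sending thread.
    send_queue.put(lambda: extract(cache, slot_mapping))


def send_eager(cache, slot_mapping, send_queue):
    # Eager variant: extract in the caller's thread, then queue the
    # snapshot so that later cache writes cannot affect it.
    snapshot = extract(cache, slot_mapping)
    send_queue.put(lambda: snapshot)


def run_demo(send_fn):
    cache = ["req1-k0", "req1-k1", "free", "free"]
    send_queue = queue.Queue()
    sent = []

    send_fn(cache, [0, 1], send_queue)

    # Assumed scenario: the same slots are reused by the next request
    # before the sending thread has processed the queued item.
    cache[0], cache[1] = "req2-k0", "req2-k1"

    def sender():
        while True:
            item = send_queue.get()
            if item is None:  # sentinel: no more work
                break
            sent.append(item())

    send_queue.put(None)
    worker = threading.Thread(target=sender)
    worker.start()
    worker.join()
    return sent[0]
```

In this sketch `run_demo(send_deferred)` returns the second request's values (`["req2-k0", "req2-k1"]`), while `run_demo(send_eager)` returns the first request's values (`["req1-k0", "req1-k1"]`). The eager variant trades some work in the calling thread for a payload that is fixed at the time of the call.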

gemini-code-assist · 2025-08-22T04:33:11Z

vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py

+            if isinstance(attn_metadata, MLACommonMetadata):
+                num_pages, page_size = layer.shape[0], layer.shape[1]
+                return layer.reshape(num_pages * page_size, -1)[slot_mapping,
+                                                                ...]
+            num_pages, page_size = layer.shape[1], layer.shape[2]
+            return layer.reshape(2, num_pages * page_size, -1)[:, slot_mapping,
+                                                               ...]


There is a potential None value for attn_metadata that is not handled here, which could lead to a runtime crash.

The start_load_kv method in this class (line 122) checks if attn_metadata is None and returns early. However, save_kv_layer does not have this check. If attn_metadata is None, isinstance(attn_metadata, MLACommonMetadata) on line 257 will evaluate to False. For an MLA model, this will cause the code to incorrectly take the non-MLA path, resulting in an IndexError on line 261 because layer.shape will have 3 dimensions instead of the expected 4.

To prevent this crash, it's recommended to add a check for attn_metadata is None at the beginning of the save_kv_layer method, similar to the implementation in start_load_kv.
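
A minimal sketch of the recommended guard, under stated assumptions: nested Python lists replace tensors, `MLAMetadataStub` and `StandardMetadataStub` are hypothetical stand-ins for `MLACommonMetadata` and any non-MLA metadata class, and returning `None` to signal "nothing to extract" is an illustrative choice rather than what `save_kv_layer` actually does.

```python
class MLAMetadataStub:
    """Hypothetical stand-in for MLACommonMetadata."""


class StandardMetadataStub:
    """Hypothetical stand-in for any non-MLA attention metadata."""


def flatten_pages(pages):
    # (num_pages, page_size, ...) -> (num_pages * page_size, ...)
    return [slot for page in pages for slot in page]


def extract_kv_from_layer(layer, slot_mapping, attn_metadata):
    # Guard first: isinstance(None, MLAMetadataStub) is False, so without
    # this check a missing metadata object would fall through to the
    # non-MLA branch regardless of the actual layer layout.
    if attn_metadata is None:
        return None

    if isinstance(attn_metadata, MLAMetadataStub):
        # MLA layout: layer is (num_pages, page_size, ...).
        flat = flatten_pages(layer)
        return [flat[i] for i in slot_mapping]

    # Non-MLA layout: layer is (2, num_pages, page_size, ...),
    # with keys and values along the leading axis.
    return [[flat[i] for i in slot_mapping]
            for flat in (flatten_pages(part) for part in layer)]
```

For example, with `mla_layer = [["a", "b"], ["c", "d"]]` and `slot_mapping = [1, 2]`, the MLA branch selects `["b", "c"]`; passing `None` as the metadata returns `None` instead of reinterpreting the layout.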

Abatom · 2025-08-25T02:38:52Z

@KuntaiDu PTAL!

bugfix

55a330e

Signed-off-by: Abatom <abzhonghua@gmail.com>

Abatom mentioned this pull request Aug 22, 2025

[Bug]: [xPyD]Abnormal results when using v1 P2pNcclConnector as KV cache transport: repeated requests for the same input produce abnormal outputs #22965

Closed

1 task

gemini-code-assist bot reviewed Aug 22, 2025

Merge branch 'main' into xpyd-abnormal-outputs

c731ff7

Abatom mentioned this pull request Aug 25, 2025

[V1][P/D]P2pNcclConnector supports flashinfer #23536

Merged

simon-mo approved these changes Aug 25, 2025

simon-mo merged commit 9188ae7 into vllm-project:main Aug 25, 2025
12 checks passed

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[Bugfix][V1][P/D]Fix the issue where repeated requests for the same i…

d9ed83f

…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403) Signed-off-by: Abatom <abzhonghua@gmail.com>

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

[Bugfix][V1][P/D]Fix the issue where repeated requests for the same i…

74d4c65

…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403) Signed-off-by: Abatom <abzhonghua@gmail.com>

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025

[Bugfix][V1][P/D]Fix the issue where repeated requests for the same i…

abb69db

…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403) Signed-off-by: Abatom <abzhonghua@gmail.com>

lianjiezh mentioned this pull request Sep 4, 2025

mcclRecv operator issue (算子mcclRecv问题) MetaX-MACA/vLLM-metax#28

Closed

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[Bugfix][V1][P/D]Fix the issue where repeated requests for the same i…

86b5011

…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403) Signed-off-by: Abatom <abzhonghua@gmail.com>

simon-mo merged 2 commits into vllm-project:main from Abatom:xpyd-abnormal-outputs
