
[Bugfix][V1][P/D] Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector #23403

Merged
simon-mo merged 2 commits into vllm-project:main from Abatom:xpyd-abnormal-outputs
Aug 25, 2025
Conversation

@Abatom Abatom commented Aug 22, 2025

Signed-off-by: Abatom <abzhonghua@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug where repeated identical requests could lead to abnormal outputs. The fix involves reverting a change that offloaded KVCache extraction to a background thread. By moving the extraction logic back into the main thread within P2pNcclConnector.save_kv_layer, it resolves a potential race condition where the KV cache could be modified before being sent, ensuring data integrity. The changes in P2pNcclEngine are consistent with this revert.

I've identified a potential critical issue where attn_metadata could be None, leading to a crash. Please see my detailed comment.
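The race condition described in the summary above can be illustrated with a minimal sketch. It is not the vLLM implementation: plain Python lists stand in for the KV cache, the names `extract` and `demo` are hypothetical, and the background sender is simulated by draining a queue after the cache has been overwritten, which keeps the outcome deterministic.

```python
import queue


def extract(kv_cache, slot_mapping):
    """Copy the selected slots out of the cache (a snapshot)."""
    return [kv_cache[i] for i in slot_mapping]


def demo():
    # Hypothetical cache: two slots belong to request 1, two are free.
    kv_cache = ["req1-a", "req1-b", "free", "free"]
    slot_mapping = [0, 1]

    # Deferred extraction: only a reference to the live cache is queued,
    # so the copy happens whenever the sender gets around to it.
    deferred = queue.Queue()
    deferred.put((kv_cache, slot_mapping))

    # Eager extraction: the snapshot is taken in the calling thread,
    # before anything else can touch the cache.
    eager = queue.Queue()
    eager.put(extract(kv_cache, slot_mapping))

    # The slots are reused by another request before the sender runs.
    kv_cache[0] = "req2-a"
    kv_cache[1] = "req2-b"

    # The sender now drains both queues.
    cache_ref, mapping = deferred.get()
    sent_deferred = extract(cache_ref, mapping)
    sent_eager = eager.get()
    return sent_deferred, sent_eager
```

In this sketch the deferred path sends the data of the second request, while the eager path sends what the first request actually wrote, which is the behavior the revert aims to restore.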

Comment on lines +257 to +263
if isinstance(attn_metadata, MLACommonMetadata):
    num_pages, page_size = layer.shape[0], layer.shape[1]
    return layer.reshape(num_pages * page_size, -1)[slot_mapping,
                                                    ...]
num_pages, page_size = layer.shape[1], layer.shape[2]
return layer.reshape(2, num_pages * page_size, -1)[:, slot_mapping,
                                                   ...]

critical

There is a potential None value for attn_metadata that is not handled here, which could lead to a runtime crash.

The start_load_kv method in this class (line 122) checks if attn_metadata is None and returns early. However, save_kv_layer does not have this check. If attn_metadata is None, isinstance(attn_metadata, MLACommonMetadata) on line 257 will evaluate to False. For an MLA model, this will cause the code to incorrectly take the non-MLA path, resulting in an IndexError on line 261 because layer.shape will have 3 dimensions instead of the expected 4.

To prevent this crash, it's recommended to add a check for attn_metadata is None at the beginning of the save_kv_layer method, similar to the implementation in start_load_kv.
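The suggested guard could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged code: nested Python lists stand in for tensors, `MLAMetadata` and `OtherMetadata` are hypothetical stand-ins for the real metadata classes, and `extract_kv` and `_gather` are made-up names. The flatten-then-select step mirrors the `reshape(...)[slot_mapping, ...]` indexing in the diff.

```python
class MLAMetadata:
    """Hypothetical stand-in for MLACommonMetadata."""


class OtherMetadata:
    """Hypothetical stand-in for non-MLA attention metadata."""


def _gather(pages, slot_mapping):
    # Flatten (num_pages, page_size) into one slot axis, then pick slots.
    flat = [slot for page in pages for slot in page]
    return [flat[i] for i in slot_mapping]


def extract_kv(layer, slot_mapping, attn_metadata):
    # Guard: without metadata there is nothing to extract, so return early
    # instead of falling through to the non-MLA branch.
    if attn_metadata is None:
        return None
    if isinstance(attn_metadata, MLAMetadata):
        # MLA layout assumed here: layer[page][slot]
        return _gather(layer, slot_mapping)
    # Non-MLA layout assumed here: layer[0] holds keys, layer[1] holds values
    keys, values = layer
    return [_gather(keys, slot_mapping), _gather(values, slot_mapping)]
```

A caller would treat a `None` result as "skip this layer", in the same spirit as the early return the review attributes to `start_load_kv`.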

Abatom commented Aug 25, 2025

@KuntaiDu PTAL!

@simon-mo simon-mo merged commit 9188ae7 into vllm-project:main Aug 25, 2025
12 checks passed
Terrencezzj pushed a commit to Terrencezzj/vllm that referenced this pull request Aug 25, 2025
…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403)

Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: Terrencezzj <terrence@cohere.ai>
tc-mb pushed a commit to tc-mb/vllm that referenced this pull request Aug 27, 2025
…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403)

Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: tc-mb <caitianchi@modelbest.cn>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403)

Signed-off-by: Abatom <abzhonghua@gmail.com>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403)

Signed-off-by: Abatom <abzhonghua@gmail.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403)

Signed-off-by: Abatom <abzhonghua@gmail.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403)

Signed-off-by: Abatom <abzhonghua@gmail.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…nput produce abnormal outputs for P2pNcclConnector (vllm-project#23403)

Signed-off-by: Abatom <abzhonghua@gmail.com>