[Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd by shwgao · Pull Request #34278 · vllm-project/vllm

shwgao · 2026-02-10T21:48:49Z

Purpose

Fix NCCL send/recv key mismatch in P2pNcclConnector that causes disaggregated prefill P2P NCCL XpYd KV cache transfer to hang indefinitely.

In the disaggregated prefill XpYd architecture, the proxy generates a single request_id (containing both prefill and decode addresses) and sends it to both vLLM instances. However, InputProcessor.assign_request_id() inside each vLLM instance independently appends a different random 8-char hex suffix to this ID:

# vllm/v1/engine/input_processor.py
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"

This causes the Prefill producer and Decode consumer to use different keys for send_tensor() / recv_tensor(), so the NCCL transfer never matches:

Prefill sends: "___prefill_addr_...___decode_addr_..._<uuid>-AAAA#layer.0"
Decode  recvs: "___prefill_addr_...___decode_addr_..._<uuid>-BBBB#layer.0"
                                                             ^^^^ MISMATCH

Fix: Since the suffix format is deterministic (- + 8 hex chars = 9 characters), we add a helper _strip_internal_id_suffix() that recovers the original proxy-generated ID by stripping the last 9 characters. A new transfer_id field on ReqMeta stores this stripped ID, and the worker-side methods (start_load_kv, save_kv_layer) use transfer_id instead of request_id for NCCL key matching and address parsing.

The fix is entirely scoped to p2p_nccl_connector.py — no changes to any other files.

Changes summary:

Added _strip_internal_id_suffix() helper function to recover the original proxy ID.
Added transfer_id field to ReqMeta dataclass, computed automatically in make_meta().
Updated start_load_kv() and save_kv_layer() to use request.transfer_id for NCCL send/recv keys and parse_request_id() address parsing.

Test Plan

Launch the disaggregated prefill XpYd example with ≥ 2 GPUs:

cd examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
export HF_TOKEN=<your_token>
bash disagg_example_p2p_nccl_xpyd.sh

Verify the benchmark completes successfully instead of hanging.
Verify via logs that Prefill and Decode sides now use the same transfer key (the original proxy-generated ID without the random suffix).
Unit test: confirm _strip_internal_id_suffix() correctly handles:
- Normal case: "prefix-a1b2c3d4" → "prefix"
- Edge case (no suffix): "short" → "short" (unchanged)
- Edge case (no dash at expected position): returns as-is

Test Result

Before fix: Benchmark hangs indefinitely — Prefill completes and sends KV data, but Decode waits forever because the recv key never matches the send key.

After fix: KV transfer completes successfully. Both Prefill and Decode recover the same original proxy ID and agree on the NCCL transfer key:

Prefill sends: "___prefill_addr_...___decode_addr_..._<uuid>#layer.0"
Decode  recvs: "___prefill_addr_...___decode_addr_..._<uuid>#layer.0"
                                                              ✅ MATCH

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…D suffix - Added a new function `_strip_internal_id_suffix` to recover the original request ID by removing the random suffix. - Updated `ReqMeta` to include `transfer_id`, which is used for P2P NCCL send/recv key matching. - Refactored P2P NCCL connector methods to utilize `transfer_id` instead of the original request ID for improved consistency in communication.

github-actions · 2026-02-10T21:48:59Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request addresses a critical bug in the P2pNcclConnector that causes hangs in disaggregated prefill setups due to NCCL key mismatches. The root cause is correctly identified as the random suffix appended to request_id by each vLLM instance. The proposed fix of stripping this suffix to create a consistent transfer_id is a good, localized solution.

My review identifies a critical edge case in the implementation of _strip_internal_id_suffix that could cause the fix to fail when the original request ID is empty, leading to the same hanging behavior. I've provided a suggestion to make the suffix stripping logic more robust, which should resolve this issue and also improve robustness against accidental stripping of similarly formatted IDs.

vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py

mergify · 2026-02-10T21:53:24Z

Hi @shwgao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

mergify · 2026-02-10T21:59:52Z

Hi @shwgao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: shouwei-OSU <gaosho@oregonstate.edu>

markmc · 2026-02-10T22:17:15Z

See #33947 (comment)

…ector.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Shouwei Gao <gaosho@oregonstate.edu>

haosdent · 2026-02-11T06:42:22Z

vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py

+_INTERNAL_ID_SUFFIX_LEN = 9
+
+
+def _strip_internal_id_suffix(request_id: str) -> str:


Should we have a unit test to cover this, to avoid request_id format changes in the future, and then this method broken.

markmc

We are proposing to merge #34415 as a temporary workaround for the problem

See #33947 (comment) for the correct long-term solution

shwgao · 2026-02-12T15:33:27Z

That's great. I can close this one.

shwgao requested review from ApostaC, NickLucche and orozery as code owners February 10, 2026 21:48

mergify bot added bug Something isn't working kv-connector labels Feb 10, 2026

gemini-code-assist bot reviewed Feb 10, 2026

View reviewed changes

vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_connector.py Show resolved Hide resolved

Merge branch 'main' into p2pncclconnector-hangs

e05b79a

Fix: resolve pre-commit check failures

7f0eedc

Signed-off-by: shouwei-OSU <gaosho@oregonstate.edu>

This was referenced Feb 10, 2026

[Bugfix][KV Transfer] Use kv_transfer_params for P2pNcclConnector coordination #33947

Open

[Bug]: P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd architecture due to assign_request_id() random suffix #34277

Open

Update vllm/distributed/kv_transfer/kv_connector/v1/p2p/p2p_nccl_conn…

ede5e71

…ector.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Shouwei Gao <gaosho@oregonstate.edu>

haosdent reviewed Feb 11, 2026

View reviewed changes

tlrmchlsmth added this to Large-Scale Serving Feb 11, 2026

github-project-automation bot moved this to Backlog in Large-Scale Serving Feb 11, 2026

tlrmchlsmth moved this from Backlog to In review in Large-Scale Serving Feb 11, 2026

markmc requested changes Feb 12, 2026

View reviewed changes

shwgao closed this Feb 12, 2026

github-project-automation bot moved this from In review to Done in Large-Scale Serving Feb 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd#34278

[Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd#34278
shwgao wants to merge 4 commits intovllm-project:mainfrom
shwgao:p2pncclconnector-hangs

shwgao commented Feb 10, 2026 •

edited by github-actions bot

Loading

Uh oh!

github-actions bot commented Feb 10, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

mergify bot commented Feb 10, 2026

Uh oh!

mergify bot commented Feb 10, 2026

Uh oh!

markmc commented Feb 10, 2026

Uh oh!

haosdent Feb 11, 2026

Uh oh!

markmc left a comment

Uh oh!

shwgao commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		_INTERNAL_ID_SUFFIX_LEN = 9


		def _strip_internal_id_suffix(request_id: str) -> str:

Uh oh!

Conversation

shwgao commented Feb 10, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

github-actions bot commented Feb 10, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

mergify bot commented Feb 10, 2026

Uh oh!

mergify bot commented Feb 10, 2026

Uh oh!

markmc commented Feb 10, 2026

Uh oh!

haosdent Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

markmc left a comment

Choose a reason for hiding this comment

Uh oh!

shwgao commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

shwgao commented Feb 10, 2026 •

edited by github-actions bot

Loading