Skip to content

[Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd#34278

Closed
shwgao wants to merge 4 commits intovllm-project:mainfrom
shwgao:p2pncclconnector-hangs
Closed

[Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd#34278
shwgao wants to merge 4 commits intovllm-project:mainfrom
shwgao:p2pncclconnector-hangs

Conversation

@shwgao
Copy link

@shwgao shwgao commented Feb 10, 2026

Purpose

Fix NCCL send/recv key mismatch in P2pNcclConnector that causes disaggregated prefill P2P NCCL XpYd KV cache transfer to hang indefinitely.

In the disaggregated prefill XpYd architecture, the proxy generates a single request_id (containing both prefill and decode addresses) and sends it to both vLLM instances. However, InputProcessor.assign_request_id() inside each vLLM instance independently appends a different random 8-char hex suffix to this ID:

# vllm/v1/engine/input_processor.py
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"

This causes the Prefill producer and Decode consumer to use different keys for send_tensor() / recv_tensor(), so the NCCL transfer never matches:

Prefill sends: "___prefill_addr_...___decode_addr_..._<uuid>-AAAA#layer.0"
Decode  recvs: "___prefill_addr_...___decode_addr_..._<uuid>-BBBB#layer.0"
                                                             ^^^^ MISMATCH

Fix: Since the suffix format is deterministic (- + 8 hex chars = 9 characters), we add a helper _strip_internal_id_suffix() that recovers the original proxy-generated ID by stripping the last 9 characters. A new transfer_id field on ReqMeta stores this stripped ID, and the worker-side methods (start_load_kv, save_kv_layer) use transfer_id instead of request_id for NCCL key matching and address parsing.

The fix is entirely scoped to p2p_nccl_connector.py — no changes to any other files.

Changes summary:

  1. Added _strip_internal_id_suffix() helper function to recover the original proxy ID.
  2. Added transfer_id field to ReqMeta dataclass, computed automatically in make_meta().
  3. Updated start_load_kv() and save_kv_layer() to use request.transfer_id for NCCL send/recv keys and parse_request_id() address parsing.

Test Plan

  1. Launch the disaggregated prefill XpYd example with ≥ 2 GPUs:

    cd examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/
    export HF_TOKEN=<your_token>
    bash disagg_example_p2p_nccl_xpyd.sh
  2. Verify the benchmark completes successfully instead of hanging.

  3. Verify via logs that Prefill and Decode sides now use the same transfer key (the original proxy-generated ID without the random suffix).

  4. Unit test: confirm _strip_internal_id_suffix() correctly handles:

    • Normal case: "prefix-a1b2c3d4""prefix"
    • Edge case (no suffix): "short""short" (unchanged)
    • Edge case (no dash at expected position): returns as-is

Test Result

Before fix: Benchmark hangs indefinitely — Prefill completes and sends KV data, but Decode waits forever because the recv key never matches the send key.

After fix: KV transfer completes successfully. Both Prefill and Decode recover the same original proxy ID and agree on the NCCL transfer key:

Prefill sends: "___prefill_addr_...___decode_addr_..._<uuid>#layer.0"
Decode  recvs: "___prefill_addr_...___decode_addr_..._<uuid>#layer.0"
                                                              ✅ MATCH

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…D suffix

- Added a new function `_strip_internal_id_suffix` to recover the original request ID by removing the random suffix.
- Updated `ReqMeta` to include `transfer_id`, which is used for P2P NCCL send/recv key matching.
- Refactored P2P NCCL connector methods to utilize `transfer_id` instead of the original request ID for improved consistency in communication.
@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added bug Something isn't working kv-connector labels Feb 10, 2026
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a critical bug in the P2pNcclConnector that causes hangs in disaggregated prefill setups due to NCCL key mismatches. The root cause is correctly identified as the random suffix appended to request_id by each vLLM instance. The proposed fix of stripping this suffix to create a consistent transfer_id is a good, localized solution.

My review identifies a critical edge case in the implementation of _strip_internal_id_suffix that could cause the fix to fail when the original request ID is empty, leading to the same hanging behavior. I've provided a suggestion to make the suffix stripping logic more robust, which should resolve this issue and also improve robustness against accidental stripping of similarly formatted IDs.

@mergify
Copy link

mergify bot commented Feb 10, 2026

Hi @shwgao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Copy link

mergify bot commented Feb 10, 2026

Hi @shwgao, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: shouwei-OSU <gaosho@oregonstate.edu>
@markmc
Copy link
Member

markmc commented Feb 10, 2026

See #33947 (comment)

…ector.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Shouwei Gao <gaosho@oregonstate.edu>
_INTERNAL_ID_SUFFIX_LEN = 9


def _strip_internal_id_suffix(request_id: str) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a unit test to cover this, to avoid request_id format changes in the future, and then this method broken.

Copy link
Member

@markmc markmc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are proposing to merge #34415 as a temporary workaround for the problem

See #33947 (comment) for the correct long-term solution

@shwgao
Copy link
Author

shwgao commented Feb 12, 2026

That's great. I can close this one.

@shwgao shwgao closed this Feb 12, 2026
@github-project-automation github-project-automation bot moved this from In review to Done in Large-Scale Serving Feb 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working kv-connector

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants