[Bugfix] Fix P2pNcclConnector NCCL send/recv key mismatch in disaggregated prefill XpYd #34278
shwgao wants to merge 4 commits into vllm-project:main
…D suffix
- Added a new function `_strip_internal_id_suffix` to recover the original request ID by removing the random suffix.
- Updated `ReqMeta` to include `transfer_id`, which is used for P2P NCCL send/recv key matching.
- Refactored P2P NCCL connector methods to use `transfer_id` instead of the original request ID, so both sides of a transfer agree on the key.
Code Review
This pull request addresses a critical bug in the P2pNcclConnector that causes hangs in disaggregated prefill setups due to NCCL key mismatches. The root cause is correctly identified as the random suffix appended to request_id by each vLLM instance. The proposed fix of stripping this suffix to create a consistent transfer_id is a good, localized solution.
My review identifies a critical edge case in the implementation of _strip_internal_id_suffix that could cause the fix to fail when the original request ID is empty, leading to the same hanging behavior. I've provided a suggestion to make the suffix stripping logic more robust, which should resolve this issue and also improve robustness against accidental stripping of similarly formatted IDs.
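One way to handle the edge cases the review mentions is to strip the suffix only when the tail of the ID actually looks like one. The sketch below is illustrative and is not the code from this PR or the reviewer's suggestion. It assumes the internal suffix is `-` followed by exactly 8 lowercase hex characters, and the name `strip_internal_id_suffix` is a stand-in.

```python
import re

# Assumption: the internal suffix is "-" followed by 8 lowercase hex characters.
# \Z anchors at the very end of the string (no trailing-newline leniency).
_INTERNAL_ID_SUFFIX_RE = re.compile(r"-[0-9a-f]{8}\Z")


def strip_internal_id_suffix(request_id: str) -> str:
    """Return request_id without a trailing '-<8 hex chars>' suffix, if present.

    IDs that do not end in such a suffix are returned unchanged, and an ID
    that consists only of the suffix (empty original ID) becomes "".
    """
    return _INTERNAL_ID_SUFFIX_RE.sub("", request_id, count=1)
```

With this variant, `"prefix-a1b2c3d4"` becomes `"prefix"`, `"short"` is unchanged, and `"-a1b2c3d4"` (an empty original ID plus suffix) becomes `""`. A pattern check cannot distinguish a suffix from an original ID that happens to end in the same shape, so it is only safe when every ID reaching it has already been suffixed.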
Hi @shwgao, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: shouwei-OSU <gaosho@oregonstate.edu>
See #33947 (comment)
…ector.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Shouwei Gao <gaosho@oregonstate.edu>
_INTERNAL_ID_SUFFIX_LEN = 9

def _strip_internal_id_suffix(request_id: str) -> str:
Should we add a unit test to cover this, so that a future change to the request_id format does not silently break this method?
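A test along these lines could look like the sketch below. The body of `_strip_internal_id_suffix` is a stand-in written from the PR description (drop the last 9 characters when the ID is long enough), not the PR's actual implementation, and `_with_internal_suffix` is a hypothetical emulation of the suffix format rather than a vLLM function.

```python
import unittest
import uuid

_INTERNAL_ID_SUFFIX_LEN = 9  # "-" plus 8 hex characters, per the PR description


def _strip_internal_id_suffix(request_id: str) -> str:
    # Stand-in body (assumption): drop the last 9 characters when the ID is
    # long enough to contain the suffix; otherwise return it unchanged.
    if len(request_id) < _INTERNAL_ID_SUFFIX_LEN:
        return request_id
    return request_id[:-_INTERNAL_ID_SUFFIX_LEN]


def _with_internal_suffix(original: str) -> str:
    # Hypothetical emulation of the suffix format described in the PR.
    return f"{original}-{uuid.uuid4().hex[:8]}"


class StripInternalIdSuffixTest(unittest.TestCase):
    def test_strips_suffix(self):
        self.assertEqual(_strip_internal_id_suffix("prefix-a1b2c3d4"), "prefix")

    def test_short_id_unchanged(self):
        self.assertEqual(_strip_internal_id_suffix("short"), "short")

    def test_round_trip_with_generated_suffix(self):
        original = "req-123"
        suffixed = _with_internal_suffix(original)
        self.assertEqual(_strip_internal_id_suffix(suffixed), original)

    def test_suffix_length_matches_constant(self):
        # Guards against the suffix format drifting away from the constant.
        self.assertEqual(len(_with_internal_suffix("")), _INTERNAL_ID_SUFFIX_LEN)
```

In a real test, the round-trip case would be most useful if it called the actual ID-assignment code instead of the emulation, since that is what would catch a format change.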
markmc left a comment
We are proposing to merge #34415 as a temporary workaround for the problem
See #33947 (comment) for the correct long-term solution
That's great. I can close this one.
Purpose
Fix NCCL send/recv key mismatch in `P2pNcclConnector` that causes disaggregated prefill P2P NCCL XpYd KV cache transfer to hang indefinitely.

In the disaggregated prefill XpYd architecture, the proxy generates a single `request_id` (containing both prefill and decode addresses) and sends it to both vLLM instances. However, `InputProcessor.assign_request_id()` inside each vLLM instance independently appends a different random 8-char hex suffix to this ID.

This causes the Prefill producer and Decode consumer to use different keys for `send_tensor()` / `recv_tensor()`, so the NCCL transfer never matches.

Fix: Since the suffix format is deterministic (`-` + 8 hex chars = 9 characters), we add a helper `_strip_internal_id_suffix()` that recovers the original proxy-generated ID by stripping the last 9 characters. A new `transfer_id` field on `ReqMeta` stores this stripped ID, and the worker-side methods (`start_load_kv`, `save_kv_layer`) use `transfer_id` instead of `request_id` for NCCL key matching and address parsing.

The fix is entirely scoped to `p2p_nccl_connector.py`; no other files are changed.

Changes summary:
- Added `_strip_internal_id_suffix()` helper function to recover the original proxy ID.
- Added `transfer_id` field to the `ReqMeta` dataclass, computed automatically in `make_meta()`.
- Updated `start_load_kv()` and `save_kv_layer()` to use `request.transfer_id` for NCCL send/recv keys and `parse_request_id()` address parsing.

Test Plan
Launch the disaggregated prefill XpYd example with ≥ 2 GPUs:
Verify the benchmark completes successfully instead of hanging.
Verify via logs that Prefill and Decode sides now use the same transfer key (the original proxy-generated ID without the random suffix).
Unit test: confirm `_strip_internal_id_suffix()` correctly handles:
- `"prefix-a1b2c3d4"` → `"prefix"`
- `"short"` → `"short"` (unchanged)

Test Result
Before fix: Benchmark hangs indefinitely — Prefill completes and sends KV data, but Decode waits forever because the recv key never matches the send key.
After fix: KV transfer completes successfully. Both Prefill and Decode recover the same original proxy ID and agree on the NCCL transfer key.
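The before/after behaviour can be illustrated with a small self-contained simulation. Nothing here is vLLM code. `ReqMetaSketch`, `assign_internal_id`, the placeholder ID `proxy-req-0`, and the two fixed suffixes are all hypothetical, and the stripping helper is a stand-in written from the PR description.

```python
from dataclasses import dataclass, field

_INTERNAL_ID_SUFFIX_LEN = 9  # "-" plus 8 hex characters, per the PR description


def strip_internal_id_suffix(request_id: str) -> str:
    # Stand-in for the PR's helper (assumption): drop the trailing 9 characters.
    if len(request_id) < _INTERNAL_ID_SUFFIX_LEN:
        return request_id
    return request_id[:-_INTERNAL_ID_SUFFIX_LEN]


def assign_internal_id(proxy_id: str, suffix: str) -> str:
    # Hypothetical model of what each instance does to the incoming ID.
    # The real suffix is random; it is passed in to keep the demo deterministic.
    return f"{proxy_id}-{suffix}"


@dataclass
class ReqMetaSketch:
    request_id: str
    # Derived from request_id, mirroring the idea of a transfer_id field.
    transfer_id: str = field(init=False)

    def __post_init__(self) -> None:
        self.transfer_id = strip_internal_id_suffix(self.request_id)


proxy_id = "proxy-req-0"  # hypothetical placeholder for the proxy-generated ID
prefill = ReqMetaSketch(assign_internal_id(proxy_id, "a1b2c3d4"))
decode = ReqMetaSketch(assign_internal_id(proxy_id, "9f8e7d6c"))

# Keying on request_id: the two sides disagree, so send and recv never pair up.
keys_match_before = prefill.request_id == decode.request_id
# Keying on transfer_id: both sides recover the proxy-generated ID.
keys_match_after = prefill.transfer_id == decode.transfer_id
print(keys_match_before, keys_match_after)  # prints "False True"
```

The simulation only models key agreement. It does not exercise NCCL or the address parsing that the real connector performs on the recovered ID.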