[Bugfix][KV Transfer] Use kv_transfer_params for P2pNcclConnector coordination#33947

Open
eicherseiji wants to merge 1 commit into vllm-project:main from eicherseiji:ci/add-p2p-lmcache-connector-tests

@eicherseiji
Contributor

@eicherseiji eicherseiji commented Feb 5, 2026

After #27987, Prefill and Decode get different internal request_ids, breaking P/D coordination in the P2P NCCL connector. The connector currently encodes the remote address into the request_id string and parses it back out with a regex, which is also fragile.

Implements the design from https://gist.github.com/markmc/0c10179d49bb7fed8b737e1cfa56f912: switch to kv_transfer_params following the NIXL pattern. The proxy injects the decode instance's KV address before sending to prefill, prefill returns its own address and request ID on completion, and the proxy forwards those to decode.
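
The handoff described above can be sketched as two payload-building steps in the proxy. This is only an illustration under stated assumptions, not the PR's code: the helper names and the `remote_kv_addr` / `remote_request_id` keys are hypothetical, and the real field names are defined by the connector and the linked design gist.

```python
from typing import Any


def build_prefill_request(client_body: dict[str, Any], decode_kv_addr: str) -> dict[str, Any]:
    """Copy the client request and attach the decode instance's KV address.

    ASSUMPTION: "remote_kv_addr" is a made-up key used for illustration.
    """
    body = dict(client_body)  # shallow copy so the client's request is not mutated
    body["kv_transfer_params"] = {"remote_kv_addr": decode_kv_addr}
    return body


def build_decode_request(
    client_body: dict[str, Any], prefill_response: dict[str, Any]
) -> dict[str, Any]:
    """Forward what prefill reported (its KV address and request ID) to decode.

    ASSUMPTION: prefill echoes these values under "kv_transfer_params" in its
    API response; both key names are hypothetical.
    """
    params = prefill_response["kv_transfer_params"]
    body = dict(client_body)
    body["kv_transfer_params"] = {
        "remote_kv_addr": params["remote_kv_addr"],
        "remote_request_id": params["remote_request_id"],
    }
    return body
```

Because the IDs and addresses travel in the request and response bodies, neither side has to reconstruct them by parsing the request_id string.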

Includes unit tests, updates to both proxy implementations, and design doc changes.

Repro steps

pip install quart aiohttp pyzmq msgpack

# Terminal 1: Proxy
python examples/online_serving/disaggregated_serving_p2p_nccl_xpyd/disagg_proxy_p2p_nccl_xpyd.py

# Terminal 2: Prefill (GPU 0)
CUDA_VISIBLE_DEVICES=0 vllm serve facebook/opt-125m \
    --enforce-eager --host 0.0.0.0 --port 20003 \
    --dtype float16 --max-model-len 2048 --gpu-memory-utilization 0.5 \
    --kv-transfer-config \
    '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_port":"21001","kv_connector_extra_config":{"proxy_ip":"0.0.0.0","proxy_port":"30001","http_port":"20003","send_type":"PUT_ASYNC"}}'

# Terminal 3: Decode (GPU 1)
CUDA_VISIBLE_DEVICES=1 vllm serve facebook/opt-125m \
    --enforce-eager --host 0.0.0.0 --port 20005 \
    --dtype float16 --max-model-len 2048 --gpu-memory-utilization 0.5 \
    --kv-transfer-config \
    '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_buffer_size":"5e9","kv_port":"22001","kv_connector_extra_config":{"proxy_ip":"0.0.0.0","proxy_port":"30001","http_port":"20005","send_type":"PUT_ASYNC"}}'

# Terminal 4: Test
curl -s http://localhost:11001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"facebook/opt-125m","prompt":"The capital of France is","max_tokens":20}' \
  | python -m json.tool

Without fix

[Screenshot: 546552007-3750c16c-d5b6-4d91-90c9-d33aa796f973]

With fix

[Screenshot: 2026-02-12 at 1:25:35 AM]

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds CI coverage for P2pNcclConnector and LMCacheConnectorV1 by adding new integration test steps and parameterizing the test script. The changes are well-structured and address a gap in test coverage. My feedback focuses on improving the new CI steps for better consistency and robustness by using the shared requirements file for installing dependencies.

@mergify mergify bot added the tpu (Related to Google TPUs) label Feb 5, 2026
@eicherseiji eicherseiji closed this Feb 5, 2026
@eicherseiji eicherseiji reopened this Feb 5, 2026
@eicherseiji eicherseiji changed the title from "[CI] Add integration tests for P2pNccl and LMCache connectors" to "[WIP][CI] Add integration tests for P2pNccl and LMCache connectors" Feb 5, 2026
@eicherseiji eicherseiji changed the title from "[WIP][CI] Add integration tests for P2pNccl and LMCache connectors" to "[CI][Bugfix] Fix P2P NCCL KV transfer + add PD connector integration tests" Feb 7, 2026
@eicherseiji eicherseiji changed the title from "[CI][Bugfix] Fix P2P NCCL KV transfer + add PD connector integration tests" to "[Bugfix] Fix P2P NCCL KV transfer using external_req_id" Feb 7, 2026
@mergify mergify bot added the bug (Something isn't working) label Feb 7, 2026
@eicherseiji eicherseiji force-pushed the ci/add-p2p-lmcache-connector-tests branch from da41489 to d9cec5f February 7, 2026 08:49
@mergify mergify bot removed the tpu (Related to Google TPUs) label Feb 7, 2026
@eicherseiji eicherseiji closed this Feb 7, 2026
@eicherseiji eicherseiji reopened this Feb 7, 2026
@eicherseiji eicherseiji force-pushed the ci/add-p2p-lmcache-connector-tests branch from 0c013d9 to ac9f7ef February 7, 2026 09:18
@eicherseiji
Contributor Author

Prioritizing bug fix, will follow up with CI test here: #34050

@eicherseiji
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a critical bug in the P2P NCCL KV transfer mechanism, which was causing hangs due to the use of a randomized internal request_id. The fix involves replacing this with the consistent external_req_id for coordination between prefill and decode instances. The changes correctly propagate the external_req_id through the Request and NewRequestData structures and apply it within the P2pNcclConnector. The implementation appears correct and effectively resolves the issue. I have included one suggestion to enhance code maintainability by refactoring an indexed tuple into a more robust NamedTuple or dataclass.

  self._requests_need_load: dict[str, Any] = {}
  self.is_producer = self._kv_transfer_config.is_kv_producer
- self.chunked_prefill: dict[str, tuple[list[int], list[int] | None]] = {}
+ self.chunked_prefill: dict[str, tuple[list[int], list[int] | None, str]] = {}
Contributor

high

The chunked_prefill dictionary stores a tuple with three elements, which are then accessed by index (e.g., [0], [1], [2]) in build_connector_meta. This practice is fragile and can lead to bugs if the tuple structure is modified in the future. To improve readability and maintainability, consider using a NamedTuple or a dataclass to provide meaningful names to the fields.

For example:

from typing import NamedTuple

class ChunkedPrefillData(NamedTuple):
    block_ids: list[int]
    prompt_token_ids: list[int] | None
    external_req_id: str

# In __init__
self.chunked_prefill: dict[str, ChunkedPrefillData] = {}

# Then in build_connector_meta, you can access fields by name:
# prompt_token_ids = self.chunked_prefill[req_id].prompt_token_ids
# kv_request_id = self.chunked_prefill[req_id].external_req_id

Collaborator

@NickLucche NickLucche left a comment

Hey thanks for the fix @eicherseiji !
Left a comment.
cc @markmc on whether this is the best approach to integrate the request_id.

  ) -> None:
  self.request_id = request_id
+ self.external_req_id = external_req_id
  self.client_index = client_index
Collaborator

I would personally prefer not to edit request.py unless deemed necessary more generally, as we're still trying to figure out how much the nccl one is actually used.
I think this mapping could be stored within the connector itself for the time being.
Let's hear @markmc's thoughts on this global change.

Contributor Author

Makes sense. I could also strip the trailing request ID characters, or add a method that does this on InputProcessor. TIA for taking a look @markmc!

@markmc
Member

markmc commented Feb 10, 2026

Hi @eicherseiji

I'm not familiar at all with this connector, and don't really have time right now to dig in deeply. But I did spend some time with Claude, trying to formulate some feedback and a recommendation based on how the NIXL connector works and also #32937 for the moriio connector. Obviously I could be missing something, but I think this is pretty solid: https://gist.github.com/markmc/0c10179d49bb7fed8b737e1cfa56f912

@shwgao

shwgao commented Feb 10, 2026

Hi, apologies — I didn't notice that #33947 by @eicherseiji already addresses the same bug before I opened this PR. Sorry for the duplicate!

That said, the two PRs take different approaches:

- #33947: Propagates external_req_id through Request and NewRequestData (changes 3 files: request.py, output.py, p2p_nccl_connector.py)
- #34278: Strips the assign_request_id() random suffix inside the connector itself via _strip_internal_id_suffix() (changes 1 file: p2p_nccl_connector.py only)

My approach avoids modifying request.py / output.py, which aligns with @NickLucche's comment on #33947 suggesting the mapping be kept within the connector for now.
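
A rough sketch of what suffix stripping could look like, for comparison. The actual `_strip_internal_id_suffix()` from #34278 is not shown in this thread, so everything here is an assumption: the function name is hypothetical, and the internal ID is assumed to be the external ID followed by a hyphen and 8 lowercase hex characters.

```python
import re

# ASSUMPTION: internal id == "<external id>-<8 lowercase hex chars>".
# The real suffix format is not confirmed anywhere in this thread.
_SUFFIX_RE = re.compile(r"-[0-9a-f]{8}$")


def strip_internal_id_suffix(request_id: str) -> str:
    """Return request_id with one trailing "-xxxxxxxx" hex suffix removed, if present."""
    return _SUFFIX_RE.sub("", request_id, count=1)


# Recovers the shared external ID on both prefill and decode.
print(strip_internal_id_suffix("cmpl-123-a1b2c3d4"))
# IDs without such a suffix are left unchanged.
print(strip_internal_id_suffix("cmpl-123"))
# Caveat: an external ID that itself ends in 8 hex characters is also truncated.
print(strip_internal_id_suffix("req-deadbeef"))
```

The last call shows the weak point of any pattern-based approach: it cannot tell a generated suffix apart from an external ID that happens to have the same shape.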

Totally fine to close #34278. Just wanted to offer an alternative, happy to defer to whatever the maintainers decide.

@orozery
Collaborator

orozery commented Feb 11, 2026

> Obviously I could be missing something, but I think this is pretty solid: https://gist.github.com/markmc/0c10179d49bb7fed8b737e1cfa56f912

I agree. I think Request.kv_transfer_params is a perfect fit for the solution.

@njhill
Member

njhill commented Feb 11, 2026

Agree we should avoid changing request.py and output.py for this if possible.

@shwgao

shwgao commented Feb 11, 2026

> > Obviously I could be missing something, but I think this is pretty solid: https://gist.github.com/markmc/0c10179d49bb7fed8b737e1cfa56f912
>
> I agree. I think Request.kv_transfer_params is a perfect fit for the solution.

Agreed, the proper long-term fix is to follow the NIXL pattern. I could also take a look at this refactor after @eicherseiji updates the PR, since I am currently working closely with the P2pNcclConnector.
However, this design touches multiple components (connector + proxy + protocol), so it goes beyond a simple bugfix. The P2P NCCL connector is currently completely broken on main, so it might be worth landing a minimal hotfix first to unblock users, then following up with the NIXL-pattern refactor as a separate PR.

@eicherseiji
Contributor Author

eicherseiji commented Feb 11, 2026

Thanks @markmc and all for the feedback. Will proceed with the kv_transfer_params design here.

In the meantime, maybe we can merge @shwgao's #34278 to recover main? @NickLucche, thoughts?

@markmc
Member

markmc commented Feb 11, 2026

I'd be more inclined to accept a CLI argument to disable the request ID randomization. This would be a temporary feature, available as a workaround to users of the broken connectors.

The P2P NCCL connector encoded network addresses in request_id strings
and parsed them with regex. After PR vllm-project#27987, prefill and decode have
different internal request_ids, breaking this scheme.

Follow the NIXL connector pattern: prefill returns its internal
request_id and KV address via kv_transfer_params in the API response;
the proxy forwards these to decode for coordination. No core engine
changes required.

Design: https://gist.github.com/markmc/0c10179d49bb7fed8b737e1cfa56f912
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
@eicherseiji eicherseiji changed the title from "[Bugfix] Fix P2P NCCL KV transfer using external_req_id" to "[Bugfix][KV Transfer] Use kv_transfer_params for P2pNcclConnector coordination" Feb 12, 2026
@eicherseiji eicherseiji force-pushed the ci/add-p2p-lmcache-connector-tests branch from ac9f7ef to 47b2549 February 12, 2026 09:26
@mergify

mergify bot commented Feb 12, 2026

Documentation preview: https://vllm--33947.org.readthedocs.build/en/33947/

@mergify mergify bot added the documentation (Improvements or additions to documentation) and performance (Performance-related issues) labels Feb 12, 2026
@eicherseiji
Contributor Author

eicherseiji commented Feb 12, 2026

@markmc, if it's temporary, thoughts on an environment variable for a more minimal change? #34415

This PR is ready to review.

@markmc
Member

markmc commented Feb 12, 2026

> @markmc, if it's temporary, thoughts on an environment variable for a more minimal change? #34415
>
> This PR is ready to review.

Thanks - I think this is a pragmatic solution 👍

@eicherseiji
Contributor Author

@markmc bumping for your review when you have a chance. Thanks!

Labels

bug (Something isn't working), ci/build, documentation (Improvements or additions to documentation), kv-connector, performance (Performance-related issues), v1

Projects

Status: In review

7 participants