[DO NOT MERGE][ROCM][PD] Fix MoRIIO connector with transfer_id for P/D coordination#32937

Draft
markmc wants to merge 1 commit into vllm-project:main from markmc:moriio-transfer-id

Member

@markmc markmc commented Jan 23, 2026

After #27987 introduced random suffixes to internal request_ids, the MoRIIO connector broke because Prefill and Decode instances now have different internal request_ids (e.g., "cmpl-uuid-abc" vs "cmpl-uuid-def"). The MoRIIO connector's parallel dispatch model requires a stable, shared identifier for P/D coordination.

This commit introduces a transfer_id (format: "xfer-{uuid}") that is generated by the proxy and shared between Prefill and Decode for KV transfer coordination. The connector worker maintains bidirectional mappings to translate between transfer_id (used for P/D coordination) and internal request_id (used by the scheduler).
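As a rough illustration of the proxy side, the sketch below generates a single transfer_id in the "xfer-{uuid}" format and attaches it to the kv_transfer_params of both requests. The helper names (`make_transfer_id`, `build_pd_requests`) and the request dict layout are assumptions made for this example, not the code in this PR.

```python
import uuid


def make_transfer_id() -> str:
    """Return an identifier in the "xfer-{uuid}" format."""
    return f"xfer-{uuid.uuid4()}"


def build_pd_requests(base_request: dict) -> tuple[dict, dict]:
    """Hypothetical helper: build Prefill and Decode requests sharing one transfer_id."""
    transfer_id = make_transfer_id()
    requests = []
    for _ in range(2):
        # Copy any existing params so the two requests do not share a dict.
        params = dict(base_request.get("kv_transfer_params") or {})
        params["transfer_id"] = transfer_id
        requests.append({**base_request, "kv_transfer_params": params})
    prefill_request, decode_request = requests
    return prefill_request, decode_request
```

Because both requests carry the same value, each side can key KV transfer state on it regardless of the random suffix added to its internal request_id.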

Changes:

  • Proxy: Generate transfer_id and include it in kv_transfer_params for both Prefill and Decode requests
  • moriio_common.py: Add TransferId type alias, update ReqMeta, WriteTask, and LayerTransferPlan to use transfer_id
  • moriio_connector.py: Add bidirectional mappings between transfer_id and request_id, translate at the boundary in get_finished(), update message protocol to use "transfer_id" instead of "req_id"
  • moriio_engine.py: Update MoRIIOWrapper tracking lists and methods to use transfer_id, update message handling to extract transfer_id
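The bidirectional mapping and the translation at the get_finished() boundary could look roughly like the sketch below. The class name `TransferIdMap`, its methods, and the choice to drop entries once a transfer finishes are all assumptions for illustration; the actual connector may structure this differently.

```python
class TransferIdMap:
    """Hypothetical two-way map between transfer_id and internal request_id."""

    def __init__(self) -> None:
        self._to_request: dict[str, str] = {}
        self._to_transfer: dict[str, str] = {}

    def register(self, transfer_id: str, request_id: str) -> None:
        """Record the pairing when a request with a transfer_id is first seen."""
        self._to_request[transfer_id] = request_id
        self._to_transfer[request_id] = transfer_id

    def transfer_id_for(self, request_id: str) -> str:
        """Look up the shared ID used for P/D coordination."""
        return self._to_transfer[request_id]

    def to_request_ids(self, finished_transfer_ids: set[str]) -> set[str]:
        """Translate finished transfer_ids to request_ids for the scheduler.

        Assumption: a finished transfer no longer needs its mapping,
        so both directions are removed here.
        """
        request_ids: set[str] = set()
        for transfer_id in finished_transfer_ids:
            request_id = self._to_request.pop(transfer_id, None)
            if request_id is None:
                continue  # unknown transfer_id; ignored in this sketch
            self._to_transfer.pop(request_id, None)
            request_ids.add(request_id)
        return request_ids
```

Keeping the translation in one place means the transfer machinery only ever sees transfer_id, while the scheduler only ever sees its own request_id.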

Co-Authored-By: Mark McLoughlin <markmc@redhat.com>

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

mergify bot commented Jan 23, 2026

Documentation preview: https://vllm--32937.org.readthedocs.build/en/32937/

@mergify mergify bot added the documentation (Improvements or additions to documentation), rocm (Related to AMD ROCm), and kv-connector labels Jan 23, 2026
@@ -0,0 +1,377 @@
# MoRIIO Connector: Transfer ID Design
Member Author

To be clear, I don't propose committing this design doc to main. It's just here to help continue working on these changes.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses the issue of differing internal request_ids between Prefill and Decode instances in the MoRIIO connector by introducing a transfer_id. This new identifier is consistently used for KV transfer coordination, while internal request_ids are maintained for scheduler operations. The implementation includes necessary changes in the proxy to generate and propagate the transfer_id, updates to data structures, and the establishment of bidirectional mappings within the connector worker to translate between transfer_id and request_id. The design document clearly outlines the problem, solution strategy, architecture, and implementation plan, which is very helpful for understanding the changes. The changes appear to be thorough and correctly implemented according to the described design.
