[DO NOT MERGE][ROCM][PD] Fix MoRIIO connector with transfer_id for P/D coordination #32937
markmc wants to merge 1 commit into vllm-project:main
After vllm-project#27987 introduced random suffixes to internal request_ids, the MoRIIO connector broke because Prefill and Decode instances now have different internal request_ids (e.g., "cmpl-uuid-abc" vs "cmpl-uuid-def"). The MoRIIO connector's parallel dispatch model requires a stable, shared identifier for P/D coordination.

This commit introduces a `transfer_id` (format: "xfer-{uuid}") that is generated by the proxy and shared between Prefill and Decode for KV transfer coordination. The connector worker maintains bidirectional mappings to translate between transfer_id (used for P/D coordination) and internal request_id (used by the scheduler).

Changes:

- Proxy: Generate transfer_id and include it in kv_transfer_params for both Prefill and Decode requests
- moriio_common.py: Add TransferId type alias; update ReqMeta, WriteTask, and LayerTransferPlan to use transfer_id
- moriio_connector.py: Add bidirectional mappings between transfer_id and request_id; translate at the boundary in get_finished(); update message protocol to use "transfer_id" instead of "req_id"
- moriio_engine.py: Update MoRIIOWrapper tracking lists and methods to use transfer_id; update message handling to extract transfer_id

Co-Authored-By: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
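To illustrate the bidirectional mapping described in the commit message, here is a minimal sketch. It is not the code from this PR: the class name `TransferIdMap`, its methods, and the helper `translate_finished` are hypothetical names chosen for illustration, and the real connector may structure this differently. It only shows the idea of coordinating on transfer_id between Prefill and Decode while reporting internal request_ids back to the scheduler.

```python
from __future__ import annotations


class TransferIdMap:
    """Hypothetical two-way lookup between transfer_id and internal request_id."""

    def __init__(self) -> None:
        self._transfer_to_request: dict[str, str] = {}
        self._request_to_transfer: dict[str, str] = {}

    def register(self, transfer_id: str, request_id: str) -> None:
        # Record both directions so either side of the boundary can translate.
        self._transfer_to_request[transfer_id] = request_id
        self._request_to_transfer[request_id] = transfer_id

    def to_request_id(self, transfer_id: str) -> str | None:
        return self._transfer_to_request.get(transfer_id)

    def to_transfer_id(self, request_id: str) -> str | None:
        return self._request_to_transfer.get(request_id)

    def release(self, request_id: str) -> None:
        # Drop both directions once the request is done, so the maps do not grow.
        transfer_id = self._request_to_transfer.pop(request_id, None)
        if transfer_id is not None:
            self._transfer_to_request.pop(transfer_id, None)


def translate_finished(id_map: TransferIdMap, finished_transfer_ids: set[str]) -> set[str]:
    """Translate at the boundary: the P/D protocol uses transfer_id,
    while the scheduler expects internal request_ids."""
    finished: set[str] = set()
    for transfer_id in finished_transfer_ids:
        request_id = id_map.to_request_id(transfer_id)
        if request_id is not None:  # ignore transfers this worker never registered
            finished.add(request_id)
    return finished
```

In this sketch, unknown transfer_ids are silently skipped. Whether the actual connector should log or raise in that case is a design decision the sketch does not settle.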
Documentation preview: https://vllm--32937.org.readthedocs.build/en/32937/
@@ -0,0 +1,377 @@
# MoRIIO Connector: Transfer ID Design
To be clear - I don't propose committing this design doc to main. It's just here to help continue working on these changes.
Code Review
The pull request effectively addresses the issue of differing internal request_ids between Prefill and Decode instances in the MoRIIO connector by introducing a transfer_id. This new identifier is consistently used for KV transfer coordination, while internal request_ids are maintained for scheduler operations. The implementation includes necessary changes in the proxy to generate and propagate the transfer_id, updates to data structures, and the establishment of bidirectional mappings within the connector worker to translate between transfer_id and request_id. The design document clearly outlines the problem, solution strategy, architecture, and implementation plan, which is very helpful for understanding the changes. The changes appear to be thorough and correctly implemented according to the described design.
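The proxy-side step mentioned above (generate one transfer_id and propagate it to both instances) could look roughly like the following. This is a sketch under assumptions, not the PR's proxy code: the function names are hypothetical, the use of `uuid.uuid4()` is assumed (the source only states the "xfer-{uuid}" format), and the key name `"transfer_id"` inside kv_transfer_params is assumed from the description.

```python
import uuid


def make_transfer_id() -> str:
    # Format "xfer-{uuid}" is from the PR description; uuid4 is an assumption.
    return f"xfer-{uuid.uuid4()}"


def build_pd_requests(base_request: dict) -> tuple[dict, dict]:
    """Hypothetical helper: attach the same transfer_id to the Prefill and
    Decode copies of a request, without mutating the original."""
    transfer_id = make_transfer_id()
    existing_params = base_request.get("kv_transfer_params", {})

    # Each copy gets its own kv_transfer_params dict so later per-instance
    # edits on one side do not leak into the other.
    prefill_request = {
        **base_request,
        "kv_transfer_params": {**existing_params, "transfer_id": transfer_id},
    }
    decode_request = {
        **base_request,
        "kv_transfer_params": {**existing_params, "transfer_id": transfer_id},
    }
    return prefill_request, decode_request
```

Because both requests carry the same value, each instance can key its KV transfer bookkeeping on it regardless of the random suffix added to its own internal request_id.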