[Omni Connector] Omni Transfer Engine Connector: Enable 1-receiver-to-N-senders to support Bagel TP/CFG parallel #2731
Force-pushed: 8406d30 to 5181a92
  serialized_data = self.serialize_obj(data)
  key = self._make_key(put_key, from_stage, to_stage)
- self.store.put(key, serialized_data, self.pin)
+ put_rc = self.store.put(key, serialized_data, self.pin)
Will the 1-receiver-to-N-senders pattern (partial metadata / per-rank endpoint) also extend to MooncakeStoreConnector? Right now only MooncakeTransferEngineConnector has the multi-sender routing, but the store connector still uses a single store.put(key, data) — curious if heterogeneous TP will need similar changes there.
MooncakeStoreConnector does not need to create p2p side channel peers, and it naturally supports this.
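To make the reply concrete, here is a minimal sketch of why a key-addressed store needs no per-sender routing. `DictStore` and the key layout in `make_key` are hypothetical stand-ins, not the Mooncake store API; only the `(put_key, from_stage, to_stage)` argument shape is taken from the diff in this thread.

```python
import pickle


class DictStore:
    """Hypothetical in-memory stand-in for a key-value store backend."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value
        return 0  # assumed convention: 0 means success

    def get(self, key):
        return self._data.get(key)


def make_key(put_key, from_stage, to_stage):
    # Assumed key layout; the real _make_key() format is not shown in this PR.
    return f"{put_key}@{from_stage}->{to_stage}"


store = DictStore()

# Two sender ranks publish their shards under distinct keys.
for rank in range(2):
    key = make_key(f"req0_rank{rank}", from_stage=0, to_stage=1)
    store.put(key, pickle.dumps({"rank": rank, "shard": [rank] * 4}))

# One receiver reads every shard by key; no per-sender endpoint is involved.
shards = [
    pickle.loads(store.get(make_key(f"req0_rank{rank}", 0, 1)))
    for rank in range(2)
]
```

Because senders and the receiver only agree on keys, adding more sender ranks in this sketch changes the set of keys and nothing else.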
…leanup

Mooncake RDMA connector - multi-source receiving:
- Add per-rank sender endpoint registry (_sender_endpoints) and update_sender_info(sender_rank=...) for N-sender registration
- Add _resolve_sender_endpoint() for rank-based endpoint routing
- Add get() Path 2: partial metadata (host/port only, no data_size) queries the specified sender then RDMA pulls - enables heterogeneous TP where one receiver pulls KV shards from multiple sender ranks
- Extract _query_metadata_at() from _query_metadata_from_sender() to deduplicate ZMQ query logic (~53 lines saved)
- Fix data_size check from falsy to "data_size" not in metadata

SHM connector - metadata fallback & lifecycle:
- Add _get_by_key() fallback when metadata lacks SHM handles (e.g. RDMA-style metadata passed to SHM connector)
- Track _pending_keys for cleanup(request_id) and close() lifecycle

Other:
- base.py: document metadata parameter semantics for heterogeneous TP
- mooncake_store_connector: align with updated connector interface
- initialization: add KV_RANK_PORT_STRIDE constant for per-rank ZMQ port
- tests: add test_shm_connector covering key-based R/W, metadata fallback, heterogeneous TP multi-key, and cleanup/close

Signed-off-by: natureofnature <wzliu@connect.hku.hk>
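The per-rank registry described in this commit message could look roughly like the sketch below. This is not the PR's implementation: the `SenderRegistry` class, the `rank_port` helper, the stride value of 16, and the fallback to a default endpoint are all assumptions made for illustration. Only the names `_sender_endpoints`, `update_sender_info(sender_rank=...)` and `KV_RANK_PORT_STRIDE` come from the commit message.

```python
KV_RANK_PORT_STRIDE = 16  # assumed value; the PR only names the constant


class SenderRegistry:
    """Hypothetical per-rank endpoint registry held by one receiver."""

    def __init__(self, default_host, base_port):
        self._default = (default_host, base_port)
        self._sender_endpoints = {}  # sender_rank -> (host, port)

    def update_sender_info(self, host, port, sender_rank=None):
        # Without a rank, update the single default sender (1-to-1 case).
        if sender_rank is None:
            self._default = (host, port)
        else:
            self._sender_endpoints[sender_rank] = (host, port)

    def resolve_sender_endpoint(self, sender_rank=None):
        # Unknown or missing ranks fall back to the default endpoint.
        return self._sender_endpoints.get(sender_rank, self._default)


def rank_port(base_port, rank):
    # Assumed use of the stride: each rank gets its own ZMQ port offset.
    return base_port + rank * KV_RANK_PORT_STRIDE


registry = SenderRegistry("10.0.0.1", 50000)
for rank in range(2):
    registry.update_sender_info(
        "10.0.0.1", rank_port(50000, rank), sender_rank=rank
    )

endpoint_for_rank_1 = registry.resolve_sender_endpoint(1)
```

Keeping the default endpoint alongside the per-rank map is one way to leave the existing 1-to-1 path unchanged when no `sender_rank` is supplied.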
Force-pushed: 1bd945f to d0c11ac
Force-pushed: d0c11ac to 272e520
PR #2731 - [Omni Connector] 1-receiver-to-N-senders
OVERALL: NO BLOCKERS (check AMD CI)
Correctness: PASS
Summary: Omni Transfer Engine Connector: 1-receiver-to-N-senders for Bagel TP/CFG parallel. 422 additions, 87 deletions. pre-commit and build pass; AMD CI is failing. Test results show 16 samples generated. Please check the AMD CI failure.
The AMD CI failure appears to be caused by a timeout and is unrelated to this PR. @hsliuustc0106
@natureofnature I fixed the single-stage t2i bug (by adding multiple <start_vision> and <end_vision>). Can you compare them again?
I tried both online and offline modes (no thinking), but the results still differ even without this PR's code. @princepride
…-N-senders to support Bagel TP/CFG parallel (vllm-project#2731) Signed-off-by: natureofnature <wzliu@connect.hku.hk>


Purpose
This is the first PR for #2705; it updates the connector. The goal is to support TP+CFG parallelism for AR/DIT disaggregation.
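connector: support 1-receiver-to-N-senders, SHM metadata fallback & cleanup
Mooncake Transfer Engine connector - multi-source receiving:
- Add per-rank sender endpoint registry (_sender_endpoints) and update_sender_info(sender_rank=...) for N-sender registration
- Add _resolve_sender_endpoint() for rank-based endpoint routing
- Add get() Path 2: partial metadata (host/port only, no data_size) queries the specified sender then RDMA pulls - enables heterogeneous TP where one receiver pulls KV shards from multiple sender ranks
- Extract _query_metadata_at() from _query_metadata_from_sender() to deduplicate ZMQ query logic (~53 lines saved)
- Fix data_size check from falsy to "data_size" not in metadata
SHM connector - metadata fallback & lifecycle:
- Add _get_by_key() fallback when metadata lacks SHM handles (e.g. RDMA-style metadata passed to SHM connector)
- Track _pending_keys for cleanup(request_id) and close() lifecycle
Other:
- base.py: document metadata parameter semantics for heterogeneous TP
- mooncake_store_connector: align with updated connector interface
- initialization: add KV_RANK_PORT_STRIDE constant for per-rank ZMQ port
- tests: add test_shm_connector covering key-based R/W, metadata fallback, heterogeneous TP multi-key, and cleanup/close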
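As an illustration of the data_size change, the sketch below shows how a membership test separates partial metadata from full metadata, and one case where a falsy test would disagree. The helper `select_get_path` and its return labels are hypothetical; the PR only states that partial metadata (host/port, no data_size) triggers a query to the specified sender before the RDMA pull. Treating a zero data_size as valid full metadata is an assumption used to show where the two checks diverge.

```python
def select_get_path(metadata):
    """Classify metadata by the receive path it would take (hypothetical)."""
    if "data_size" not in metadata:
        # Partial metadata: host/port only, so ask that sender for the rest.
        return "query_specified_sender"
    # Full metadata: enough information to pull directly.
    return "direct_rdma_pull"


# Assumed example payloads; field values are made up for illustration.
full = {"host": "10.0.0.1", "port": 50000, "data_size": 0}
partial = {"host": "10.0.0.1", "port": 50016}

# A falsy check reads data_size == 0 as "missing"; a membership check does not.
falsy_says_missing = not full.get("data_size")
membership_says_missing = "data_size" not in full

path_for_full = select_get_path(full)
path_for_partial = select_get_path(partial)
```

With the membership test, only metadata that truly lacks the field is routed to the sender query, regardless of the field's value.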
Test Plan
Test Result
@hsliuustc0106 @princepride @yangsonglin13