Skip to content

[Omni Connector] Omni Transfer Engine Connector: Enable 1-receiver-to-N-senders to support Bagel TP/CFG parallel#2731

Merged
princepride merged 4 commits intovllm-project:mainfrom
natureofnature:pr1/fix_tp_cfg_parallel
Apr 14, 2026
Merged

[Omni Connector] Omni Transfer Engine Connector: Enable 1-receiver-to-N-senders to support Bagel TP/CFG parallel#2731
princepride merged 4 commits intovllm-project:mainfrom
natureofnature:pr1/fix_tp_cfg_parallel

Conversation

@natureofnature
Copy link
Copy Markdown
Contributor

@natureofnature natureofnature commented Apr 13, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

Refer to #2705, this is the first PR, which updates the connector. This PR aims to support TP+CFG parallel for AR/DIT disaggregation.

connector: support 1-receiver-to-N-senders, SHM metadata fallback & cleanup

  1. Mooncake Transfer Engine connector — multi-source receiving:

    • Add per-rank sender endpoint registry (_sender_endpoints) and
      update_sender_info(sender_rank=...) for N-sender registration
    • Add _resolve_sender_endpoint() for rank-based endpoint routing
    • Add get() Path 2: partial metadata (host/port only, no data_size)
      queries the specified sender then RDMA pulls — enables heterogeneous
      TP where one receiver pulls KV shards from multiple sender ranks
    • Fix data_size check from falsy to "data_size" not in metadata
  2. SHM connector — metadata fallback & lifecycle:

    • Add _get_by_key() fallback when metadata lacks SHM handles (e.g.
      RDMA-style metadata passed to SHM connector)
    • Track _pending_keys for cleanup(request_id) and close() lifecycle
  3. Other:

    • base.py: document metadata parameter semantics for heterogeneous TP
    • mooncake_store_connector: align with updated connector interface
    • initialization: add KV_RANK_PORT_STRIDE constant for per-rank ZMQ port
    • tests: add test_shm_connector covering key-based R/W, metadata fallback,
      heterogeneous TP multi-key, and cleanup/close

Test Plan

  1. Bagel RDMA
  2. Unit test
python3 -m pytest -q     tests/distributed/omni_connectors/test_shm_connector.py     tests/distributed/omni_connectors/test_basic_connectors.py     tests/distributed/omni_connectors/test_adapter_and_flow.py

Test Result

  1. Bagel t2i/i2i tests
prompt Baseline AR/DIT shm AR/DIT RDMA
A cute cat wearing sunglasses iter_0_slot0 iter_0_slot0 iter_0_slot0
Let the woman wear a white dress iter_0_white_slot0 iter_0_white_slot0 iter_0_white_slot0
  1. Unit test
Screenshot from 2026-04-13 17-41-13

@hsliuustc0106 @princepride @yangsonglin13


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@natureofnature natureofnature force-pushed the pr1/fix_tp_cfg_parallel branch from 8406d30 to 5181a92 Compare April 13, 2026 10:11
@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Apr 13, 2026
Comment thread vllm_omni/distributed/omni_connectors/connectors/shm_connector.py
Comment thread vllm_omni/distributed/omni_connectors/connectors/shm_connector.py
Comment on lines 79 to +81
serialized_data = self.serialize_obj(data)
key = self._make_key(put_key, from_stage, to_stage)
self.store.put(key, serialized_data, self.pin)
put_rc = self.store.put(key, serialized_data, self.pin)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the 1-receiver-to-N-senders pattern (partial metadata / per-rank endpoint) also extend to MooncakeStoreConnector? Right now only MooncakeTransferEngineConnector has the multi-sender routing, but the store connector still uses a single store.put(key, data) — curious if heterogeneous TP will need similar changes there.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MooncakeStoreConnector does not need to create p2p side channel peers, and it naturally supports this.

…leanup

Mooncake RDMA connector — multi-source receiving:
- Add per-rank sender endpoint registry (_sender_endpoints) and
  update_sender_info(sender_rank=...) for N-sender registration
- Add _resolve_sender_endpoint() for rank-based endpoint routing
- Add get() Path 2: partial metadata (host/port only, no data_size)
  queries the specified sender then RDMA pulls — enables heterogeneous
  TP where one receiver pulls KV shards from multiple sender ranks
- Extract _query_metadata_at() from _query_metadata_from_sender() to
  deduplicate ZMQ query logic (~53 lines saved)
- Fix data_size check from falsy to "data_size" not in metadata

SHM connector — metadata fallback & lifecycle:
- Add _get_by_key() fallback when metadata lacks SHM handles (e.g.
  RDMA-style metadata passed to SHM connector)
- Track _pending_keys for cleanup(request_id) and close() lifecycle

Other:
- base.py: document metadata parameter semantics for heterogeneous TP
- mooncake_store_connector: align with updated connector interface
- initialization: add KV_RANK_PORT_STRIDE constant for per-rank ZMQ port
- tests: add test_shm_connector covering key-based R/W, metadata fallback,
  heterogeneous TP multi-key, and cleanup/close

Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
@natureofnature natureofnature force-pushed the pr1/fix_tp_cfg_parallel branch 2 times, most recently from 1bd945f to d0c11ac Compare April 13, 2026 16:18
Signed-off-by: natureofnature <wzliu@connect.hku.hk>
@natureofnature natureofnature force-pushed the pr1/fix_tp_cfg_parallel branch from d0c11ac to 272e520 Compare April 13, 2026 17:24
@hsliuustc0106
Copy link
Copy Markdown
Collaborator

PR #2731 - [Omni Connector] 1-receiver-to-N-senders

OVERALL: NO BLOCKERS (check AMD CI)
VERDICT: COMMENT

Correctness: PASS
Reliability: PASS
Breaking: PASS
Tests: PASS (test results show samples, AMD CI fail)
Documentation: PASS (inline doc added)
Security: PASS

Summary: Omni Transfer Engine Connector: 1-receiver-to-N-senders for Bagel TP/CFG parallel. 422 add, 87 del. pre-commit+build pass, AMD CI failing. Test results show 16 samples generated. Please check AMD CI failure.

@natureofnature
Copy link
Copy Markdown
Contributor Author

PR #2731 - [Omni Connector] 1-receiver-to-N-senders

OVERALL: NO BLOCKERS (check AMD CI) VERDICT: COMMENT

Correctness: PASS Reliability: PASS Breaking: PASS Tests: PASS (test results show samples, AMD CI fail) Documentation: PASS (inline doc added) Security: PASS

Summary: Omni Transfer Engine Connector: 1-receiver-to-N-senders for Bagel TP/CFG parallel. 422 add, 87 del. pre-commit+build pass, AMD CI failing. Test results show 16 samples generated. Please check AMD CI failure.

It seems AMD CI failure is because of timeout and not related to this PR. @hsliuustc0106

@princepride
Copy link
Copy Markdown
Collaborator

@natureofnature I fixed single-stage t2i bug(add multiple <start_vision> and <end_vision>), can you compare them again?

@natureofnature
Copy link
Copy Markdown
Contributor Author

natureofnature commented Apr 14, 2026

@natureofnature I fixed single-stage t2i bug(add multiple <start_vision> and <end_vision>), can you compare them again?

I tried both online and offline mode (no thinking), but still different even without this PR's code. @princepride
Commit ID:dd1389173b4e2893d21cf742979c89ab0255a5d5

Mode Scenario Config File Startup Command Execution Command Image
Online baseline single-stage bagel_single_stage.yaml python3 -m vllm_omni.entrypoints.cli.main serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_single_stage.yaml python3 examples/online_serving/bagel/openai_chat_client.py --server http://127.0.0.1:8091 --prompt 'A cute cat' --modality text2img --height 512 --width 512 --steps 15 --seed 42 --output out/baseline_online.png baseline_online1
Online split SHM runtime-only 2-GPU variant bagel.yaml python3 -m vllm_omni.entrypoints.cli.main serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8092 --stage-configs-path vllm_omni/model_executor/stage_configs/bagel.yaml python3 examples/online_serving/bagel/openai_chat_client.py --server http://127.0.0.1:8092 --prompt 'A cute cat' --modality text2img --height 512 --width 512 --steps 15 --seed 42 --output /out/split_online.png split_online1

@princepride princepride merged commit 6d01a8b into vllm-project:main Apr 14, 2026
8 checks passed
y123456y78 pushed a commit to y123456y78/vllm-omni that referenced this pull request Apr 15, 2026
…-N-senders to support Bagel TP/CFG parallel (vllm-project#2731)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
…-N-senders to support Bagel TP/CFG parallel (vllm-project#2731)

Signed-off-by: natureofnature <wzliu@connect.hku.hk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants