add moriio transfer engine by inkcherry · Pull Request #1742 · vllm-project/vllm-omni

inkcherry · 2026-03-09T06:14:35Z

Purpose

Add MoriTransferEngineConnector — a new OmniConnector backend using Mori RDMA transfer engine for zero-copy data transfers between disaggregated pipeline stages. The implementation follows a similar architecture to MooncakeTransferEngineConnector, adapted to Mori's IOEngine / MemoryDesc / EngineDesc API.

RDMA data plane via Mori IOEngine.batch_write() with async TransferStatus tracking
ZMQ control plane with msgspec-encoded pull-request / query handshakes
Pinned-CPU or CUDA memory pool with first-fit allocator and TTL-based cleanup
Raw torch.Tensor / bytes fast path bypassing serialization
Graceful fallback when mori is not installed

Changes include the connector implementation, factory registration, module exports, design doc, and example stage config YAML.
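
As a rough illustration of the "first-fit allocator" mentioned in the list above, the following sketch manages offsets within a fixed-size pool. It is not the connector's code: the class name `FirstFitAllocator` and its methods are hypothetical, and the real pool operates on pinned-CPU or CUDA memory rather than plain integers.

```python
class FirstFitAllocator:
    """Toy first-fit allocator over a pool of `capacity` bytes (hypothetical sketch)."""

    def __init__(self, capacity: int):
        # Free list of (offset, size) regions, kept sorted by offset.
        self.free = [(0, capacity)]

    def alloc(self, size: int) -> int:
        """Return the offset of the first free region that fits `size` bytes."""
        if size <= 0:
            raise ValueError("size must be positive")
        for i, (off, length) in enumerate(self.free):
            if length >= size:
                if length == size:
                    self.free.pop(i)
                else:
                    # Carve the allocation off the front of the region.
                    self.free[i] = (off + size, length - size)
                return off
        raise MemoryError(f"no free region of {size} bytes")

    def release(self, offset: int, size: int) -> None:
        """Return a region to the pool and merge adjacent free regions."""
        self.free.append((offset, size))
        self.free.sort()
        merged = [self.free[0]]
        for off, length in self.free[1:]:
            last_off, last_len = merged[-1]
            if last_off + last_len == off:
                merged[-1] = (last_off, last_len + length)
            else:
                merged.append((off, length))
        self.free = merged
```

First-fit keeps allocation simple at the cost of possible fragmentation, which is one reason a TTL-based cleanup pass is useful alongside it.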

Test Plan

Hardware: 3x AMD Instinct MI300X nodes (8 GPUs each), Mellanox ConnectX-7 400Gbps RoCE NIC (mlx5_0).

3-node disaggregated serving with Qwen2.5-Omni-7B, each stage on a separate node with TP=2:

# Node 1 — Thinker (stage 0)
vllm serve Qwen/Qwen2.5-Omni-7B --omni \
    --port 8000 \
    --tensor-parallel-size 2 \
    --stage-configs-path qwen2_5_omni_mori.yaml \
    --stage-id 0 --log-stats \
    -oma <orchestrator_ip> -omp 9000

# Node 2 — Talker (stage 1)
vllm serve Qwen/Qwen2.5-Omni-7B --omni \
    --tensor-parallel-size 2 \
    --stage-configs-path qwen2_5_omni_mori.yaml \
    --stage-id 1 --headless --log-stats \
    -oma <orchestrator_ip> -omp 9000

# Node 3 — Code2Wav (stage 2)
vllm serve Qwen/Qwen2.5-Omni-7B --omni \
    --tensor-parallel-size 2 \
    --stage-configs-path qwen2_5_omni_mori.yaml \
    --stage-id 2 --headless --log-stats \
    -oma <orchestrator_ip> -omp 9000

Benchmark client:

python openai_chat_completion_client_for_multimodal_generation.py --query-type text --port 8000

Test Result

0→1 (Thinker → Talker)

| Field | Mooncake | Mori |
| --- | --- | --- |
| in_flight_time_ms | 1.389 | 0.829 |
| rx_decode_time_ms | 2.243 | 2.541 |
| size_kbytes | 1,611.713 | 1,611.713 |
| tx_time_ms | 0.914 | 0.926 |

1→2 (Talker → Code2Wav)

| Field | Mooncake | Mori |
| --- | --- | --- |
| in_flight_time_ms | 1.291 | 0.510 |
| rx_decode_time_ms | 1.381 | 1.983 |
| size_kbytes | 3.305 | 3.305 |
| tx_time_ms | 0.362 | 0.350 |

0→1 (Thinker → Talker)(intranode communication)

| Field | Mooncake | Mori |
| --- | --- | --- |
| in_flight_time_ms | 1.523 | 1.153 |
| rx_decode_time_ms | 2.453 | 1.887 |
| size_kbytes | 1,611.713 | 1,611.713 |
| tx_time_ms | 0.977 | 1.069 |

1→2 (Talker → Code2Wav)(intranode communication)

| Field | Mooncake | Mori |
| --- | --- | --- |
| in_flight_time_ms | 1.566 | 1.927 |
| rx_decode_time_ms | 1.186 | 1.327 |
| size_kbytes | 3.305 | 3.305 |
| tx_time_ms | 0.316 | 0.341 |

Mori shows lower in-flight latency than Mooncake in three of the four measurements (the exception is the intranode 1→2 edge, at 1.927 ms vs 1.566 ms), with comparable tx/rx performance. The correctness of the generated audio output was also verified manually by listening to the results. Mori provides a viable RDMA backend option for AMD devices.
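
To make the comparison concrete, the relative change in `in_flight_time_ms` can be computed directly from the four tables above. This is only arithmetic over the reported numbers; the dictionary keys and the helper name `relative_reduction` are labels chosen here for illustration.

```python
# (Mooncake, Mori) in_flight_time_ms values copied from the result tables.
in_flight_ms = {
    "0->1": (1.389, 0.829),
    "1->2": (1.291, 0.510),
    "0->1 intranode": (1.523, 1.153),
    "1->2 intranode": (1.566, 1.927),
}


def relative_reduction(mooncake_ms: float, mori_ms: float) -> float:
    """Percent by which Mori is lower than Mooncake (negative means Mori is higher)."""
    return (mooncake_ms - mori_ms) / mooncake_ms * 100.0


for edge, (mooncake_ms, mori_ms) in in_flight_ms.items():
    print(f"{edge}: {relative_reduction(mooncake_ms, mori_ms):+.1f}%")
```

A positive value means Mori's in-flight time is lower on that edge; a negative value means it is higher.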

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30142a6a0a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-09T06:19:58Z

+                [[src_offset]],
+                [remote_mem],
+                [[pull.dst_offset]],
+                [[pull.length]],


Use source buffer size for RDMA write length

The sender already reads src_size from its own buffer table, but the transfer length is taken from pull.length (receiver-provided metadata). If that metadata is stale or malformed, the sender can truncate a valid payload (when pull.length < src_size) or attempt to read beyond the stored buffer (when pull.length > src_size) while still treating the operation as a normal transfer path. The write length should be validated against src_size (or derived solely from src_size) before calling batch_write.
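
A minimal sketch of the check this review asks for, assuming the sender rejects any mismatch rather than clamping. The function name `checked_write_length` is hypothetical, and the connector's actual fix may differ in detail.

```python
def checked_write_length(requested_length: int, src_size: int) -> int:
    """Validate a receiver-provided length against the sender's own buffer size.

    Hypothetical helper: `requested_length` stands in for pull.length and
    `src_size` for the size recorded in the sender's buffer table.
    """
    if src_size <= 0:
        raise ValueError(f"invalid source size: {src_size}")
    if requested_length != src_size:
        # Smaller would silently truncate the payload; larger would read past
        # the stored buffer. Refuse both instead of issuing the RDMA write.
        raise ValueError(
            f"length mismatch: receiver asked for {requested_length} bytes, "
            f"sender buffer holds {src_size} bytes"
        )
    return src_size
```

The returned value would then be the only length passed to the write call, so receiver metadata can never widen or narrow the transfer.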


Signed-off-by: inkcherry <mingzhi.liu@amd.com>

Signed-off-by: zejwang <zejwang@amd.com>

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed7bde1f46


chatgpt-codex-connector · 2026-03-17T14:06:18Z

+        if not bound:
+            try:
+                sock.bind(f"tcp://{self.host}:*")
+                bound = True
+            except zmq.ZMQError as exc:


Fail fast instead of rebinding sender ZMQ to random port

When bind on self.zmq_port fails with EADDRINUSE, this code silently rebinds to an OS-assigned port, but the metadata-less receive path still resolves the sender using configured sender_host/sender_zmq_port (see get() -> _query_metadata_from_sender(), and call sites like kv_transfer_manager.receive_kv_cache_for_request and chunk_transfer_adapter._poll_single_request that call connector.get(..., metadata=None)). In that context, receivers keep querying the old configured port and time out indefinitely even though the sender is running, so this should fail fast or explicitly propagate the new port to receivers.
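
The fail-fast behaviour can be sketched with the standard-library `socket` module standing in for ZMQ (the connector itself uses a ZMQ socket, so this is an analogy rather than its code). The helper name `bind_or_fail` is hypothetical.

```python
import errno
import socket


def bind_or_fail(host: str, port: int) -> socket.socket:
    """Bind to exactly the configured port, or raise; never pick another port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            # Receivers resolve the sender via the configured port, so a silent
            # rebind elsewhere would leave them querying a dead endpoint.
            raise RuntimeError(
                f"port {port} on {host} is already in use; refusing to rebind"
            ) from exc
        raise
    return sock
```

Raising at startup turns a silent receiver-side timeout into an immediate, attributable configuration error.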


chatgpt-codex-connector · 2026-03-17T14:06:18Z

+            stale = [k for k, v in self._local_buffers.items() if now - v[5] > _BUFFER_TTL_SECONDS]
+            for k in stale:


Exclude in-flight buffers from TTL reclamation

TTL cleanup reclaims entries purely by age, but _handle_pull_request() only reads _local_buffers under lock and then performs batch_write(...); st.Wait() without marking that entry in-flight; if a receiver pulls a buffer near/after the 300s TTL, the listener thread can reclaim and release that allocator region concurrently while the RDMA write is still pending. This creates a race that can corrupt transfers or cause intermittent failures under delayed pulls, so stale-buffer GC needs an in-flight guard (or timestamp refresh) before transfer starts.
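
One way to picture the suggested in-flight guard is a buffer table that counts active transfers per entry and skips those entries during TTL reclamation. This is a sketch under assumptions: the class `BufferTable` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
import threading


class BufferTable:
    """Toy buffer registry with TTL reclamation that skips in-flight entries."""

    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        # key -> [created_at, in_flight_count]
        self._entries = {}

    def put(self, key: str, now: float) -> None:
        with self._lock:
            self._entries[key] = [now, 0]

    def begin_transfer(self, key: str) -> bool:
        """Mark the entry in-flight; returns False if it was already reclaimed."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return False
            entry[1] += 1
            return True

    def end_transfer(self, key: str) -> None:
        with self._lock:
            self._entries[key][1] -= 1

    def reclaim_stale(self, now: float) -> list:
        """Drop entries older than the TTL, except those with a pending transfer."""
        with self._lock:
            stale = [
                k
                for k, (created_at, in_flight) in self._entries.items()
                if in_flight == 0 and now - created_at > self._ttl
            ]
            for k in stale:
                del self._entries[k]
            return stale
```

In this shape the pull handler would call `begin_transfer` under the lock before issuing the write and `end_transfer` after the wait completes, so the cleanup pass cannot free a region that an RDMA write is still reading.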


hongxiayang · 2026-03-17T14:08:18Z

thanks @inkcherry for the PR.

tjtanaa · 2026-03-19T04:20:59Z

@inkcherry is MoRII beneficial for intranode communication? will you enable it for intranode communication? Because vLLM-Omni also has many use cases on single node deployment.

inkcherry · 2026-03-24T10:44:33Z

@inkcherry is MoRII beneficial for intranode communication? will you enable it for intranode communication? Because vLLM-Omni also has many use cases on single node deployment.

Hi, @tjtanaa, I've added intra-node benchmark results. Mori delivers correct outputs with competitive performance.

junkang1991 · 2026-03-26T06:15:57Z

Hi @inkcherry, thanks for the PR. We tried to test with Qwen/Qwen2.5-Omni-7B using the provided qwen2_5_omni_mori.yaml (for intra-node communication), but we couldn't confirm that Mori is actually being used. The serve log only shows:

Loaded OmniTransferConfig with 2 connector configurations

There's no mention of MoriTransferEngineConnector being initialized anywhere, so it's hard to tell if Mori is active or if something is silently falling back.

Would you be able to share the reproduction steps you used to verify this?

inkcherry · 2026-03-26T11:04:12Z

Hi @junkang1991, thanks for trying it out. During the initialization phase, I can see a log showing that XGMI is automatically selected as the Mori backend.

I have added logs wrapping st.Wait() in mori_transfer_engine_connector.py. You can also use this method to check transfer details.

            # Excerpt from the pull handler; `_time_mod` is presumably the `time` module under an alias.
            success = True
            for i, st in enumerate(statuses):
                # Log the status before blocking so a stuck transfer is visible.
                logger.info(
                    "[MORI TRANSFER] request=%s: TransferStatus[%d] pre-Wait "
                    "code=%s init=%s in_progress=%s",
                    pull.request_id,
                    i,
                    st.Code(),
                    st.Init(),
                    st.InProgress(),
                )
                # Time the blocking wait for this transfer.
                _wait_t0 = _time_mod.perf_counter()
                st.Wait()
                _wait_ms = (_time_mod.perf_counter() - _wait_t0) * 1000
                # Log the outcome and the wait duration in milliseconds.
                logger.info(
                    "[MORI TRANSFER] request=%s: TransferStatus[%d] post-Wait "
                    "wait_ms=%.3f succeeded=%s failed=%s code=%s msg=%r",
                    pull.request_id,
                    i,
                    _wait_ms,
                    st.Succeeded(),
                    st.Failed(),
                    st.Code(),
                    st.Message(),
                )

If you still encounter issues, could you try updating Mori to a newer commit?

…node XGMI Signed-off-by: junkang1991 <junkangchow@gmail.com>

tjtanaa · 2026-04-01T09:00:35Z

@inkcherry @junkang1991 please fix the readthedocs and pre-commit error.

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

inkcherry · 2026-04-08T07:58:27Z

@tjtanaa, thanks for the reminder, fixed. cc @junkang1991

inkcherry · 2026-04-10T03:17:55Z

The refactor in #1908 removed multi-node deployment from the vllm-omni codebase; the current single-node deployment still works. I think we can move forward with the PR at this stage. I've removed the inter-node configuration script for now. Once the vllm-omni codebase supports inter-node deployment again, we will re-test and add it back.

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

tjtanaa · 2026-04-10T04:04:58Z

+
+## When to Use
+
+Best for high-performance multi-node data transfer using Mori RDMA transfer engine.


Can you update the description for now, stating the reason given in #1742 (comment), and mention that intra-node is currently supported?

yes, updated.

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

… moriio

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

tjtanaa

LGTM.

@hsliuustc0106 PTAL, thank you.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

Upstream marge conflicts fix

inkcherry · 2026-04-20T08:17:31Z

@hsliuustc0106 could you please take a look? many thanks!

gcanlin · 2026-04-20T09:01:13Z

@@ -0,0 +1,138 @@
+  # Qwen2.5-Omni single-node config using Mori XGMI transfer (TP2).


After #2383, we no longer add new stage configs. Please follow the new approach.

Hi, this is fixed in a4c2083

gcanlin · 2026-04-20T09:09:10Z

@inkcherry Could you submit a topic about this PR in the sync meeting? If you're at UTC+8 Timezone, you could submit it here: tinyurl.com/vllm-omni-meeting.

gcanlin · 2026-04-20T09:09:27Z

cc @natureofnature @hsliuustc0106

inkcherry · 2026-04-21T12:36:51Z

@inkcherry Could you submit a topic about this PR in the sync meeting? If you're at UTC+8 Timezone, you could submit it here: tinyurl.com/vllm-omni-meeting.

Thanks @gcanlin for the invitation, happy to join! Let me first address the pending review comments. Once the PR is ready, I'll submit the topic via tinyurl.com/vllm-omni-meeting and join an upcoming sync — hopefully next week or soon after. Appreciate it!

After vllm-project#2383, new per-model stage configs are not introduced. Replace vllm_omni/model_executor/stage_configs/qwen2_5_omni_mori_intranode.yaml with vllm_omni/deploy/qwen2_5_omni_mori_intranode.yaml using the new top-level connectors/stages schema, and update the design doc to use --deploy-config.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

Align MoriTransferEngineConnector with SharedMemoryConnector and every other OmniConnector implementation by reading ``stage_id`` from the config dict in __init__.

The value is injected by arg_utils.py into the connector config on the chunk_transfer_adapter path and is later read back by OmniChunkTransferAdapter (process_pending_chunks / _send_single_request / _poll_single_request) to decide whether the connector sits at the sender or receiver end of a stage edge.

Without this attribute Mori cannot be used as a chunk-mode connector at all: the first decode step crashes in chunk_transfer_adapter.py with ``AttributeError: 'MoriTransferEngineConnector' object has no attribute 'stage_id'`` the moment the scheduler inspects self.connector.stage_id.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Made-with: Cursor

… yaml

After junkang1991's entrypoints -> engine hook (e65275a) was reverted earlier in this branch (9cc7672), the only live transfer path for MoriTransferEngineConnector is the async_chunk=true chunk_transfer_adapter route. qwen2_5_omni_mori_intranode.yaml has async_chunk=false and Qwen2.5-Omni's pipeline does not yet define thinker2talker_async_chunk / talker2code2wav_async_chunk input processors, so its Mori connector entries are silently ignored by the engine: verified by loading the yaml via load_omni_transfer_config and observing that get_stage_connector_spec returns {} for every stage when async_chunk=false (stage_init_utils.py:407 gate), which means no MoriTransferEngineConnector is ever instantiated and stage-to-stage transfer falls back to the default in-memory set_engine_outputs / process_engine_inputs path.

Shipping a deploy yaml that claims Mori/XGMI but in practice uses none of it is user-confusing, so drop it rather than leave it as a dead pointer. Qwen3-Omni-MoE (qwen3_omni_moe_mori_intranode.yaml, added in c749f4d) is the supported Mori deploy target in this PR and exercises the full chunk_transfer_adapter + Mori path end-to-end.

Also updates the Mori connector design doc to point at the Qwen3-Omni-MoE yaml and documents the Qwen2.5-Omni follow-up prerequisite (writing the async_chunk input processors) inline.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Made-with: Cursor

A middle stage of a chunk_transfer_adapter pipeline (e.g. the talker stage of Qwen3-Omni-MoE) is simultaneously the receiver of its incoming edge and the sender of its outgoing edge. The adapter keeps its historical "one connector per stage" model (same shape that SharedMemoryConnector has always used -- single instance, put()+get() from the same object), so middle-stage support for role-bound RDMA connectors has to come from the connector itself: a single instance capable of put and get simultaneously.

This commit adds a first-class ``role="dual"`` to both MoriTransferEngineConnector and MooncakeTransferEngineConnector. A dual connector:

* binds the ZMQ listener (as a sender would) so a downstream receiver can pull data from it via the existing MoriPullRequest / Mooncake ``put(...)``/``trans_done`` handshake;
* simultaneously populates ``sender_host`` / ``sender_zmq_port`` so that its own ``get()`` can query an upstream sender, exactly as a receiver-only instance would. Both fields already existed -- dual-role simply flips ``can_put=True`` while keeping the receiver-side handshake path live.

``self.role`` is recorded so boot-time log lines can distinguish sender / receiver / dual ("... DUAL ready (ZMQ listening on ...; upstream sender at ...)").

The contract with the framework is unchanged: connectors still only read their own role from ``config["role"]`` and never reverse-infer it from side-channels like ``stage_id``. The framework decides whether a stage should instantiate sender / receiver / dual based on topology (see companion commit updating ``get_connectors_config_for_stage``).

Existing sender-only / receiver-only behaviour is bit-for-bit unchanged; the only accepted-role set widens from {sender, receiver} to {sender, receiver, dual}.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Made-with: Cursor
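
The three roles described in this commit message can be summarized as a small capability table. The sketch below is illustrative only: the function `role_capabilities` and the key names `binds_listener`, `can_put`, and `can_get` are labels chosen here (the message itself only names `can_put`), not the connector's attributes.

```python
ACCEPTED_ROLES = frozenset({"sender", "receiver", "dual"})


def role_capabilities(role: str) -> dict:
    """Map a configured role to what the connector instance is expected to do."""
    if role not in ACCEPTED_ROLES:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(ACCEPTED_ROLES)}")
    serves_downstream = role in ("sender", "dual")
    pulls_upstream = role in ("receiver", "dual")
    return {
        # A listener is bound so that a downstream receiver can pull data.
        "binds_listener": serves_downstream,
        "can_put": serves_downstream,
        # An upstream sender endpoint is needed so that get() can query it.
        "can_get": pulls_upstream,
    }
```

A dual instance is simply the union of the sender and receiver rows, which is why the existing single-role behaviour does not need to change.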

…ection

Extends ``get_connectors_config_for_stage`` in ``distributed/omni_connectors/utils/initialization.py`` to:

1. Infer per-stage role from pipeline topology:

   * stage with only outgoing edges -> ``role="sender"``
   * stage with only incoming edges -> ``role="receiver"``
   * stage with both incoming and outgoing edges -> ``role="dual"`` (the new role added to Mori / Mooncake in the companion commit)

   Dual stages emit matching ``from_stage_*`` and ``to_stage_*`` entries that share the *same* composite extra (identical role, identical ``zmq_port`` / ``sender_host`` / ``sender_zmq_port``), so engine-side flattening (``get_stage_connector_spec`` in ``engine/stage_init_utils.py`` returning the first spec it sees -- untouched) always recovers a self-consistent per-stage config. That lets ``OmniChunkTransferAdapter`` keep its historical one-connector-per-stage model (same shape SharedMemoryConnector always used) while still serving middle-stage put+get through a single role-bound RDMA connector.

2. Inject per-edge ZMQ endpoints for role-bound connectors listed in the new ``_ROLE_BOUND_ZMQ_CONNECTORS`` frozenset (``MoriTransferEngineConnector`` and ``MooncakeTransferEngineConnector``):

   * ``zmq_port = base + own_stage_id`` (sender / dual): offsetting the listener by stage id prevents three co-located stage listeners on a single intranode node from all binding the same base port.
   * ``sender_zmq_port = base + upstream_stage_id`` (receiver / dual): points at the upstream sender's listener.
   * ``sender_host``: framework-detected local IP (via new ``_detect_local_ip``, mirroring the connector-side ``_get_local_ip``) when yaml uses ``host: "auto"``; the explicit yaml ``host`` value otherwise.

   Non-role-bound connectors (SharedMemory, Yuanrong, ...) are not touched by the endpoint injection. Explicit yaml ``sender_host`` / ``sender_zmq_port`` always win, so cross-node deployments that know their peer endpoint are unaffected.

3. Orchestrator-level ``create_connectors_from_config`` is untouched: it has its own Mooncake-specific port adjustment for KV transfer + PD disaggregation, and its callers (``build_stage_connectors`` etc.) only filter ``from_stage_*`` keys, so orchestrator instantiation behaviour is bit-for-bit unchanged.

Design rationale

An earlier iteration tried to own the endpoint derivation inside ``MoriTransferEngineConnector.__init__`` using a ``stage_id >= 0`` side-channel, and later tried to fix the middle-stage put+get problem by splitting each stage into two connector instances (``sender_connector`` + ``receiver_connector``) with a corresponding ``{name, extra, input, output}`` schema that had to travel through ``engine/stage_init_utils.get_stage_connector_spec`` and ``engine/arg_utils.OmniEngineArgs.create_model_config``. Both approaches imposed ripple-effect changes outside the ``distributed/omni_connectors`` tree -- either the connector had to reverse-infer its deployment mode, or public engine files had to learn a new connector-config schema.

Moving role inference and endpoint computation to the framework's topology layer avoids both: connectors stay mechanical and read their role verbatim from config, and engine/ needs zero changes -- the schema ``get_stage_connector_spec`` returns is exactly the same ``{name, extra}`` it returned before this PR, with the ``extra`` content pre-populated.

Tests (``test_omni_connector_configs.py``)

Twelve new parametrized-over-{Mori,Mooncake} cases:

* stage-0 sender-only: only ``to_stage_*``, role=sender, zmq_port = base + 0;
* final-stage receiver-only: only ``from_stage_*``, role=receiver, sender endpoints resolved;
* middle-stage dual: both directions present, both entries share identical composite extra with role=dual, both own-stage listener port and upstream sender endpoint populated;
* SharedMemoryConnector is not touched by endpoint injection;
* explicit yaml ``sender_host`` / ``sender_zmq_port`` override the derivation;
* non-auto ``host`` cascades into ``sender_host``;
* ``_inject_chunk_path_endpoints`` is a no-op for unknown connector types and non-integer stage ids.

Full 301-test suite (``tests/engine/``, ``tests/distributed/omni_connectors/``, ``tests/core/sched/``, ``tests/config/``) passes, including all four Mooncake orchestrator path tests that exercise ``create_connectors_from_config``.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Made-with: Cursor
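
The role inference and port arithmetic described in this commit message can be sketched as follows. This is a simplified illustration, not the code in ``initialization.py``: the function names `infer_role` and `derive_endpoints`, the edge-list representation, and the base port value are all assumptions, and each stage is assumed to have at most one upstream stage.

```python
def infer_role(stage_id: int, edges: list) -> str:
    """Infer a stage's role from (src_stage, dst_stage) edges of the pipeline."""
    has_out = any(src == stage_id for src, _ in edges)
    has_in = any(dst == stage_id for _, dst in edges)
    if has_in and has_out:
        return "dual"
    if has_out:
        return "sender"
    if has_in:
        return "receiver"
    raise ValueError(f"stage {stage_id} has no edges")


def derive_endpoints(stage_id: int, edges: list, base_port: int) -> dict:
    """Compute the role and ZMQ ports for one stage from topology alone."""
    role = infer_role(stage_id, edges)
    extra = {"role": role}
    if role in ("sender", "dual"):
        # Own listener: offset by stage id so co-located stages do not collide.
        extra["zmq_port"] = base_port + stage_id
    if role in ("receiver", "dual"):
        # Upstream listener: offset by the upstream stage's id.
        upstream = next(src for src, dst in edges if dst == stage_id)
        extra["sender_zmq_port"] = base_port + upstream
    return extra


# Three-stage chain, as in thinker -> talker -> code2wav; 50000 is an arbitrary base.
PIPELINE_EDGES = [(0, 1), (1, 2)]
for stage in range(3):
    print(stage, derive_endpoints(stage, PIPELINE_EDGES, base_port=50000))
```

Note that a receiver's ``sender_zmq_port`` always equals its upstream stage's ``zmq_port``, which is what lets each edge connect without any per-stage port configuration in the yaml.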

Parallel to 7d7aeb0 for MoriTransferEngineConnector.

OmniChunkTransferAdapter.process_pending_chunks and the AR / generation schedulers branch on ``self.connector.stage_id == 0`` to gate "is this stage 0 of the pipeline". SharedMemoryConnector and MoriTransferEngineConnector both already expose ``self.stage_id``; MooncakeTransferEngineConnector did not, so deploying Mooncake on the chunk_transfer_adapter path (e.g. ``qwen3_omni_moe_mooncake_intranode.yaml``) aborted the scheduler with an AttributeError on the first schedule() tick.

The ``stage_id`` key is already injected into connector extras by ``OmniEngineArgs.create_model_config`` (via ``get_connectors_config_for_stage``); Mooncake now reads it like its siblings.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Made-with: Cursor

Sibling of qwen3_omni_moe_mori_intranode.yaml for Mori-vs-Mooncake benchmark comparison on a single AMD Instinct MI300X node. Pipeline topology, sampling params, and chunk-transfer knobs are kept identical to the Mori yaml; only the connector definition changes.

Verified end-to-end on the chunk_transfer_adapter path with the dual-role Mooncake connector from commit 6f3ecf4: stage boot log shows Stage 0 SENDER / Stage 1 DUAL / Stage 2 RECEIVER identically to the Mori topology, and three consecutive text-mode chat completions through the full thinker -> talker -> code2wav pipeline produce valid audio in 18.7 / 16.8 / 16.4 s warm-state with 93 RDMA GETs across three runs (per-chunk p50 = 0.7 ms on edge 0 -> 1, 0.9 ms on edge 1 -> 2; peak 398 MB/s on the thinker hidden-state edge).

Transport note: MooncakeTransferEngineConnector currently forwards ``protocol`` to mooncake.engine.TransferEngine.initialize, and the mooncake build shipped in the current container image still installs ``type=rdma`` transport regardless of the string passed. Intranode MI300X transfers therefore flow over Mellanox RoCE/IB with memory_pool_device: cuda + GPUDirect RDMA rather than true XGMI Infinity Fabric -- whenever Mooncake lands a first-class XGMI transport this yaml just has to flip ``protocol`` to ``"xgmi"``.

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Made-with: Cursor

Clarifies how to opt this deploy yaml into true AMD Infinity Fabric XGMI via the HIP transport landed in Mooncake PRs vllm-project#1742 and vllm-project#1550, and pins the default value to ``protocol: "rdma"``, the path that is actually end-to-end validated on this branch (three consecutive text completions at 18.7 / 16.8 / 16.4 s with 93 per-chunk RDMA GETs).

XGMI opt-in requires three things in lock step and is therefore not the default:

1. A mooncake wheel rebuilt with ``-DUSE_HIP=ON`` and reinstalled into the container's Python env (stock wheels don't ship HIP transport).
2. ``memory_pool_device: "cuda"`` so Mooncake picks the HIP allocator.
3. ``MC_FORCE_MNNVL=1`` in the launch environment so ``TransferEngineImpl::init()`` installs the HIP transport instead of defaulting back to RDMA whenever the node has HCAs (the current auto-topology path treats RDMA as the preferred fabric, even when HIP is compiled).

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>
Made-with: Cursor

remove changes on public files

Pure cosmetic reformatting requested by ruff-format:

- mooncake_transfer_engine_connector.py: collapse a 2-line f-string into one line.
- mori_transfer_engine_connector.py: same.
- utils/initialization.py: drop redundant indentation in the `_ROLE_BOUND_ZMQ_CONNECTORS` frozenset literal and collapse two list comprehensions in `get_connectors_config_for_stage` to a single line.

Verified locally with `pre-commit run --all-files` (all hooks pass).

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

- P1 length: validate pull.length against src_size before batch_write (prevents silent truncation / cross-allocation OOB read).
- P2 port: drop EADDRINUSE silent fallback; match Mooncake's fail-fast.
- P2 TTL: TODO mirroring wzliu's note; same race deferred to follow-up.

Co-authored-by: Cursor <cursoragent@cursor.com>

knitcapcat-amd · 2026-05-12T08:13:29Z

@inkcherry Could you submit a topic about this PR in the sync meeting? If you're at UTC+8 Timezone, you could submit it here: tinyurl.com/vllm-omni-meeting.

Hi @gcanlin, we plan to share this PR this Wednesday. Does this meeting require any registration before the sharing session?

inkcherry requested a review from hsliuustc0106 as a code owner March 9, 2026 06:14

inkcherry marked this pull request as draft March 9, 2026 06:15

chatgpt-codex-connector Bot reviewed Mar 9, 2026


inkcherry changed the title ~~add moriio backend~~ add moriio transfer engine Mar 11, 2026

inkcherry force-pushed the moriio branch from e79a51f to d87dddc Compare March 17, 2026 05:30

inkcherry and others added 3 commits March 17, 2026 10:28

add mori transfer engine

8655ca0

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

format

8171308

Signed-off-by: zejwang <zejwang@amd.com>

update

ed7bde1

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

inkcherry force-pushed the moriio branch from d87dddc to ed7bde1 Compare March 17, 2026 13:45

inkcherry marked this pull request as ready for review March 17, 2026 13:59

chatgpt-codex-connector Bot reviewed Mar 17, 2026


Merge branch 'main' into moriio

075ee48

fix(mori): fix MoriTransferEngineConnector for vllm-omni 0.18.0 intra…

e65275a

…node XGMI Signed-off-by: junkang1991 <junkangchow@gmail.com>

tjtanaa mentioned this pull request Apr 1, 2026

[RFC]: vLLM-Omni ROCm 2026 Q2 Roadmap #2413

Open

1 task

inkcherry force-pushed the moriio branch from d890ee5 to e65275a Compare April 8, 2026 07:09

inkcherry added 2 commits April 8, 2026 07:45

Merge remote-tracking branch 'upstream/main' into mori_0408

02d9b69

format

3515f2d

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

remove unused config

375a1e4

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

tjtanaa reviewed Apr 10, 2026


inkcherry added 3 commits April 10, 2026 05:30

update doc

7d3d4bc

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

update doc

58a74bc

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

221c187

… moriio

inkcherry and others added 2 commits April 10, 2026 08:14

update

b538a3a

Signed-off-by: inkcherry <mingzhi.liu@amd.com>

Merge branch 'main' into moriio

90dded5

tjtanaa approved these changes Apr 10, 2026


Merge upstream/main into moriio-fix

b2331a1

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

knitcapcat-amd mentioned this pull request Apr 20, 2026

Merge upstream/main into moriio branch inkcherry/vllm-omni#1

Closed

inkcherry and others added 2 commits April 20, 2026 16:16

Merge pull request #2 from knitcapcat-amd/moriio-fix

87f0518

Upstream marge conflicts fix

Merge branch 'main' into moriio

9ad895b

gcanlin reviewed Apr 20, 2026


knitcapcat and others added 15 commits April 28, 2026 01:52

[bugfix] fix the hard-coded backend type

dab541a

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

revert changes on public files- like 'engine'

9cc7672

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

add qwen3_omni_moe mori schema

c749f4d

Signed-off-by: Zejian Wang <zejianwang@sjtu.edu.cn>

Merge pull request #4 from knitcapcat-amd/moriio-stage-cfg-fix

750d137

remove changes on public files

Merge branch 'main' into moriio

2fa4220
