Fix engine_id collision + MoRIIO robustness for multi-node disagg DP by raviguptaamd · Pull Request #39276 · vllm-project/vllm

raviguptaamd · 2026-04-08T07:49:32Z

Summary

1. Engine_id collision fix (`core.py`, `utils.py`)

Use global dp_rank instead of local_dp_rank when generating the engine_id suffix for KV transfer in data-parallel deployments. In multi-node DP with --headless child nodes, using local_dp_rank causes master and child nodes to produce identical engine_ids, routing all data to whichever engine registered first.

2. MoRIIOConnector fixes (`moriio_connector.py`, `moriio_common.py`)

Fix A -- Cache kv_transfer_params across scheduler steps: Snapshot into _req_kv_params dict at allocation time; use cached copy in build_connector_meta. Handles chunked prefill correctly: params are retained for requests moved to _reqs_need_pending_save and only cleaned up when the final chunk is processed.

Fix B -- Handle attention backend block_size overrides: FlashMLA silently overrides block_size to 64 for MLA models. Trust the actual tensor shape instead of asserting against the CLI config value.

Fix C -- Default values for MORI-IO-specific ports: remote_handshake_port and remote_notify_port now use .get() with MoRIIOConstants defaults, preventing KeyError from external coordinators (e.g., NIXL sidecar in LLM-D).

Fix D -- Multi-node DP sizing and ranking: Use data_parallel_size_local and local dp_rank (global % local_size) so port offsets stay in 0..N range per node. Applied to both MoRIIOConfig.from_vllm_config and MoRIIOConnectorScheduler.

Fix E -- Prevent child nodes from sending block notifications: Added _is_kv_master flag to guard send_notify_block calls.

3. MoRIIO robustness fixes (`moriio_engine.py`, `moriio_connector.py`)

Fix F -- Timeout-guarded transfer completion: Replace blocking status.Wait() with a polling loop bounded by VLLM_MORIIO_TRANSFER_TIMEOUT_S (default 120s).

Fix G -- Failure detection in _pop_done_transfers: Check Failed() status and implement age-based timeout for pending RDMA reads.

Fix H -- Deferred task timeout: Expire write tasks stuck waiting for remote block allocation after VLLM_MORIIO_DEFERRED_TIMEOUT_S (default 300s).

Fix I -- ZMQ notification retry: Non-blocking send with 3 retries and exponential backoff for transient ZMQ failures.

Environment Variables

Variable	Default	Description
`VLLM_MORIIO_TRANSFER_TIMEOUT_S`	120	Max seconds to wait for a single RDMA transfer
`VLLM_MORIIO_DEFERRED_TIMEOUT_S`	300	Max seconds a deferred write task waits for remote block allocation

Files Changed

File	Lines	Change
`core.py`	+4/-3	engine_id: local_dp_rank -> dp_rank
`utils.py`	+4/-3	engine_id: local_index -> index
`moriio_common.py`	+11/-4	Fix C (default ports), Fix D (local dp_size/dp_rank)
`moriio_connector.py`	+71/-17	Fix A-E + Fix G (failure detection + timeout)
`moriio_engine.py`	+60/-21	Fix F (transfer timeout), Fix H (deferred timeout), Fix I (ZMQ retry)

Signed-off-by: Rav Gupta ravi.gupta@amd.com

gemini-code-assist

Code Review

This pull request ensures that engine IDs in the kv_transfer_config are globally unique across multi-node data parallel deployments by switching from local ranks to global ranks (dp_rank and index). I have no feedback to provide.

…Ionic AINIC Extends the Ionic AINIC RDMA support to multi-node disaggregated inference with 2 Prefill + 2 Decode nodes (DP=16). Key changes: - Remove 1P/1D restriction from run_xPyD_models.slurm and vllm_disagg_mori_ep.sh to allow xP>1 / yD>1 topologies - Add --ulimit memlock=-1:-1 to podman for large RDMA memory registrations (>32GB) required by MoRI IO - Pass NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL, NCCL_CROSS_NIC, and MORI_SOCKET_IFNAME into containers for proper multi-node RCCL and MoRI bootstrap over Ionic AINICs - Add apply_moriio_2pd_patches.sh for runtime vLLM patches (PR vllm-project/vllm#39276) fixing engine_id collisions and MoRIIO robustness in multi-node DP configurations - Restrict --kv-transfer-config to master nodes only (child nodes join via --headless and participate in EP all-to-all) - Add launch_mori_2p2d.sh example launcher for 2P/2D benchmarks Tested on AAC MI355X cluster with Ionic RDMA NICs achieving balanced RDMA traffic across all 4 nodes and 1,344 tok/s total throughput on DeepSeek-V3-5layer. Made-with: Cursor

… graph decode support (#150) * [DistInf] Enable multi-node MoRI EP disaggregated inference (2P/2D+) Enable multi-node xP/yD disaggregated prefill/decode inference with MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters. Key changes: - Remove 1P/1D restriction: support arbitrary xP/yD topologies - Fix 6: --kv-transfer-config only on master nodes; child nodes join via --headless (prevents spurious proxy registration from headless workers) - Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276 at container startup for engine_id collision fix + MoRIIO robustness - Add RDMA/NCCL/MoRI env var passthrough to Docker containers - Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races - Add --ulimit memlock=-1:-1 for large RDMA memory registrations - Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt) - Add stall detection with configurable per-step timeout in benchmark - Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h) Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located), warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime. Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker Depends on: vllm-project/vllm#39276 (applied at runtime) Made-with: Cursor * Add FULL_DECODE_ONLY CUDA graph mode support for decode nodes Decode nodes can now optionally use CUDA graphs via VLLM_CUDAGRAPH_MODE env var (e.g. FULL_DECODE_ONLY) while prefill nodes always run eager. This captures CUDA graphs for the autoregressive decode phase only, reducing per-token dispatch overhead while preserving eager flexibility for prefill and MoRI EP all-to-all. Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm Default behavior (no env var set) remains --enforce-eager on all nodes. Made-with: Cursor * Default MORI_SOCKET_IFNAME to eth0 for bootstrap communication MoRI v1.1.0+ requires a valid interface name for shmem bootstrap. Setting a sensible default prevents empty-string failures on OCI clusters where eth0 is the management NIC. Made-with: Cursor * Address Copilot review comments on PR #221 - Fix set -e abort in apply_moriio_2pd_patches.sh: move python3 fallback out of for-loop word expansion to prevent script abort when vLLM is not importable - Fail fast on patch failure for multi-node DP (xP>1 or yD>1): patches are mandatory for multi-node, optional for 1P/1D - Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0] instead of $? to capture timeout's exit code through the pipe - Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to Docker container for Default/DeepEP mode compatibility - Revert barrier port cleanup to hardcoded defaults (5000, 2222, 15000) to stay aligned with in-container scripts Made-with: Cursor * fix(patches): fail fast on download/verification failures Patch download and verification failures now exit non-zero so that multi-node DP runs abort early instead of proceeding unpatched. Made-with: Cursor * restore default #SBATCH -N 2 for 1P/1D baseline Command-line sbatch -N overrides this for larger topologies (2P/2D, 4P/4D). Made-with: Cursor

…V transfer This PR bundles three independent fixes for the MoRIIO-based P/D disaggregation path (no overlap with open PRs vllm-project#40344, vllm-project#39276, vllm-project#32630, vllm-project#39835). 1. Proxy: handle max_completion_tokens for chat completions API. The OpenAI Chat Completions API uses max_completion_tokens; the toy proxy was only decrementing max_tokens, dropping the field on forward and causing the backend to use its default. When both fields are present, max_completion_tokens takes precedence per the OpenAI spec, so decrement that one first; fall back to max_tokens otherwise. 2. Proxy: add /health endpoint. Benchmark harnesses (e.g. vllm bench serve) perform health probes against the proxy before sending traffic. Without this endpoint the probe fails and the benchmark aborts. Add a minimal 200-OK responder that also reports the current instance counts. 3. Engine: defer write task when remote block_ids is None. When a prefill-side write task arrives before the scheduler has populated the remote block_ids (async handshake race), the previous code silently dropped the task. The fix is in _is_remote_ready: it now returns False when block_ids is still None, so the existing outer deferral path (in _write_worker_loop and _process_deferred_tasks) re-queues the task safely. The inner None-check inside _execute_write_task is removed; appending to self._deferred_tasks from inside the iteration in _process_deferred_tasks would have been clobbered by the still_deferred overwrite at end of loop. All three are narrow, independent fixes. They do not modify the scheduler's hot path or introduce new control flow. Verified to apply cleanly on upstream/main as of 2026-04-22. This change was developed with AI assistance; the author has reviewed and tested each hunk and is accountable for the behavior. Signed-off-by: Chaemin Lim <chaemin.lim@mangoboost.io>

mergify · 2026-05-23T08:05:28Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @raviguptaamd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-06-01T19:59:54Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @raviguptaamd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Two complementary fixes for multi-node data-parallel disaggregated inference with MoRIIO: 1. Engine_id collision fix (core.py, utils.py): Use global dp_rank instead of local_dp_rank when generating the engine_id suffix. In multi-node DP with --headless child nodes, each node spawns engines with local_dp_rank 0..N-1, causing collisions that route all KV transfers to the master node while child nodes sit idle. 2. MoRIIO robustness (moriio_engine.py, moriio_connector.py): Prevent permanent stalls from lost RDMA completions or transient ZMQ failures: a) Timeout-guarded transfer completion: Replace blocking status.Wait() with a polling loop bounded by VLLM_MORIIO_TRANSFER_TIMEOUT_S (default 120s). A lost CQE no longer blocks the write-worker thread forever. b) Failure detection in _pop_done_transfers: Check Failed() status and implement age-based timeout for pending RDMA reads. Timed-out or failed transfers are cleaned up instead of accumulating indefinitely. c) Deferred task timeout: Expire write tasks stuck waiting for remote block allocation after VLLM_MORIIO_DEFERRED_TIMEOUT_S (default 300s), preventing unbounded growth of the deferred queue when the decode side never allocates blocks. d) ZMQ notification retry: Use non-blocking send with 3 retries and exponential backoff to handle transient send buffer full conditions on the ZMQ DEALER socket. Signed-off-by: Rav Gupta <ravi.gupta@amd.com> Made-with: Cursor Signed-off-by: raviguptaamd <ravi.gupta@amd.com>

…INS env var PIECEWISE cudagraph capture on large models (e.g. DeepSeek-V3) takes ~7 minutes, exceeding the hardcoded 5-minute handshake timeout and causing DP workers to fail in multi-node disaggregated setups (4P/4D and beyond). Allow overriding via environment variable while preserving the 5-minute default. Made-with: Cursor Signed-off-by: raviguptaamd <ravi.gupta@amd.com>

Signed-off-by: raviguptaamd <ravi.gupta@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

raviguptaamd · 2026-06-02T06:51:03Z

Status update — 2026-06-01

Rebased onto today's main (266b9d9c6). All previous merge conflicts are now resolved.

New head: 55fccbe66
Commits: 3 (history preserved — original fix, HANDSHAKE env override, ruff-format compliance)
Net diff: 5 files, 181+/108−

Conflict resolutions during rebase (favored upstream's robustness in every case, then layered PR's new behaviors):

utils.py — pure addition (engine_id global suffix in the newer Ray-DP code path).
moriio_connector.py master-only notify — added PR's _is_kv_master guard; kept upstream's is_producer=False (correct per get_peer_zmq_from_request_id docstring: "consumer/decode needs the prefill's address").
moriio_connector.py _recving_transfers_callback_addr — kept upstream's new 3-tuple type (host, port, xfer_id); ADDED PR's _recving_transfers_start dict for timeout tracking.
moriio_connector.py _pop_done_transfers — kept upstream's robust error path (try/except + notify-on-failure so prefill frees blocks); ADDED PR's age-based timeout elif req_id in self._recving_transfers_start: branch.
moriio_engine.py _process_deferred_tasks — kept upstream's cleaner expiry (uses self._defer_timeout + task.enqueue_time); dropped PR's now-redundant _defer_time attribute + duplicate env-var read.
moriio_engine.py waiting_for_transfer_complete — kept upstream's deadline + error-aggregating TransferError raise (strictly more robust than PR's logging-only version).
moriio_engine.py import os — already addressed upstream; no-op.

CI: DCO passing. pre-run-check is blocked by vLLM's contributor gate (PR needs verified/ready label or 4+ merged PRs from author — currently 1). Could a maintainer please add ready so the rest of CI can run?

Marked as Draft pending the final CI green. Ready for re-review.

raviguptaamd requested a review from njhill as a code owner April 8, 2026 07:49

mergify Bot added v1 kv-connector labels Apr 8, 2026

gemini-code-assist Bot reviewed Apr 8, 2026

View reviewed changes

raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch 2 times, most recently from ccd0e1e to cd4cd7f Compare April 12, 2026 22:59

raviguptaamd requested review from ApostaC, NickLucche and orozery as code owners April 12, 2026 22:59

raviguptaamd changed the title ~~Fix NIXL engine_id collision in multi-node disaggregated DP~~ Fix engine_id collision + MoRIIO robustness for multi-node disagg DP Apr 12, 2026

raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch 2 times, most recently from df5016c to 429a464 Compare April 14, 2026 08:21

raviguptaamd requested a review from xuechendi as a code owner April 17, 2026 22:36

raviguptaamd mentioned this pull request Apr 21, 2026

[DistInf] Enable multi-node MoRI EP disaggregated inference with CUDA graph decode support ROCm/MAD#150

Merged

raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 49b1585 to 050c2d9 Compare April 30, 2026 23:23

This was referenced May 19, 2026

[Core][KV-transfer] MoRIIO multi-node TP prefill→decode dispatch #43063

Draft

[Bugfix][KV-transfer] MoRIIO add_new_req: tolerate short request_ids #43065

Draft

[Bugfix][KV-transfer] MoRIIO READ-mode completion notification ID #43066

Closed

mergify Bot added the needs-rebase label May 23, 2026

raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 050c2d9 to f1174f9 Compare June 1, 2026 19:52

mergify Bot removed the needs-rebase label Jun 1, 2026

raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from f1174f9 to 050c2d9 Compare June 1, 2026 19:56

mergify Bot added the needs-rebase label Jun 1, 2026

raviguptaamd added 2 commits June 1, 2026 20:18

raviguptaamd marked this pull request as draft June 1, 2026 20:31

raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 050c2d9 to 8591dfa Compare June 1, 2026 20:31

style: ruff format fixes for pre-commit compliance

55fccbe

Signed-off-by: raviguptaamd <ravi.gupta@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>

raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 8591dfa to 55fccbe Compare June 1, 2026 20:33

mergify Bot removed the needs-rebase label Jun 1, 2026

shikamd123 mentioned this pull request Jun 2, 2026

llmd+vllm+mori-ep(intra node wide-ep)+mori-io(write) for 1p1d with dp=ep=8 tp=1 #44355

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix engine_id collision + MoRIIO robustness for multi-node disagg DP#39276

Fix engine_id collision + MoRIIO robustness for multi-node disagg DP#39276
raviguptaamd wants to merge 3 commits into
vllm-project:mainfrom
raviguptaamd:fix/nixl-engine-id-collision-multinode-dp

raviguptaamd commented Apr 8, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

mergify Bot commented May 23, 2026

Uh oh!

mergify Bot commented Jun 1, 2026

Uh oh!

raviguptaamd commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

raviguptaamd commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. Engine_id collision fix (core.py, utils.py)

2. MoRIIOConnector fixes (moriio_connector.py, moriio_common.py)

3. MoRIIO robustness fixes (moriio_engine.py, moriio_connector.py)

Environment Variables

Files Changed

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

mergify Bot commented May 23, 2026

Uh oh!

mergify Bot commented Jun 1, 2026

Uh oh!

raviguptaamd commented Jun 2, 2026

Status update — 2026-06-01

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

raviguptaamd commented Apr 8, 2026 •

edited

Loading

1. Engine_id collision fix (`core.py`, `utils.py`)

2. MoRIIOConnector fixes (`moriio_connector.py`, `moriio_common.py`)

3. MoRIIO robustness fixes (`moriio_engine.py`, `moriio_connector.py`)