Skip to content

Fix engine_id collision + MoRIIO robustness for multi-node disagg DP#39276

Draft
raviguptaamd wants to merge 3 commits into
vllm-project:mainfrom
raviguptaamd:fix/nixl-engine-id-collision-multinode-dp
Draft

Fix engine_id collision + MoRIIO robustness for multi-node disagg DP#39276
raviguptaamd wants to merge 3 commits into
vllm-project:mainfrom
raviguptaamd:fix/nixl-engine-id-collision-multinode-dp

Conversation

@raviguptaamd
Copy link
Copy Markdown
Contributor

@raviguptaamd raviguptaamd commented Apr 8, 2026

Summary

1. Engine_id collision fix (core.py, utils.py)

Use global dp_rank instead of local_dp_rank when generating the engine_id suffix for KV transfer in data-parallel deployments. In multi-node DP with --headless child nodes, using local_dp_rank causes master and child nodes to produce identical engine_ids, routing all data to whichever engine registered first.

2. MoRIIOConnector fixes (moriio_connector.py, moriio_common.py)

Fix A -- Cache kv_transfer_params across scheduler steps: Snapshot into _req_kv_params dict at allocation time; use cached copy in build_connector_meta. Handles chunked prefill correctly: params are retained for requests moved to _reqs_need_pending_save and only cleaned up when the final chunk is processed.

Fix B -- Handle attention backend block_size overrides: FlashMLA silently overrides block_size to 64 for MLA models. Trust the actual tensor shape instead of asserting against the CLI config value.

Fix C -- Default values for MORI-IO-specific ports: remote_handshake_port and remote_notify_port now use .get() with MoRIIOConstants defaults, preventing KeyError from external coordinators (e.g., NIXL sidecar in LLM-D).

Fix D -- Multi-node DP sizing and ranking: Use data_parallel_size_local and local dp_rank (global % local_size) so port offsets stay in 0..N range per node. Applied to both MoRIIOConfig.from_vllm_config and MoRIIOConnectorScheduler.

Fix E -- Prevent child nodes from sending block notifications: Added _is_kv_master flag to guard send_notify_block calls.

3. MoRIIO robustness fixes (moriio_engine.py, moriio_connector.py)

Fix F -- Timeout-guarded transfer completion: Replace blocking status.Wait() with a polling loop bounded by VLLM_MORIIO_TRANSFER_TIMEOUT_S (default 120s).

Fix G -- Failure detection in _pop_done_transfers: Check Failed() status and implement age-based timeout for pending RDMA reads.

Fix H -- Deferred task timeout: Expire write tasks stuck waiting for remote block allocation after VLLM_MORIIO_DEFERRED_TIMEOUT_S (default 300s).

Fix I -- ZMQ notification retry: Non-blocking send with 3 retries and exponential backoff for transient ZMQ failures.

Environment Variables

Variable Default Description
VLLM_MORIIO_TRANSFER_TIMEOUT_S 120 Max seconds to wait for a single RDMA transfer
VLLM_MORIIO_DEFERRED_TIMEOUT_S 300 Max seconds a deferred write task waits for remote block allocation

Files Changed

File Lines Change
core.py +4/-3 engine_id: local_dp_rank -> dp_rank
utils.py +4/-3 engine_id: local_index -> index
moriio_common.py +11/-4 Fix C (default ports), Fix D (local dp_size/dp_rank)
moriio_connector.py +71/-17 Fix A-E + Fix G (failure detection + timeout)
moriio_engine.py +60/-21 Fix F (transfer timeout), Fix H (deferred timeout), Fix I (ZMQ retry)

Signed-off-by: Rav Gupta ravi.gupta@amd.com

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request ensures that engine IDs in the kv_transfer_config are globally unique across multi-node data parallel deployments by switching from local ranks to global ranks (dp_rank and index). I have no feedback to provide.

@raviguptaamd raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch 2 times, most recently from ccd0e1e to cd4cd7f Compare April 12, 2026 22:59
@raviguptaamd raviguptaamd changed the title Fix NIXL engine_id collision in multi-node disaggregated DP Fix engine_id collision + MoRIIO robustness for multi-node disagg DP Apr 12, 2026
@raviguptaamd raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch 2 times, most recently from df5016c to 429a464 Compare April 14, 2026 08:21
raviguptaamd pushed a commit to raviguptaamd/MAD that referenced this pull request Apr 16, 2026
…Ionic AINIC

Extends the Ionic AINIC RDMA support to multi-node disaggregated
inference with 2 Prefill + 2 Decode nodes (DP=16).

Key changes:
- Remove 1P/1D restriction from run_xPyD_models.slurm and
  vllm_disagg_mori_ep.sh to allow xP>1 / yD>1 topologies
- Add --ulimit memlock=-1:-1 to podman for large RDMA memory
  registrations (>32GB) required by MoRI IO
- Pass NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL,
  NCCL_CROSS_NIC, and MORI_SOCKET_IFNAME into containers for
  proper multi-node RCCL and MoRI bootstrap over Ionic AINICs
- Add apply_moriio_2pd_patches.sh for runtime vLLM patches
  (PR vllm-project/vllm#39276) fixing engine_id collisions and
  MoRIIO robustness in multi-node DP configurations
- Restrict --kv-transfer-config to master nodes only (child
  nodes join via --headless and participate in EP all-to-all)
- Add launch_mori_2p2d.sh example launcher for 2P/2D benchmarks

Tested on AAC MI355X cluster with Ionic RDMA NICs achieving
balanced RDMA traffic across all 4 nodes and 1,344 tok/s total
throughput on DeepSeek-V3-5layer.

Made-with: Cursor
@raviguptaamd raviguptaamd requested a review from xuechendi as a code owner April 17, 2026 22:36
gargrahul pushed a commit to ROCm/MAD that referenced this pull request Apr 24, 2026
… graph decode support (#150)

* [DistInf] Enable multi-node MoRI EP disaggregated inference (2P/2D+)

Enable multi-node xP/yD disaggregated prefill/decode inference with
MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: --kv-transfer-config only on master nodes; child nodes join
  via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276
  at container startup for engine_id collision fix + MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h)

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located),
warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.

Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker
Depends on: vllm-project/vllm#39276 (applied at runtime)

Made-with: Cursor

* Add FULL_DECODE_ONLY CUDA graph mode support for decode nodes

Decode nodes can now optionally use CUDA graphs via VLLM_CUDAGRAPH_MODE
env var (e.g. FULL_DECODE_ONLY) while prefill nodes always run eager.
This captures CUDA graphs for the autoregressive decode phase only,
reducing per-token dispatch overhead while preserving eager flexibility
for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.

Made-with: Cursor

* Default MORI_SOCKET_IFNAME to eth0 for bootstrap communication

MoRI v1.1.0+ requires a valid interface name for shmem bootstrap.
Setting a sensible default prevents empty-string failures on OCI
clusters where eth0 is the management NIC.

Made-with: Cursor

* Address Copilot review comments on PR #221

- Fix set -e abort in apply_moriio_2pd_patches.sh: move python3
  fallback out of for-loop word expansion to prevent script abort
  when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1):
  patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0]
  instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to
  Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222,
  15000) to stay aligned with in-container scripts

Made-with: Cursor

* fix(patches): fail fast on download/verification failures

Patch download and verification failures now exit non-zero so that
multi-node DP runs abort early instead of proceeding unpatched.

Made-with: Cursor

* restore default #SBATCH -N 2 for 1P/1D baseline

Command-line sbatch -N overrides this for larger topologies (2P/2D, 4P/4D).

Made-with: Cursor
jaeyoun98 pushed a commit to jaeyoun98/vllm that referenced this pull request Apr 30, 2026
…V transfer

This PR bundles three independent fixes for the MoRIIO-based P/D
disaggregation path (no overlap with open PRs vllm-project#40344, vllm-project#39276, vllm-project#32630,
vllm-project#39835).

1. Proxy: handle max_completion_tokens for chat completions API.
   The OpenAI Chat Completions API uses max_completion_tokens; the toy
   proxy was only decrementing max_tokens, dropping the field on
   forward and causing the backend to use its default. When both
   fields are present, max_completion_tokens takes precedence per the
   OpenAI spec, so decrement that one first; fall back to max_tokens
   otherwise.

2. Proxy: add /health endpoint.
   Benchmark harnesses (e.g. vllm bench serve) perform health probes
   against the proxy before sending traffic. Without this endpoint the
   probe fails and the benchmark aborts. Add a minimal 200-OK
   responder that also reports the current instance counts.

3. Engine: defer write task when remote block_ids is None.
   When a prefill-side write task arrives before the scheduler has
   populated the remote block_ids (async handshake race), the previous
   code silently dropped the task. The fix is in _is_remote_ready:
   it now returns False when block_ids is still None, so the existing
   outer deferral path (in _write_worker_loop and _process_deferred_tasks)
   re-queues the task safely. The inner None-check inside
   _execute_write_task is removed; appending to self._deferred_tasks
   from inside the iteration in _process_deferred_tasks would have
   been clobbered by the still_deferred overwrite at end of loop.

All three are narrow, independent fixes. They do not modify the
scheduler's hot path or introduce new control flow. Verified to apply
cleanly on upstream/main as of 2026-04-22.

This change was developed with AI assistance; the author has reviewed
and tested each hunk and is accountable for the behavior.

Signed-off-by: Chaemin Lim <chaemin.lim@mangoboost.io>
@raviguptaamd raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 49b1585 to 050c2d9 Compare April 30, 2026 23:23
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @raviguptaamd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
@raviguptaamd raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 050c2d9 to f1174f9 Compare June 1, 2026 19:52
@mergify mergify Bot removed the needs-rebase label Jun 1, 2026
@raviguptaamd raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from f1174f9 to 050c2d9 Compare June 1, 2026 19:56
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Jun 1, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @raviguptaamd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 1, 2026
Two complementary fixes for multi-node data-parallel disaggregated
inference with MoRIIO:

1. Engine_id collision fix (core.py, utils.py):
   Use global dp_rank instead of local_dp_rank when generating the
   engine_id suffix. In multi-node DP with --headless child nodes,
   each node spawns engines with local_dp_rank 0..N-1, causing
   collisions that route all KV transfers to the master node while
   child nodes sit idle.

2. MoRIIO robustness (moriio_engine.py, moriio_connector.py):
   Prevent permanent stalls from lost RDMA completions or transient
   ZMQ failures:

   a) Timeout-guarded transfer completion: Replace blocking
      status.Wait() with a polling loop bounded by
      VLLM_MORIIO_TRANSFER_TIMEOUT_S (default 120s). A lost CQE
      no longer blocks the write-worker thread forever.

   b) Failure detection in _pop_done_transfers: Check Failed()
      status and implement age-based timeout for pending RDMA
      reads. Timed-out or failed transfers are cleaned up instead
      of accumulating indefinitely.

   c) Deferred task timeout: Expire write tasks stuck waiting for
      remote block allocation after VLLM_MORIIO_DEFERRED_TIMEOUT_S
      (default 300s), preventing unbounded growth of the deferred
      queue when the decode side never allocates blocks.

   d) ZMQ notification retry: Use non-blocking send with 3 retries
      and exponential backoff to handle transient send buffer full
      conditions on the ZMQ DEALER socket.

Signed-off-by: Rav Gupta <ravi.gupta@amd.com>
Made-with: Cursor

Signed-off-by: raviguptaamd <ravi.gupta@amd.com>
…INS env var

PIECEWISE cudagraph capture on large models (e.g. DeepSeek-V3) takes ~7 minutes,
exceeding the hardcoded 5-minute handshake timeout and causing DP workers to fail
in multi-node disaggregated setups (4P/4D and beyond). Allow overriding via
environment variable while preserving the 5-minute default.

Made-with: Cursor

Signed-off-by: raviguptaamd <ravi.gupta@amd.com>
@raviguptaamd raviguptaamd marked this pull request as draft June 1, 2026 20:31
@raviguptaamd raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 050c2d9 to 8591dfa Compare June 1, 2026 20:31
Signed-off-by: raviguptaamd <ravi.gupta@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd raviguptaamd force-pushed the fix/nixl-engine-id-collision-multinode-dp branch from 8591dfa to 55fccbe Compare June 1, 2026 20:33
@mergify mergify Bot removed the needs-rebase label Jun 1, 2026
@raviguptaamd
Copy link
Copy Markdown
Contributor Author

Status update — 2026-06-01

Rebased onto today's main (266b9d9c6). All previous merge conflicts are now resolved.

New head: 55fccbe66
Commits: 3 (history preserved — original fix, HANDSHAKE env override, ruff-format compliance)
Net diff: 5 files, 181+/108−

Conflict resolutions during rebase (favored upstream's robustness in every case, then layered PR's new behaviors):

  1. utils.py — pure addition (engine_id global suffix in the newer Ray-DP code path).
  2. moriio_connector.py master-only notify — added PR's _is_kv_master guard; kept upstream's is_producer=False (correct per get_peer_zmq_from_request_id docstring: "consumer/decode needs the prefill's address").
  3. moriio_connector.py _recving_transfers_callback_addr — kept upstream's new 3-tuple type (host, port, xfer_id); ADDED PR's _recving_transfers_start dict for timeout tracking.
  4. moriio_connector.py _pop_done_transfers — kept upstream's robust error path (try/except + notify-on-failure so prefill frees blocks); ADDED PR's age-based timeout elif req_id in self._recving_transfers_start: branch.
  5. moriio_engine.py _process_deferred_tasks — kept upstream's cleaner expiry (uses self._defer_timeout + task.enqueue_time); dropped PR's now-redundant _defer_time attribute + duplicate env-var read.
  6. moriio_engine.py waiting_for_transfer_complete — kept upstream's deadline + error-aggregating TransferError raise (strictly more robust than PR's logging-only version).
  7. moriio_engine.py import os — already addressed upstream; no-op.

CI: DCO passing. pre-run-check is blocked by vLLM's contributor gate (PR needs verified/ready label or 4+ merged PRs from author — currently 1). Could a maintainer please add ready so the rest of CI can run?

Marked as Draft pending the final CI green. Ready for re-review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant