Fix engine_id collision + MoRIIO robustness for multi-node disagg DP#39276
Fix engine_id collision + MoRIIO robustness for multi-node disagg DP#39276raviguptaamd wants to merge 3 commits into
Conversation
ccd0e1e to
cd4cd7f
Compare
df5016c to
429a464
Compare
…Ionic AINIC Extends the Ionic AINIC RDMA support to multi-node disaggregated inference with 2 Prefill + 2 Decode nodes (DP=16). Key changes: - Remove 1P/1D restriction from run_xPyD_models.slurm and vllm_disagg_mori_ep.sh to allow xP>1 / yD>1 topologies - Add --ulimit memlock=-1:-1 to podman for large RDMA memory registrations (>32GB) required by MoRI IO - Pass NCCL_IB_HCA, NCCL_IB_GID_INDEX, NCCL_NET_GDR_LEVEL, NCCL_CROSS_NIC, and MORI_SOCKET_IFNAME into containers for proper multi-node RCCL and MoRI bootstrap over Ionic AINICs - Add apply_moriio_2pd_patches.sh for runtime vLLM patches (PR vllm-project/vllm#39276) fixing engine_id collisions and MoRIIO robustness in multi-node DP configurations - Restrict --kv-transfer-config to master nodes only (child nodes join via --headless and participate in EP all-to-all) - Add launch_mori_2p2d.sh example launcher for 2P/2D benchmarks Tested on AAC MI355X cluster with Ionic RDMA NICs achieving balanced RDMA traffic across all 4 nodes and 1,344 tok/s total throughput on DeepSeek-V3-5layer. Made-with: Cursor
… graph decode support (#150) * [DistInf] Enable multi-node MoRI EP disaggregated inference (2P/2D+) Enable multi-node xP/yD disaggregated prefill/decode inference with MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters. Key changes: - Remove 1P/1D restriction: support arbitrary xP/yD topologies - Fix 6: --kv-transfer-config only on master nodes; child nodes join via --headless (prevents spurious proxy registration from headless workers) - Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276 at container startup for engine_id collision fix + MoRIIO robustness - Add RDMA/NCCL/MoRI env var passthrough to Docker containers - Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races - Add --ulimit memlock=-1:-1 for large RDMA memory registrations - Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt) - Add stall detection with configurable per-step timeout in benchmark - Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h) Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located), warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime. Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker Depends on: vllm-project/vllm#39276 (applied at runtime) Made-with: Cursor * Add FULL_DECODE_ONLY CUDA graph mode support for decode nodes Decode nodes can now optionally use CUDA graphs via VLLM_CUDAGRAPH_MODE env var (e.g. FULL_DECODE_ONLY) while prefill nodes always run eager. This captures CUDA graphs for the autoregressive decode phase only, reducing per-token dispatch overhead while preserving eager flexibility for prefill and MoRI EP all-to-all. Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm Default behavior (no env var set) remains --enforce-eager on all nodes. Made-with: Cursor * Default MORI_SOCKET_IFNAME to eth0 for bootstrap communication MoRI v1.1.0+ requires a valid interface name for shmem bootstrap. Setting a sensible default prevents empty-string failures on OCI clusters where eth0 is the management NIC. Made-with: Cursor * Address Copilot review comments on PR #221 - Fix set -e abort in apply_moriio_2pd_patches.sh: move python3 fallback out of for-loop word expansion to prevent script abort when vLLM is not importable - Fail fast on patch failure for multi-node DP (xP>1 or yD>1): patches are mandatory for multi-node, optional for 1P/1D - Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0] instead of $? to capture timeout's exit code through the pipe - Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to Docker container for Default/DeepEP mode compatibility - Revert barrier port cleanup to hardcoded defaults (5000, 2222, 15000) to stay aligned with in-container scripts Made-with: Cursor * fix(patches): fail fast on download/verification failures Patch download and verification failures now exit non-zero so that multi-node DP runs abort early instead of proceeding unpatched. Made-with: Cursor * restore default #SBATCH -N 2 for 1P/1D baseline Command-line sbatch -N overrides this for larger topologies (2P/2D, 4P/4D). Made-with: Cursor
…V transfer This PR bundles three independent fixes for the MoRIIO-based P/D disaggregation path (no overlap with open PRs vllm-project#40344, vllm-project#39276, vllm-project#32630, vllm-project#39835). 1. Proxy: handle max_completion_tokens for chat completions API. The OpenAI Chat Completions API uses max_completion_tokens; the toy proxy was only decrementing max_tokens, dropping the field on forward and causing the backend to use its default. When both fields are present, max_completion_tokens takes precedence per the OpenAI spec, so decrement that one first; fall back to max_tokens otherwise. 2. Proxy: add /health endpoint. Benchmark harnesses (e.g. vllm bench serve) perform health probes against the proxy before sending traffic. Without this endpoint the probe fails and the benchmark aborts. Add a minimal 200-OK responder that also reports the current instance counts. 3. Engine: defer write task when remote block_ids is None. When a prefill-side write task arrives before the scheduler has populated the remote block_ids (async handshake race), the previous code silently dropped the task. The fix is in _is_remote_ready: it now returns False when block_ids is still None, so the existing outer deferral path (in _write_worker_loop and _process_deferred_tasks) re-queues the task safely. The inner None-check inside _execute_write_task is removed; appending to self._deferred_tasks from inside the iteration in _process_deferred_tasks would have been clobbered by the still_deferred overwrite at end of loop. All three are narrow, independent fixes. They do not modify the scheduler's hot path or introduce new control flow. Verified to apply cleanly on upstream/main as of 2026-04-22. This change was developed with AI assistance; the author has reviewed and tested each hunk and is accountable for the behavior. Signed-off-by: Chaemin Lim <chaemin.lim@mangoboost.io>
49b1585 to
050c2d9
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
050c2d9 to
f1174f9
Compare
f1174f9 to
050c2d9
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Two complementary fixes for multi-node data-parallel disaggregated
inference with MoRIIO:
1. Engine_id collision fix (core.py, utils.py):
Use global dp_rank instead of local_dp_rank when generating the
engine_id suffix. In multi-node DP with --headless child nodes,
each node spawns engines with local_dp_rank 0..N-1, causing
collisions that route all KV transfers to the master node while
child nodes sit idle.
2. MoRIIO robustness (moriio_engine.py, moriio_connector.py):
Prevent permanent stalls from lost RDMA completions or transient
ZMQ failures:
a) Timeout-guarded transfer completion: Replace blocking
status.Wait() with a polling loop bounded by
VLLM_MORIIO_TRANSFER_TIMEOUT_S (default 120s). A lost CQE
no longer blocks the write-worker thread forever.
b) Failure detection in _pop_done_transfers: Check Failed()
status and implement age-based timeout for pending RDMA
reads. Timed-out or failed transfers are cleaned up instead
of accumulating indefinitely.
c) Deferred task timeout: Expire write tasks stuck waiting for
remote block allocation after VLLM_MORIIO_DEFERRED_TIMEOUT_S
(default 300s), preventing unbounded growth of the deferred
queue when the decode side never allocates blocks.
d) ZMQ notification retry: Use non-blocking send with 3 retries
and exponential backoff to handle transient send buffer full
conditions on the ZMQ DEALER socket.
Signed-off-by: Rav Gupta <ravi.gupta@amd.com>
Made-with: Cursor
Signed-off-by: raviguptaamd <ravi.gupta@amd.com>
…INS env var PIECEWISE cudagraph capture on large models (e.g. DeepSeek-V3) takes ~7 minutes, exceeding the hardcoded 5-minute handshake timeout and causing DP workers to fail in multi-node disaggregated setups (4P/4D and beyond). Allow overriding via environment variable while preserving the 5-minute default. Made-with: Cursor Signed-off-by: raviguptaamd <ravi.gupta@amd.com>
050c2d9 to
8591dfa
Compare
Signed-off-by: raviguptaamd <ravi.gupta@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com>
8591dfa to
55fccbe
Compare
Status update — 2026-06-01Rebased onto today's New head: Conflict resolutions during rebase (favored upstream's robustness in every case, then layered PR's new behaviors):
CI: DCO passing. Marked as Draft pending the final CI green. Ready for re-review. |
Summary
1. Engine_id collision fix (
core.py,utils.py)Use global
dp_rankinstead oflocal_dp_rankwhen generating the engine_id suffix for KV transfer in data-parallel deployments. In multi-node DP with--headlesschild nodes, usinglocal_dp_rankcauses master and child nodes to produce identical engine_ids, routing all data to whichever engine registered first.2. MoRIIOConnector fixes (
moriio_connector.py,moriio_common.py)Fix A -- Cache
kv_transfer_paramsacross scheduler steps: Snapshot into_req_kv_paramsdict at allocation time; use cached copy inbuild_connector_meta. Handles chunked prefill correctly: params are retained for requests moved to_reqs_need_pending_saveand only cleaned up when the final chunk is processed.Fix B -- Handle attention backend
block_sizeoverrides: FlashMLA silently overridesblock_sizeto 64 for MLA models. Trust the actual tensor shape instead of asserting against the CLI config value.Fix C -- Default values for MORI-IO-specific ports:
remote_handshake_portandremote_notify_portnow use.get()withMoRIIOConstantsdefaults, preventingKeyErrorfrom external coordinators (e.g., NIXL sidecar in LLM-D).Fix D -- Multi-node DP sizing and ranking: Use
data_parallel_size_localand localdp_rank(global % local_size) so port offsets stay in 0..N range per node. Applied to bothMoRIIOConfig.from_vllm_configandMoRIIOConnectorScheduler.Fix E -- Prevent child nodes from sending block notifications: Added
_is_kv_masterflag to guardsend_notify_blockcalls.3. MoRIIO robustness fixes (
moriio_engine.py,moriio_connector.py)Fix F -- Timeout-guarded transfer completion: Replace blocking
status.Wait()with a polling loop bounded byVLLM_MORIIO_TRANSFER_TIMEOUT_S(default 120s).Fix G -- Failure detection in
_pop_done_transfers: CheckFailed()status and implement age-based timeout for pending RDMA reads.Fix H -- Deferred task timeout: Expire write tasks stuck waiting for remote block allocation after
VLLM_MORIIO_DEFERRED_TIMEOUT_S(default 300s).Fix I -- ZMQ notification retry: Non-blocking send with 3 retries and exponential backoff for transient ZMQ failures.
Environment Variables
VLLM_MORIIO_TRANSFER_TIMEOUT_SVLLM_MORIIO_DEFERRED_TIMEOUT_SFiles Changed
core.pyutils.pymoriio_common.pymoriio_connector.pymoriio_engine.pySigned-off-by: Rav Gupta ravi.gupta@amd.com