[DistInf] Enable multi-node MoRI EP disaggregated inference with CUDA graph decode support #150
Enable multi-node xP/yD disaggregated prefill/decode inference with MoRI EP, proven on OCI MI300X (jobs 18044-18408) and AAC MI355X clusters.

Key changes:
- Remove 1P/1D restriction: support arbitrary xP/yD topologies
- Fix 6: pass --kv-transfer-config only on master nodes; child nodes join via --headless (prevents spurious proxy registration from headless workers)
- Add apply_moriio_2pd_patches.sh: downloads and applies vLLM PR #39276 at container startup for the engine_id collision fix and MoRIIO robustness
- Add RDMA/NCCL/MoRI env var passthrough to Docker containers
- Add host-local compilation caches (/tmp/vllm_cache) to avoid NFS races
- Add --ulimit memlock=-1:-1 for large RDMA memory registrations
- Add auto-discovery of host RDMA provider libs (mlx5, ionic, bnxt)
- Add stall detection with configurable per-step timeout in benchmark
- Add PyTorch default_pg_timeout patch (30min -> configurable, default 2h)

Proven config: --enforce-eager, moriio_toy_proxy_server.py (co-located), warmup ISL=32/OSL=32 con=1, PR #39276 applied at runtime.

Docker image: rocm/pytorch-private:20260407_itej89_vllm_mori_docker

Depends on: vllm-project/vllm#39276 (applied at runtime)

Made-with: Cursor
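The master/child split described under "Fix 6" can be sketched as a small helper that emits the extra server arguments for one node. This is an illustration only: `build_role_args`, the node-rank convention (0 = master), and the way the JSON is passed in are assumptions, not code from `vllm_disagg_mori_ep.sh`.

```bash
#!/usr/bin/env bash
# Sketch only: build_role_args and its arguments are hypothetical names.
# Prints one extra server argument per line for a node in a prefill/decode group.
build_role_args() {
    local node_rank="$1"        # assumed convention: 0 = master of its group
    local kv_config_json="$2"   # JSON string destined for --kv-transfer-config

    if [ "$node_rank" -eq 0 ]; then
        # Master node: the only node that receives the KV transfer config.
        printf '%s\n' "--kv-transfer-config" "$kv_config_json"
    else
        # Child node: joins via --headless and gets no KV transfer config,
        # so it never registers itself with the proxy.
        printf '%s\n' "--headless"
    fi
}
```

Keeping the decision in one function makes it easy to check that no child node can ever receive `--kv-transfer-config` by accident.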
Decode nodes can now optionally use CUDA graphs via the VLLM_CUDAGRAPH_MODE env var (e.g. FULL_DECODE_ONLY), while prefill nodes always run eager. This captures CUDA graphs for the autoregressive decode phase only, reducing per-token dispatch overhead while preserving eager flexibility for prefill and MoRI EP all-to-all.

Usage: VLLM_CUDAGRAPH_MODE=FULL_DECODE_ONLY sbatch ... run_xPyD_models.slurm

Default behavior (no env var set) remains --enforce-eager on all nodes.

Made-with: Cursor
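A rough sketch of the per-role selection, assuming the mode is forwarded to vLLM through the `cudagraph_mode` field of `--compilation-config`. The PR text does not say how the script actually passes it, so treat that mapping, and the helper name `graph_args`, as assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: graph_args is a hypothetical helper.
# VLLM_CUDAGRAPH_MODE is the env var named in the commit message.
graph_args() {
    local role="$1"                        # "prefill" or "decode"
    local mode="${VLLM_CUDAGRAPH_MODE:-}"  # empty when the env var is not set

    if [ "$role" = "decode" ] && [ -n "$mode" ]; then
        # Assumption: the mode is handed to vLLM as compilation-config JSON.
        printf '%s\n' "--compilation-config" "{\"cudagraph_mode\": \"$mode\"}"
    else
        # Prefill nodes, and decode nodes with no mode set, stay eager.
        printf '%s\n' "--enforce-eager"
    fi
}
```

With the variable unset, both roles get `--enforce-eager`, which matches the documented default.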
MoRI v1.1.0+ requires a valid interface name for shmem bootstrap. Setting a sensible default prevents empty-string failures on OCI clusters where eth0 is the management NIC. Made-with: Cursor
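The empty-string case is the subtle part: a variable that is set but empty passes a plain "is it set" check. A minimal sketch of the defaulting pattern follows; `resolve_ifname` and its arguments are hypothetical, and the actual variable name and default used by the script are not stated here.

```bash
#!/usr/bin/env bash
# Sketch only: resolve_ifname is a hypothetical helper.
resolve_ifname() {
    local requested="${1:-}"   # may be unset or an empty string
    local fallback="$2"        # default interface chosen by the caller
    # ":-" substitutes the fallback for both unset and empty values;
    # a plain "-" would let an empty string through unchanged.
    printf '%s\n' "${requested:-$fallback}"
}
```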
- Fix set -e abort in apply_moriio_2pd_patches.sh: move the python3 fallback out of the for-loop word expansion to prevent a script abort when vLLM is not importable
- Fail fast on patch failure for multi-node DP (xP>1 or yD>1): patches are mandatory for multi-node, optional for 1P/1D
- Fix timeout exit code in benchmark_xPyD.sh: use PIPESTATUS[0] instead of $? to capture timeout's exit code through the pipe
- Restore PROXY_TYPE, ROUTER_PORT, BENCHMARK_PORT passthrough to the Docker container for Default/DeepEP mode compatibility
- Revert barrier port cleanup to hardcoded defaults (5000, 2222, 15000) to stay aligned with in-container scripts

Made-with: Cursor
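The PIPESTATUS fix is easy to get wrong, so here is a minimal sketch of the pattern. `run_step`, its arguments, and the log message are hypothetical, not the code in `benchmark_xPyD.sh`. It relies on GNU `timeout` returning 124 when the time limit is hit.

```bash
#!/usr/bin/env bash
# Sketch only: run_step and its arguments are hypothetical names.
# Runs a command under a time limit while appending its output to a log.
run_step() {
    local limit="$1" log="$2" rc
    shift 2
    timeout "$limit" "$@" 2>&1 | tee -a "$log"
    # After a pipeline, $? is the status of the last command (tee).
    # PIPESTATUS[0] holds the status of the first command (timeout).
    rc="${PIPESTATUS[0]}"
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with 124 when the command ran past the limit.
        echo "step stalled: exceeded ${limit}s" >> "$log"
    fi
    return "$rc"
}
```

`PIPESTATUS` must be read immediately after the pipeline, because any intervening command overwrites it.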
Patch download and verification failures now exit non-zero so that multi-node DP runs abort early instead of proceeding unpatched. Made-with: Cursor
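A hedged sketch of what fail-fast verification can look like. The function names, the target file, and the marker string are all hypothetical; the real script's markers are not listed in this PR text. The mandatory-versus-optional rule follows the earlier commit: patches are required when xP>1 or yD>1.

```bash
#!/usr/bin/env bash
# Sketch only: patches_required and verify_patch are hypothetical helpers.

# Succeeds when the topology is multi-node (xP > 1 or yD > 1).
patches_required() {
    local xp="$1" yd="$2"
    [ "$xp" -gt 1 ] || [ "$yd" -gt 1 ]
}

# Succeeds only if the target file exists and contains the marker string.
verify_patch() {
    local target="$1" marker="$2"
    if [ ! -f "$target" ]; then
        echo "ERROR: $target not found; cannot verify patch" >&2
        return 1
    fi
    if ! grep -qF -- "$marker" "$target"; then
        echo "ERROR: marker '$marker' missing from $target" >&2
        return 1
    fi
}
```

A caller would combine the two, exiting non-zero only when verification fails and `patches_required` says the topology needs the patch.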
Pull request overview
Enables multi-node MoRI EP disaggregated inference (multi-node xP/yD topologies) on OCI MI300X clusters, including optional decode-side CUDA graph execution and runtime patching for upstream vLLM multi-node DP fixes.
Changes:
- Extend `vllm_disagg_mori_ep.sh` to support multi-node DP (master-only KV transfer config, headless child nodes) with an optional `FULL_DECODE_ONLY` CUDA graph decode mode and expanded RDMA/NCCL and cache tuning.
- Update the Slurm launcher to add a memlock ulimit, host-local compilation caches, and broader RDMA library auto-mounting into containers.
- Add an idempotent runtime patch script that downloads and applies vLLM PR #39276 and verifies expected markers.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `scripts/vllm_dissag/vllm_disagg_mori_ep.sh` | Multi-node MoRI EP topology support, master-only KV transfer config, CUDA-graph decode option, and RDMA/cache/timeout tuning. |
| `scripts/vllm_dissag/run_xPyD_models.slurm` | Container runtime tuning (memlock, caches), RDMA library mounts, and env passthrough for multi-node settings. |
| `scripts/vllm_dissag/benchmark_xPyD.sh` | Benchmark warmup parameterization and per-step timeout/stall logging. |
| `scripts/vllm_dissag/apply_moriio_2pd_patches.sh` | New startup script to fetch, apply, and verify the vLLM PR #39276 patch for multi-node DP robustness. |
    @@ -1,7 +1,5 @@
    #!/bin/bash
    #SBATCH --job-name=vllm-pd # Specify a custom string for your slurm batch job
    #SBATCH -N 2 # Request xP + yD nodes (proxy co-located on prefill master)
Is this change accidental?
Fixed in 30a0a8d — restored #SBATCH -N 2 as the default for 1P/1D. Command-line sbatch -N 4 overrides it for larger topologies (2P/2D, 4P/4D).
lcskrishna left a comment
LGTM. If the change is accidental, please fix it.
Command-line sbatch -N overrides this for larger topologies (2P/2D, 4P/4D). Made-with: Cursor
Summary
Enable multi-node MoRI EP disaggregated prefill/decode inference (2P/2D, 4P/4D) on OCI MI300X clusters with InfiniBand RDMA. Adds an optional `FULL_DECODE_ONLY` CUDA graph mode for decode nodes.

Key changes:
- Master-only `--kv-transfer-config` (child nodes join via `--headless`)
- Runtime patching via `apply_moriio_2pd_patches.sh`
- `--ulimit memlock=-1:-1`, host RDMA library auto-discovery
- Host-local compilation caches (`/tmp/vllm_cache/`) for AITER JIT, Triton, COMGR, vLLM
- Optional `FULL_DECODE_ONLY` CUDA graphs for decode nodes (prefill always eager)

Files Changed
- `scripts/vllm_dissag/vllm_disagg_mori_ep.sh`
- `scripts/vllm_dissag/run_xPyD_models.slurm`
- `scripts/vllm_dissag/benchmark_xPyD.sh`
- `scripts/vllm_dissag/apply_moriio_2pd_patches.sh`

Architecture

Docker Image & Dependencies
- Dockerfile: `docker/vllm_disagg_inference.ubuntu.amd.Dockerfile` (unchanged in this PR)
- Base image: `rocm/vllm-dev:base_torch2.10_triton3.6_rocm7.2_torch_build_20260216`
- vLLM: `v0.18.1rc1.dev133+g7d6917bef` (same commit as Dockerfile `VLLM_COMMIT`)

How to run
1. Build the Docker image
Push to your registry so all nodes can pull it:
2. Run 2P/2D (default eager mode)
3. Run 2P/2D with CUDA graph decode
4. Run 4P/4D
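The node count passed to `sbatch -N` follows from the topology: the `#SBATCH` comment says to request xP + yD nodes, with the proxy co-located on the prefill master. A tiny sketch of that arithmetic, where `node_count` is a hypothetical helper and not part of the PR's scripts:

```bash
#!/usr/bin/env bash
# Sketch only: node_count is a hypothetical helper.
node_count() {
    local xp="$1" yd="$2"
    # The proxy shares the prefill master, so it needs no node of its own.
    echo $(( xp + yd ))
}
```

For example, `sbatch -N "$(node_count 4 4)" ... run_xPyD_models.slurm` would request 8 nodes. How xP and yD themselves are passed to the Slurm script is not shown here.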
Compilation caches
AITER JIT, Triton, COMGR, and vLLM caches are auto-mounted to host-local `/tmp/vllm_cache/`. The first run on a node takes longer (JIT compilation); subsequent runs reuse cached artifacts. Cache paths are configurable via `AITER_JIT_DIR`, `TRITON_CACHE_DIR`, `COMGR_CACHE_DIR`, and `VLLM_CACHE_ROOT`.

Result log locations
Logs are written to `/shared_inference/$USER/model_blog_logs/{SLURM_JOB_ID}/`:
- `prefill_NODE{N}.log` / `decode_NODE{N}.log`: per-node vLLM server logs
- `proxy_NODE0.log`: proxy server log
- `benchmark_{JOBID}_*_CONCURRENCY.log`: benchmark results
- `pd_vllm_bench_NODE0.log`: benchmark driver output

Known limitations
`--request-rate` to throttle.