llmd+vllm+mori-ep(intra node wide-ep)+mori-io(write) for 1p1d with dp=ep=8 tp=1#44355
llmd+vllm+mori-ep(intra node wide-ep)+mori-io(write) for 1p1d with dp=ep=8 tp=1#44355shikamd123 wants to merge 8 commits into
Conversation
…car omits it The llm-d routing-sidecar (`--kv-connector=nixlv2`) splits an incoming chat completion into separate prefill and decode requests, forwarding NIXL-shaped `kv_transfer_params` on each leg. Those params include `do_remote_decode` / `do_remote_prefill` plus the `remote_engine_id`/`host`/`port`/`block_ids` triplet for the NIXL READ flow, but do not include MoRI-IO's own `transfer_id` field -- that is a MoRI-IO concept the sidecar has no knowledge of. Without this patch, `MoRIIOConnectorScheduler.update_state_after_alloc` unconditionally dereferences `params["transfer_id"]` and the prefill engine crashes on first traffic with `KeyError: 'transfer_id'`. Synthesize a stable `transfer_id` from `request.request_id` so both producer and consumer (which see the same `request_id` through the sidecar fan-out) end up with the same `transfer_id`, without requiring any wire-protocol change in the sidecar. This is the MoRI-IO-side counterpart to PR vllm-project#39276's "Fix C", which only fixed the analogous `KeyError` for `remote_handshake_port` / `remote_notify_port`. See-also: vllm-project#39276 (Fix C, by @raviguptaamd: same defensive .get()-with-default pattern, applied there to remote_handshake_port and remote_notify_port; this commit extends it to transfer_id). Signed-off-by: shikpate <shikpate@amd.com>
… requests `Scheduler._update_from_kv_xfer_finished` currently asserts that every `req_id` reported in `kv_connector_output.finished_sending` is still present in `self.requests`. That invariant holds for connectors that report completion synchronously inside the same scheduler step, but it is violated by WRITE-mode connectors (e.g. MoRI-IO in WRITE mode) which report `finished_sending` out-of-band from a deferred-write task that can complete one or more steps after the scheduler already removed the request via the normal finish path. Symptom: `AssertionError` in `_update_from_kv_xfer_finished` randomly under sustained disagg P/D traffic, killing the engine. Fix: replace the `assert req_id in self.requests` with a skip-if-missing guard, so the late completion is silently ignored. The block-free already happened on the synchronous finish; nothing else needs to be done. The matching `finished_recving` branch above already keeps the "missing" req_id in `finished_recving_kv_req_ids` for one more step and is handled separately, so this change is asymmetric on purpose. Related: vllm-project#39276 (engine-side timeouts for the same async-MoRI-IO completion races this commit handles scheduler-side; independent fixes, complementary surface). Signed-off-by: shikpate <shikpate@amd.com>
`DPEngineCoreProc.add_request` currently gates the ``engines_running`` flip and the ``start_wave`` broadcast on ``request_wave != self.current_wave``. Both ``current_wave`` and ``request_wave`` default to ``0``, so on the very first request after engine init the gate is False and the broadcast never happens. Consequence on collectives-heavy models (Wide-EP, large DP): * The DP rank that received the first request enters its forward pass and blocks on a collective (e.g. EP all2all, MoE all2all, ``has_unfinished_dp`` all-reduce). * The other DP ranks observe ``engines_running == False`` and ``local_unfinished_reqs == False`` in ``run_busy_loop``, take the ``continue`` path, and never call ``execute_dummy_batch`` -- so they never enter the collective. * The busy rank hangs forever on the collective until the ``multiproc_executor`` 1800 s timeout fires: ``RPC call to sample_tokens timed out``. Warm requests work fine because ``current_wave`` has already advanced past 0, so subsequent first-of-wave requests do trigger the broadcast; only the very first cold request after engine init hangs. This makes the bug invisible in CI but reliably reproduces in production startup of large DP topologies. Reproduced on DeepSeek-V3, DP=8, DP=16, TP=1, EP=8/16, on a fresh engine -- 100% deterministic hang on the first request, never on subsequent ones. Fix: drop the ``request_wave != self.current_wave`` outer gate. We still ``return`` early in steady state because ``engines_running`` is already True. When engines are idle and the scheduler is unpaused, we wake them up via the same code path that previously only fired for ``request_wave > current_wave``. Related: vllm-project#36594, vllm-project#36608, vllm-project#37024, vllm-project#38009 all touch the same region but for a different pause/resume race. This first-wave race is distinct and not fixed by any of them. Signed-off-by: shikpate <shikpate@amd.com>
…back
When the OpenAI server is run with ``--api-server-count N`` (N > 1),
Linux SO_REUSEPORT shuffles incoming connections across ApiServer
processes. Two legs of a disaggregated prefill/decode pair (which
share a ``request_id``) can land on different ApiServers and be
load-balanced to different DP ranks. KV-transfer protocols that pin
source/target by DP rank (MoRI-IO, NIXL WRITE-mode, ...) then end up
exchanging handshakes with the wrong peer and the request deadlocks
at the connector level.
The result is rank-asymmetric: requests that happen to land on a
``(prefill DP=H, decode DP=H)`` pair succeed, all others time out
after ``VLLM_MORIIO_DEFERRED_TIMEOUT_S`` (300 s by default).
This patch adds a ``_pick_dp_rank_for_request`` helper that
``AsyncLLM.add_request`` consults when the caller did not supply a
``data_parallel_rank``. The helper synthesizes a stable rank in this
order:
1. ``params.extra_args["kv_transfer_params"]["dp_rank_hint"]`` if
the caller (or an upstream routing sidecar) already picked the
rank.
2. Otherwise a stable ``blake2s(request_id) % effective_dp_size``
hash. Because the disagg sidecar uses the same ``request_id`` on
the prefill and decode legs, both sides hash to the same rank H
and the SO_REUSEPORT shuffle is neutralised.
When ``data_parallel_size_local`` is set and smaller than
``data_parallel_size`` (multi-pod DP, "Wide-EP"), the modulus is
capped to the local pod size so that both legs route to the same pod
-- cross-pod handshake requires a coordinator that may not exist in
the disagg orchestrator.
The helper returns ``None`` when there is no DP fan-out to
disambiguate, leaving the existing dispatch path unchanged. Callers
that already pass an explicit ``data_parallel_rank`` (e.g. via the
``X-data-parallel-rank`` header) are untouched.
Once ``data_parallel_rank`` is set,
``DPLBAsyncMPClient.get_core_engine_for_request`` already honours the
hint and dispatches the request to ``EngineCore_DPH`` instead of
load-balancing -- no changes are needed in the dispatch core.
Related: vllm-project#39276 (multi-node engine_id collision fix: same theme of deterministic DP routing in P/D pairs; that PR handles --headless multi-node DP, this commit handles --api-server-count > 1 ApiServer fan-out).
Signed-off-by: shikpate <shikpate@amd.com>
In a DP>1 disaggregated deployment, ``request_finished`` on the
decode-side scheduler reads the prefill DP rank from
``request.kv_transfer_params.get("remote_dp_rank", 0)``. That field
is a **static** value injected by the routing sidecar (e.g. via
``--moriio-prefill-dp-rank``, default 0), so every decode->prefill
notify lands on a single prefill DP rank.
When the prefill dispatcher actually routes requests round-robin /
hash / least-loaded across multiple DP ranks, every request whose
prefill leg ran on rank N>0 never gets its ``done_remote_allocate``
notify -- those workers spin in ``save_kv_layer`` until the deferred
write task expires after ``VLLM_MORIIO_DEFERRED_TIMEOUT_S``
(300 s default), and the request fails.
In practice, **only requests that happen to land on the pinned
prefill rank succeed**. Counts collected from a 138 minute run with
DP=8, TP=1, EP=8, DeepSeek-V3, sidecar pinned to prefill DP0:
Worker_DP0: 0 EXPIRED (works)
Worker_DP1: 183
Worker_DP2: 61
Worker_DP3: 183
Worker_DP4: 427
Worker_DP5: 122
Worker_DP6: 61
Worker_DP7: 61
Fix: when ``remote_dp_size > 1`` and the caller did not explicitly
set ``remote_dp_rank_override``, compute the prefill DP rank from a
stable hash of ``request_id``:
remote_dp_rank = int.from_bytes(
hashlib.blake2s(request_id, digest_size=8).digest(), "big"
) % remote_dp_size
The dispatcher-side helper (``AsyncLLM._pick_dp_rank_for_request``)
uses the same blake2s scheme so both legs (prefill dispatch + decode
notify) agree on the rank, neutralising the SO_REUSEPORT shuffle that
the disagg sidecar can otherwise induce.
By the time ``request_finished`` runs, MoRI-IO has appended a
per-transfer suffix ``-<8 hex>`` to ``request.request_id`` (it isn't
on the AsyncLLM rid the dispatcher hashes). Strip the suffix so both
legs hash the same canonical base id.
For multi-pod DP topologies (Wide-EP-16: 8 ranks per pod on
master+child), cap the modulus to ``remote_dp_size_local`` when set so
the notify lands on the same pod the dispatcher routed to. Without
the cap, hash mod 16 can pick a rank on the other pod and the notify
goes to an engine that never serviced the request.
Single-DP and unspecified-DP deployments are unchanged (``_dp_size
<= 1`` short-circuits to the previous behaviour).
Companion to the dispatcher-side
``AsyncLLM._pick_dp_rank_for_request`` (separate PR). The two patches
are independently useful but together provide end-to-end stable DP
routing for disagg prefill/decode pairs.
Related: vllm-project#39276 (Fix E + engine_id collision: same deterministic-DP-routing problem space; companion to the dispatcher-side AsyncLLM blake2s fix in the previous commit).
Signed-off-by: shikpate <shikpate@amd.com>
``MoRIIOConnectorScheduler.unmap_request_id`` (decode-side, run from
``request_finished``) currently does an exact-match lookup on
``self.request_id_to_transfer_id`` and warns on miss with::
Could not find <rid> in transfer_id_to_request_id lookup table.
This could lead to a possible hang.
In multi-pod disagg routing we observe in production that MoRI-IO
appends a "-[0-9a-f]{8}" per-transfer suffix to ``request.request_id``
between the call that populated the map (``update_state_after_alloc``,
alloc-time) and the call that drains it (``request_finished``,
finish-time). The lookup is exact-match, so the suffix mutation
produces a spurious warning, leaks the dict entry, and ships stale
state to the worker via ``meta.transfer_id_to_request_id`` -- which
manifests as rank-asymmetric MoRI-IO transfer-id lookup failures in
worker logs on the pod where the suffix gets appended (decode-master
in Wide-EP DP=16, ranks 0..7).
Concretely::
map request_id = "cmpl-bda091899755d21b-0" (no suffix)
unmap request_id = "cmpl-bda091899755d21b-0-956053a4" (8-hex suffix)
Same canonical request, different keys -> dict lookup misses, dict
entry leaks.
This patch makes ``unmap_request_id`` robust to the suffix mutation:
1. Try exact match first. This is the existing fast path and is
zero-overhead / bit-identical to the pre-patch behaviour for
callers that pass the canonical rid (decode-child, ranks 8..15).
2. If the exact-match misses, strip a trailing ``-[0-9a-f]{8}``
suffix and retry. If the canonical base id is present, log a
``debug``-level note and proceed with the canonical id.
3. If both miss, log a more informative warning (table size +
canonical base id) so a real "never mapped" bug is still easy
to tell apart from the suffix mutation.
The regex is declared as a private class-level constant
(``_MORIIO_RID_SUFFIX_RE``) so it is compiled once at import time, not
per-call.
This is the matching scheduler-side fix to the ``request_finished``
hash-routing patch (separate PR) and uses the same suffix shape that
patch already understands. The two are independently useful but
together provide end-to-end rid normalisation for the
sidecar-fronted decode path.
Signed-off-by: shikpate <shikpate@amd.com>
…ector_meta A request that arrives without ``kv_transfer_params`` (smoke test, mis-routed gateway request, kubelet probe POST, ...) is scheduled like any other request and shows up in ``scheduled_cached_reqs`` on the next tick. ``MoRIIOConnectorScheduler.build_connector_meta`` then unconditionally indexes ``self._reqs_need_pending_save[req_id]`` -- a dict that is only populated for true disagg requests -- and raises ``KeyError``, which crashes the EngineCore and cascades the whole producer pod down. This is the small drive-by hot-fix from the Wide-EP multi-pod patch series, extracted standalone because it applies equally to single-pod deployments running behind any gateway / EPP / health-probe path. Skip the loop body silently when ``req_id`` is not in the pending-save dict; preserves all behaviour for true disagg requests. Signed-off-by: shikpate <shikpate@amd.com>
The three preceding fixes touch shared (non-MoRI-IO-specific) files: - vllm/v1/core/sched/scheduler.py (finished_sending skip-if-missing) - vllm/v1/engine/async_llm.py (DP-rank hash fallback) - vllm/v1/engine/core.py (DP first-wave wake) While each fix is correctness-positive for any disagg+DP user, gating them behind ``current_platform.is_rocm()`` keeps this branch a clean ROCm-only divergence from upstream main and avoids subtly changing default behaviour for CUDA / TPU / CPU users until each fix is upstreamed individually. CUDA path is bit-identical to upstream HEAD; ROCm path runs the new behaviour. The MoRI-IO connector module (``vllm/distributed/kv_transfer/kv_connector/v1/moriio/``) is already ROCm-only by virtue of needing the MoRI runtime library, so its edits do not need an explicit guard. Signed-off-by: shikpate <shikpate@amd.com>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. Agent GuidelinesIMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban. 🚀 |
|
fyi @raviguptaamd @lcskrishna dp=ep=16 is in progress testing |
|
note: there is a way in llm-d to enforce running P/D for every request |
|
@robertgshaw2-redhat thanks for letting me know, will include that in my dp=16 testing and code update, will update you then. |
| logger.debug("Finished sending KV transfer for request %s", req_id) | ||
| assert req_id in self.requests | ||
| self._free_blocks(self.requests[req_id]) | ||
| if current_platform.is_rocm(): |
There was a problem hiding this comment.
Are you sure you can't pass some metadata and have the prefill worker or decode worker help to do this task?
You can get metadata in the worker with start_load_kv and send metadata by returning appropriate metadata from build_connector_meta. This scheduler and worker process communicate. You can then return finished requests in get_finished. Maybe you can modify update_connector_output to ensure requests are deleted when appropriate.
There was a problem hiding this comment.
thanks for the feedback @rasmith . Checking this out
| # Steady-state remains correct because ``engines_running`` is | ||
| # already True so the inner branch short-circuits. | ||
| if self.has_coordinator: | ||
| if request_wave > self.current_wave: |
There was a problem hiding this comment.
It seems like outer if current_platform.is_rocm() isn't necessary and this could be simplified.
| if request_id in self.request_id_to_transfer_id: | ||
| transfer_id = self.request_id_to_transfer_id[request_id] | ||
| del self.request_id_to_transfer_id[request_id] | ||
| # In multi-pod disagg routing, MoRI-IO can append a |
|
This pull request has merge conflicts that must be resolved before it can be |
Purpose
This PR delivers eight independent correctness fixes for vLLM's V1
data-parallel + KV-transfer machinery, all motivated by the same
real-world deployment: multi-replica disaggregated prefill/decode
serving with
MoRI-IOas the KV connector, fronted by thellm-drouting sidecar on AMD MI300X (gfx942).
The fixes split into three thematic groups; each commit is atomic and
revertible on its own. No CUDA / TPU / CPU runtime behaviour changes
in this PR — the V1 commits are gated behind
current_platform.is_rocm()so non-ROCm users see byte-identicalbehaviour to upstream HEAD until the maintainers decide otherwise.
Group A — V1 disagg-DP correctness (3 commits, 3 files)
These three were each a 100%-reproducible deadlock or AssertionError
in production multi-DP runs and would not have been visible in CI:
[Bugfix][V1][Scheduler] Tolerate finished_sending for already-removed requestsAssertionErrorinScheduler._update_from_kv_xfer_finishedunder sustained WRITE-mode KV trafficassert req_id in self.requestswith skip-if-missing guard for late async completions. The synchronous finish path already freed blocks; the latefinished_sendingcallback has nothing to do. Asymmetric on purpose vs thefinished_recvingbranch above (which already keeps the rid for one extra step).[Bugfix][V1][DP] Wake other DP engines on first request of first waverequest_wave != self.current_waveouter gate inDPEngineCoreProc.add_request. Both default to 0 → first request never broadcastsstart_wave→ other DP ranks never enter the collective. Warm requests work; only the very first cold one hangs (so CI never catches it).[Bugfix][V1][DP] AsyncLLM: stable per-request data_parallel_rank fallback--api-server-count > 1(SO_REUSEPORT shuffle); KV handshake addresses the wrong peer; request hangs untilVLLM_MORIIO_DEFERRED_TIMEOUT_S(300s default) tripsAsyncLLM._pick_dp_rank_for_requesthelper: honours an explicitkv_transfer_params["dp_rank_hint"](set by routing sidecars), else falls back toblake2s(request_id) % effective_dp_size. ReturnsNonewhen there's no DP fan-out, leaving the existing dispatch path unchanged. Caps modulus todata_parallel_size_localfor multi-pod DP.Group B — MoRI-IO connector hardening (4 commits, 1 file)
All four commits live in
vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.pybut address distinct, independent failure modes:
[Bugfix][KVConnector][MoRI-IO] Synthesize transfer_id when llm-d sidecar omits it--kv-connector=nixlv2) injects NIXL-shapedkv_transfer_paramsthat don't include MoRI-IO'stransfer_id.MoRIIOConnectorScheduler.update_state_after_allocunconditionally dereferencedparams["transfer_id"]→KeyErroron first traffic. Fix synthesises a stabletransfer_idfromrequest.request_idso producer + consumer agree without any wire-protocol change.[Bugfix][KVConnector][MoRI-IO] Hash-route decode->prefill notify in DP>1request_finisheddecode-side reads the prefill DP rank from a static sidecar-injected field (default 0). When the prefill dispatcher actually round-robins across DP ranks, every request whose prefill leg ran on rank N>0 never gets itsdone_remote_allocatenotify and starves until 300s timeout. 138-min run with DP=8 EP=8 DSV3 sidecar pinned to prefill DP0 produced the asymmetric pattern: DP0=0 expired, DP1-7 between 61 and 427 expired. Fix uses the sameblake2s(request_id) % remote_dp_sizescheme as the dispatcher-side helper above; honours aremote_dp_rank_overridesentinel so the sidecar can pre-empt this fallback when it has its own rank-pinning logic.[Bugfix][KVConnector][MoRI-IO] Tolerate per-transfer rid suffix in unmap-[0-9a-f]{8}per-transfer suffix torequest.request_idbetweenupdate_state_after_alloc(alloc-time, pre-suffix) andunmap_request_id(finish-time, post-suffix). The exact-match dict lookup misses, dict entry leaks, staletransfer_id_to_request_idships to the worker → rank-asymmetric MoRI-IO transfer-id lookup failures. Fix: try exact match first, else strip the regex-defined suffix and retry; warn only if both miss.[Bugfix][KVConnector][MoRI-IO] Skip non-disagg requests in build_connector_metakv_transfer_params(smoke test, kubelet probe POST, mis-routed gateway request) crashes the whole EngineCore viaKeyErroronself._reqs_need_pending_save[req_id]. Fix is a single-line skip-if-missing guard. Drive-by hot-fix that applies equally to single-pod deployments behind any gateway / EPP / health-probe path.Group C — ROCm-only gating (1 commit, 3 files)
[ROCm] Gate disagg-DP fixes behind current_platform.is_rocm()if current_platform.is_rocm()guards. Keeps this PR a clean ROCm-only divergence from upstream main; CUDA / TPU / CPU users see byte-identical behaviour until each Group A fix is upstreamed individually. The MoRI-IO connector fixes (Group B) are inherently ROCm-only (they need the MoRI runtime), so no explicit guard is needed there.Acknowledgements
The MoRI-IO connector fixes share themes with vLLM PR #39276
("Fix engine_id collision + MoRIIO robustness for multi-node disagg
DP", by @raviguptaamd).
Specifically:
Synthesize transfer_id when llm-d sidecar omits itextends PRFix engine_id collision + MoRIIO robustness for multi-node disagg DP #39276's "Fix C" (
.get(default)forremote_handshake_port/remote_notify_port) with the same defensive pattern fortransfer_id.Hash-route decode->prefill notify in DP>1is companion to PRFix engine_id collision + MoRIIO robustness for multi-node disagg DP #39276's "Fix E" (
_is_kv_mastermaster-only notify guard) and"Fix Fix a bug in tying OPT embeddings #1" (engine_id global dp_rank): same deterministic-DP-routing
problem space, but PR Fix engine_id collision + MoRIIO robustness for multi-node disagg DP #39276 handles the multi-node
--headlesschild case while this PR handles the
--api-server-count > 1ApiServer fan-out case. The two PRs are independent and compose
cleanly: this PR's
remote_dp_rank_overridesentinel lets asidecar-side rank pinner (e.g. llm-d's
pickDPRank) pre-empt thisfallback, leaving it dormant in production while still acting as
a fail-safe.
Tolerate finished_sending for already-removed requestsis thescheduler-side counterpart to PR Fix engine_id collision + MoRIIO robustness for multi-node disagg DP #39276's engine-side timeout
handling for the same async-MoRI-IO completion races.
Each commit body has a
Related:/See-also:trailer pointing atthe relevant PR #39276 fix where applicable. Credit to @raviguptaamd
for the upstream MoRI-IO design contributions.
Test Plan: TBD
Each fix is exercised end-to-end in a real disaggregated-prefill/decode
deployment on AMD MI300X (gfx942) using
llm-d as the
routing sidecar and a vLLM image with the MoRI-EP / MoRI-IO Wide-EP
patches built in.
Test Result: TBD
1P1D DP=EP=8 TP=1 — passing on AMD MI300X (gfx942)
End-to-end serving +
vllm bench serveC=1,2,4,8,16,32 sweep for DSV3completed without hangs, KeyErrors, or rank-asymmetric expirations:
2P2D DP=EP=16 TP=1: TBD bring-up in progress
Essential Elements of an Effective PR Description Checklist
V1 disagg-DP + MoRI-IO KV connector on ROCm; gated to keep CUDA
bit-identical. Each commit body has its own purpose, related
issues / PRs, and reproducer notes.
AMD MI300X with
llm-dsidecar; benchmark sweep withvllm bench serve.before/after expired-count breakdown; 2P2D DP=16 results to
follow as a comment when the multi-pod run completes (depends
on PR Fix engine_id collision + MoRIIO robustness for multi-node disagg DP #39276).
connector is still maturing inside vLLM; no public-facing
supported_models.md/exampleschange is implied. Happy toadd release notes if maintainers request.
Signed-off-by: Shiksha Patel shiksha.patel@amd.com