[Bugfix][KV Transfer][NIXL] Notify P node on pre-admission rejection to free stranded KV blocks#41269
[Bugfix][KV Transfer][NIXL] Notify P node on pre-admission rejection to free stranded KV blocks#41269Dao007forever wants to merge 13 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a mechanism to notify KV connectors when a request is rejected before engine admission, allowing the NIXL connector to release remote KV blocks. It adds a new interface method and updates the OpenAI serving entrypoints to trigger cleanup during early failures. The review feedback identifies that the current error handling in the serving layer should be expanded to include adapter resolution and model identification, ensuring that all pre-admission failures are correctly handled.
…to free stranded KV blocks When a request carrying KV-transfer params is rejected on the D node before it is admitted to the engine scheduler (e.g. validation failure, render error, model check failure), the P node has no way to learn about the rejection and the prefill KV blocks remain pinned until VLLM_NIXL_ABORT_REQUEST_TIMEOUT (default 480s). This change plumbs a `notify_kv_transfer_request_rejected` path from the OpenAI-compatible serving layer down through the engine client, EngineCore, scheduler, and KV connector. For NIXL, the connector schedules an empty recv with the original `remote_*` params so the worker side issues a notification that frees the prefill blocks immediately. The scheduler also exposes `has_requests()` so the engine loop wakes up to flush the cleanup even when no admitted requests are running. Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
1f424e4 to
bf4ce90
Compare
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Dao007forever <dao007forever@gmail.com>
|
Q: what's the benefit vs setting |
|
60s is a very large in our tests, that reduced the throughput significantly as the concurrency was only ~4-5 and 1 hanging request affect ~25% concurrency. |
njhill
left a comment
There was a problem hiding this comment.
LGTM now thanks @Dao007forever, just one more minor comment from a last review.
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Summary
Fixes a bug where prefill KV blocks on the P (prefill) node are stranded for up to
VLLM_NIXL_ABORT_REQUEST_TIMEOUT(default 480s) when the D (decode) node rejects an incoming request before it is admitted to the engine scheduler.When a request that carries KV-transfer params (
do_remote_prefill=True,remote_block_ids,remote_engine_id, etc.) is rejected on D for reasons like:_check_model)previous_response_idnot found (responses API)…D never opens a NIXL transfer for it, so P never receives the implicit "transfer complete → free blocks" signal. The blocks linger until the abort timeout fires.
This PR adds an explicit early-rejection notification:
chat_completion,completion,responses) callsengine_client.notify_kv_transfer_request_rejected(...)on every pre-admission rejection path that haskv_transfer_params.do_remote_prefill=True.AsyncLLM→EngineCoreClient(in-process / sync MP / async MP / DPLB-async MP) →EngineCore→Scheduler→KVConnectorBase_V1.request_rejected_before_admission(...).NixlConnectorSchedulerrecognizes the params and enqueues an empty_reqs_need_recventry. On the next scheduler tick the worker side issues the notification that releases the remote blocks. To make sure that tick actually happens when D has no other in-flight work,Scheduler.has_requests()now also reports pending connector metadata.MultiConnectorfans out to its child connectors and accepts the first one that recognizes the params.DPLBAsyncMPClientbroadcasts the notification to all local DP engines when nodata_parallel_rankheader is present (the rejection happens before admission, so the request isn't yet tracked inreqs_in_flight).The
_reqs_need_recvvalue type changed from(Request, BlockIds)to(dict[str, Any], BlockIds)because pre-admission rejections do not have aRequestobject — only the rawkv_transfer_paramsdict — and that's all the existing code on the_build_kv_connector_metaside actually consumed (req.kv_transfer_params).If
do_remote_prefillis set but the requiredremote_*metadata is incomplete, the connector logs a warning and returns False (no-op) rather than guessing.Why this is not duplicating an existing PR
I checked open PRs and issues with searches like
stranded prefill,stranded KV blocks rejected NIXL,request_rejected_before_admission,kv_transfer_params reject, anddecode reject prefill. The closest matches are:No existing PR covers the rejected-before-admission notification path.
Test plan
tests/v1/kv_connector/unit/test_nixl_connector.py::test_rejected_remote_prefill_request_enqueues_empty_recv— verifies the connector enqueues an empty recv with the originalremote_*params andScheduler.has_requests()flips True until the tick flushes it.tests/v1/kv_connector/unit/test_nixl_connector.py::test_rejected_remote_prefill_request_missing_metadata_is_ignored— verifies the connector no-ops (and does not mutatedo_remote_prefill) when requiredremote_*fields are missing.tests/v1/kv_connector/unit/test_multi_connector.py::test_request_rejected_before_admission_uses_first_accepting_connector— verifies short-circuit fan-out behavior inMultiConnector..venv/bin/python -m pytest tests/v1/kv_connector/unit/test_multi_connector.py tests/v1/kv_connector/unit/test_nixl_connector.py -v— 76 passed, 2 skipped, 1 unrelated failure (test_multi_example_connector_consistency, which fails onOSError: gated repo meta-llama/Llama-3.2-1B-Instruct— unrelated to this change).kv_transfer_params.do_remote_prefill=Trueand observing that the P-side block usage drops immediately rather than afterVLLM_NIXL_ABORT_REQUEST_TIMEOUT.AI assistance
This change was prepared with assistance from Claude (Anthropic). I (the human submitter) reviewed every changed line, ran the tests above, and can defend the design end-to-end.
🤖 Generated with Claude Code