
[Bugfix][KV Transfer][NIXL] Notify P node on pre-admission rejection to free stranded KV blocks#41269

Open
Dao007forever wants to merge 13 commits into vllm-project:main from Dao007forever:bug/notif-rejected

Conversation

@Dao007forever
Contributor

Summary

Fixes a bug where prefill KV blocks on the P (prefill) node are stranded for up to VLLM_NIXL_ABORT_REQUEST_TIMEOUT (default 480s) when the D (decode) node rejects an incoming request before it is admitted to the engine scheduler.

When a request that carries KV-transfer params (do_remote_prefill=True, remote_block_ids, remote_engine_id, etc.) is rejected on D for reasons like:

  • Render / chat-template error
  • Model existence check failure (_check_model)
  • Input validation error
  • Engine errored
  • Beam-search-with-stream rejection
  • previous_response_id not found (responses API)

…D never opens a NIXL transfer for it, so P never receives the implicit "transfer complete → free blocks" signal. The blocks linger until the abort timeout fires.

This PR adds an explicit early-rejection notification:

  1. The OpenAI-compatible serving layer (chat_completion, completion, responses) calls engine_client.notify_kv_transfer_request_rejected(...) on every pre-admission rejection path that has kv_transfer_params.do_remote_prefill=True (see the sketch after this list).
  2. AsyncLLMEngineCoreClient (in-process / sync MP / async MP / DPLB-async MP) → EngineCoreSchedulerKVConnectorBase_V1.request_rejected_before_admission(...).
  3. NixlConnectorScheduler recognizes the params and enqueues an empty _reqs_need_recv entry. On the next scheduler tick the worker side issues the notification that releases the remote blocks. To make sure that tick actually happens when D has no other in-flight work, Scheduler.has_requests() now also reports pending connector metadata.
  4. MultiConnector fans out to its child connectors and accepts the first one that recognizes the params.
  5. DPLBAsyncMPClient broadcasts the notification to all local DP engines when no data_parallel_rank header is present (the rejection happens before admission, so the request isn't yet tracked in reqs_in_flight).
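
A minimal sketch of the step-1 guard, assuming a simplified serving layer: `notify_kv_transfer_request_rejected` and the `do_remote_prefill` flag come from this PR, while the helper name, its signature, and the surrounding structure are illustrative only.

```python
# Hedged sketch, not the actual vLLM serving code.
from typing import Any


async def reject_before_admission(
    engine_client: Any,  # illustrative stand-in for the engine client
    request_id: str,
    kv_transfer_params: dict[str, Any] | None,
) -> None:
    """Call on any serving-layer rejection path before engine admission."""
    # Only requests carrying remote-prefill KV-transfer params can strand
    # blocks on the P node; everything else is a no-op.
    if not kv_transfer_params or not kv_transfer_params.get("do_remote_prefill"):
        return
    # Fan the rejection out to the KV connector so the P node frees its blocks
    # immediately instead of waiting for VLLM_NIXL_ABORT_REQUEST_TIMEOUT.
    await engine_client.notify_kv_transfer_request_rejected(
        request_id, kv_transfer_params)
```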

The _reqs_need_recv value type changed from (Request, BlockIds) to (dict[str, Any], BlockIds) because pre-admission rejections do not have a Request object — only the raw kv_transfer_params dict — and that's all the existing code on the _build_kv_connector_meta side actually consumed (req.kv_transfer_params).

If do_remote_prefill is set but the required remote_* metadata is incomplete, the connector logs a warning and returns False (no-op) rather than guessing.
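
To make the two paragraphs above concrete, here is a minimal sketch of the scheduler-side handling, under stated assumptions: the class name `RejectedPrefillTracker`, the exact method signature, and the `REQUIRED_REMOTE_FIELDS` tuple are illustrative stand-ins; the `do_remote_prefill` / `remote_block_ids` / `remote_engine_id` keys, the `(dict, block_ids)` value type, the warn-and-return-False behavior, and the empty-recv entry come from the PR description.

```python
# Hedged sketch, not the actual NixlConnectorScheduler implementation.
import logging
from typing import Any

logger = logging.getLogger(__name__)

# Minimum remote metadata needed to notify the P node (illustrative subset).
REQUIRED_REMOTE_FIELDS = ("remote_block_ids", "remote_engine_id")


class RejectedPrefillTracker:
    """Stand-in for the scheduler-side connector state."""

    def __init__(self) -> None:
        # request_id -> (raw kv_transfer_params, local block ids to receive into)
        self._reqs_need_recv: dict[str, tuple[dict[str, Any], list[int]]] = {}

    def request_rejected_before_admission(
        self, request_id: str, params: dict[str, Any]
    ) -> bool:
        if not params.get("do_remote_prefill"):
            return False  # not a remote-prefill request; nothing to clean up
        if any(params.get(field) is None for field in REQUIRED_REMOTE_FIELDS):
            # Incomplete metadata: warn and report "not handled" rather than guess.
            logger.warning(
                "Rejected request %s has incomplete remote KV-transfer "
                "metadata; skipping early-release notification.", request_id)
            return False
        # Empty local block list: there is nothing to receive, but building the
        # connector metadata on the next scheduler tick emits the notification
        # that lets the P node free its blocks immediately.
        self._reqs_need_recv[request_id] = (params, [])
        return True
```

On the next tick this entry is flushed into the connector metadata exactly like a normal (empty) receive, which is why Scheduler.has_requests() must also report pending connector metadata so that tick actually happens on an otherwise idle D node.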

Why this is not duplicating an existing PR

I checked open PRs and issues with searches like stranded prefill, stranded KV blocks rejected NIXL, request_rejected_before_admission, kv_transfer_params reject, and decode reject prefill. Among the closest matches, no existing PR covers the rejected-before-admission notification path.

Test plan

  • New unit tests added:
    • tests/v1/kv_connector/unit/test_nixl_connector.py::test_rejected_remote_prefill_request_enqueues_empty_recv — verifies the connector enqueues an empty recv with the original remote_* params and Scheduler.has_requests() flips True until the tick flushes it.
    • tests/v1/kv_connector/unit/test_nixl_connector.py::test_rejected_remote_prefill_request_missing_metadata_is_ignored — verifies the connector no-ops (and does not mutate do_remote_prefill) when required remote_* fields are missing.
    • tests/v1/kv_connector/unit/test_multi_connector.py::test_request_rejected_before_admission_uses_first_accepting_connector — verifies short-circuit fan-out behavior in MultiConnector.
  • Local: .venv/bin/python -m pytest tests/v1/kv_connector/unit/test_multi_connector.py tests/v1/kv_connector/unit/test_nixl_connector.py -v — 76 passed, 2 skipped, 1 pre-existing failure (test_multi_example_connector_consistency, which fails with OSError on the gated repo meta-llama/Llama-3.2-1B-Instruct and is unrelated to this change).
  • Pre-commit hooks are staged and will run on push.
  • Reviewer should confirm behavior end-to-end on a P/D NIXL deployment by sending an oversized prompt (or one whose chat template fails) carrying kv_transfer_params.do_remote_prefill=True and observing that the P-side block usage drops immediately rather than after VLLM_NIXL_ABORT_REQUEST_TIMEOUT. See the example request below.
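
A hedged sketch of that manual check: the endpoint URL, model name, and remote_* values below are placeholders (in a real P/D deployment a proxy normally injects kv_transfer_params), so treat this as the shape of the request rather than a working configuration.

```python
# Hypothetical manual check against the D node; all concrete values are placeholders.
import requests

resp = requests.post(
    "http://decode-node:8000/v1/completions",  # D-node URL (placeholder)
    json={
        "model": "my-model",                   # placeholder model name
        "prompt": "x" * 2_000_000,             # oversized on purpose, to force rejection
        "kv_transfer_params": {
            "do_remote_prefill": True,
            "remote_engine_id": "prefill-0",   # placeholder
            "remote_block_ids": [0, 1, 2],     # placeholder
        },
    },
    timeout=30,
)
print(resp.status_code, resp.text)
# Expected: the request is rejected, and P-side block usage drops immediately
# instead of after VLLM_NIXL_ABORT_REQUEST_TIMEOUT.
```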

AI assistance

This change was prepared with assistance from Claude (Anthropic). I (the human submitter) reviewed every changed line, ran the tests above, and can defend the design end-to-end.

🤖 Generated with Claude Code


@claude (bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify (bot) added the frontend, v1, bug (Something isn't working), and kv-connector labels on Apr 29, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request implements a mechanism to notify KV connectors when a request is rejected before engine admission, allowing the NIXL connector to release remote KV blocks. It adds a new interface method and updates the OpenAI serving entrypoints to trigger cleanup during early failures. The review feedback identifies that the current error handling in the serving layer should be expanded to include adapter resolution and model identification, ensuring that all pre-admission failures are correctly handled.

Comment thread vllm/entrypoints/openai/chat_completion/serving.py Outdated
Comment thread vllm/entrypoints/openai/completion/serving.py Outdated
Comment thread vllm/entrypoints/openai/responses/serving.py Outdated
Dao007forever and others added 3 commits April 30, 2026 17:51
…to free stranded KV blocks

When a request carrying KV-transfer params is rejected on the D node before
it is admitted to the engine scheduler (e.g. validation failure, render
error, model check failure), the P node has no way to learn about the
rejection and the prefill KV blocks remain pinned until
VLLM_NIXL_ABORT_REQUEST_TIMEOUT (default 480s).

This change plumbs a `notify_kv_transfer_request_rejected` path from the
OpenAI-compatible serving layer down through the engine client, EngineCore,
scheduler, and KV connector. For NIXL, the connector schedules an empty
recv with the original `remote_*` params so the worker side issues a
notification that frees the prefill blocks immediately. The scheduler also
exposes `has_requests()` so the engine loop wakes up to flush the cleanup
even when no admitted requests are running.

Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
@njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Apr 30, 2026
@liuzijing2014
Collaborator

Q: what's the benefit vs setting VLLM_NIXL_ABORT_REQUEST_TIMEOUT=60s and letting prefill naturally time out? 60s feels fine for edge-case handling (there are many cases where decode could fail and never fetch from prefill, e.g. an engine-level KV allocation failure).

@Dao007forever
Contributor Author

60s is still very large in our tests; it reduced throughput significantly because concurrency was only ~4-5, so one hanging request affects ~25% of the concurrency.

Member

@njhill left a comment


LGTM now, thanks @Dao007forever. Just one more minor comment from a final review pass.

Comment thread vllm/v1/engine/__init__.py Outdated

Labels

bug (Something isn't working), frontend, kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1

3 participants