Skip to content

[Feat][NIXL] Add KV lease refresh mechanism for disaggregated prefill#35764

Draft
robertgshaw2-redhat wants to merge 5 commits intomainfrom
lease-refresh
Draft

[Feat][NIXL] Add KV lease refresh mechanism for disaggregated prefill#35764
robertgshaw2-redhat wants to merge 5 commits intomainfrom
lease-refresh

Conversation

@robertgshaw2-redhat
Copy link
Copy Markdown
Collaborator

Summary

  • Problem: The static VLLM_NIXL_ABORT_REQUEST_TIMEOUT on the P side causes premature KV block expiry when the D queue is bursty, and wastes memory when set large enough to be safe.
  • Solution: Replace it with an active lease mechanism — D workers periodically refresh the lease while requests are queued, so P only frees blocks if D truly goes silent (crashed or failed).

Key changes

  • D scheduler (NixlConnectorScheduler): tracks all do_remote_prefill=True requests in _requires_lease_dict (req_id → P host/http_port) until they are scheduled. A daemon thread (nixl-d-lease-refresh) POSTs to P's /internal/nixl/lease_refresh every timeout // 3 seconds.
  • P API server: new route POST /internal/nixl/lease_refresh (registered unconditionally; no-op on non-NIXL instances) calls engine_client.call_utility_async("nixl_lease_refresh", request_ids).
  • P EngineCore: new nixl_lease_refresh() utility method delegates to connector.refresh_lease(), which runs in the engine loop thread so no locking is needed.
  • P scheduler (NixlConnectorScheduler.refresh_lease): updates _lease_refreshes[req_id] = now + timeout; passed to the worker via NixlConnectorMetadata.reqs_to_refresh each step.
  • P worker (NixlConnectorWorker): applies refreshes in start_load_kv; get_finished now does a full scan instead of early-break (expiry order is no longer guaranteed monotone after refreshes).
  • New env var VLLM_NIXL_HTTP_PORT (default 8000): tells D where P's HTTP server is; P includes remote_http_port in kv_transfer_params.

Thread safety

State Writers Readers Lock
_requires_lease_dict scheduler thread lease-refresh thread _requires_lease_lock
_lease_refreshes engine loop (refresh_lease) engine loop (build_connector_meta) none (single thread)

Test plan

  • test_abort_timeout_on_prefiller — unchanged behaviour: D never adds the request to _requires_lease_dict (no remote_host), so P expires after VLLM_NIXL_ABORT_REQUEST_TIMEOUT as before
  • test_disagg_accuracy.py — D refreshes leases while requests queue; transfer completes normally
  • Manual: kill D mid-queue, verify P frees blocks after timeout

🤖 Generated with Claude Code

Replace the static P-side KV block timeout with an active lease
mechanism. D workers periodically POST /internal/nixl/lease_refresh
to extend the hold window while requests sit in the D queue, preventing
premature block expiry on bursty workloads without requiring a large
static timeout.

- D scheduler tracks pending remote-prefill requests in
  `_requires_lease_dict`; a background thread POSTs refreshes every
  `timeout // 3` seconds
- P scheduler receives refreshes via new EngineCore utility method
  `nixl_lease_refresh`, stores updated expiry in `_lease_refreshes`,
  and passes them to the P worker through `NixlConnectorMetadata`
- P worker applies refreshes in `start_load_kv` and does a full scan
  (not early-break) in `get_finished` since expiry order may change
- New `VLLM_NIXL_HTTP_PORT` env var (default 8000) lets D locate P's
  HTTP server; P includes it in `kv_transfer_params`
- New FastAPI route registered unconditionally in `build_app`; no-op
  on non-NIXL instances

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Signed-off-by: Robert Shaw <robshaw@redhat.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a KV lease refresh mechanism for disaggregated prefill in vLLM, addressing the issue of premature KV block expiry due to static timeout values. The changes include modifications to the D scheduler, P API server, P EngineCore, P scheduler, and P worker, along with a new environment variable for configuring the P's HTTP port. The implementation involves tracking requests requiring lease refresh, periodically refreshing leases from D workers, and updating lease expiry on the P side. The changes also include a full scan in get_finished to handle lease refreshes.

Comment on lines +701 to +711
with self._requires_lease_lock:
if request.request_id not in self._requires_lease_dict:
remote_host = params.get("remote_host", "")
remote_http_port = params.get(
"remote_http_port", envs.VLLM_NIXL_HTTP_PORT
)
if remote_host:
self._requires_lease_dict[request.request_id] = (
remote_host,
remote_http_port,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The code adds the remote_host and remote_http_port to _requires_lease_dict if remote_host exists. However, it does not handle the case where remote_host is empty. This could lead to issues if a request is added to _requires_lease_dict without a valid remote_host, potentially causing the lease refresh mechanism to fail. It's critical to ensure that only requests with valid remote_host values are added to _requires_lease_dict to prevent unexpected behavior.

                        if remote_host:
                            self._requires_lease_dict[request.request_id] = (
                                remote_host,
                                remote_http_port,
                            )

Comment on lines +937 to +939
with urlreq.urlopen(req, timeout=5) as resp:
if resp.status != 200:
raise RuntimeError(f"HTTP {resp.status}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The code checks the HTTP status code but raises a generic RuntimeError without providing specific details about the error. This makes it difficult to debug issues related to lease refresh failures. It's critical to include the URL in the error message to provide more context for debugging.

                        if resp.status != 200:
                            raise RuntimeError(f"HTTP {resp.status} at {url}")

Comment on lines +2087 to +2089
expired = [
req_id for req_id, expires in self._reqs_to_send.items() if now >= expires
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The code iterates through self._reqs_to_send to identify expired requests. However, it doesn't handle potential exceptions that might occur during the iteration or within the list comprehension. This could lead to the loop terminating prematurely and not releasing all expired KV blocks. It's critical to add error handling to ensure that all expired requests are processed and their KV blocks are released, even if some requests encounter issues.

        now = time.perf_counter()
        expired = []
        for req_id, expires in self._reqs_to_send.items():
            try:
                if now >= expires:
                    expired.append(req_id)
            except Exception as e:
                logger.warning(f"Error checking expiry for request {req_id}: {e}")

Robert Shaw added 4 commits March 2, 2026 21:39
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Signed-off-by: Robert Shaw <robshaw@redhat.com>
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Mar 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 6, 2026
@markmc
Copy link
Copy Markdown
Member

markmc commented Mar 25, 2026

xref #38027

@markmc
Copy link
Copy Markdown
Member

markmc commented Apr 16, 2026

Relevant: https://github.com/llm-d/llm-d/blob/84ba2f7004abf3ba0e323fd8ffe3f8ce3c94656f/docs/wip-docs-new/architecture/advanced/disaggregation/operations-vllm.md

KV blocks are stranded on the P instance until the timeout VLLM_NIXL_ABORT_REQUEST_TIMEOUT, which defaults to 480s. We are currently working on a lease-extension strategy that will dramatically shorten the timeout window.

@mergify mergify Bot removed the needs-rebase label Apr 23, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-redhat.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants