[Nixl][PD] Lease renewal TTL KV blocks on P #38027
NickLucche wants to merge 11 commits into vllm-project:main
Conversation
Signed-off-by: NickLucche <nlucches@redhat.com>
Documentation preview: https://vllm--38027.org.readthedocs.build/en/38027/
Code Review
This pull request introduces a heartbeat-based lease management system for KV cache blocks in the Nixl connector. It replaces the single VLLM_NIXL_ABORT_REQUEST_TIMEOUT with VLLM_NIXL_KV_LEASE_DURATION, VLLM_NIXL_KV_LEASE_EXTENSION, and VLLM_NIXL_KV_HEARTBEAT_INTERVAL to manage KV block leases between producer and consumer. The changes include implementing D-side heartbeat sending, P-side lease extension and expiration, and updating related documentation and environment variable definitions. Review comments highlight inconsistencies in default values between documentation and code for the new environment variables, suggest improving a test by using the caplog fixture, point out a potential log pollution issue with empty heartbeat messages, and identify an inefficiency in the cleanup logic for _pending_transfers_by_engine.
for engine_reqs in self._pending_transfers_by_engine.values():
    engine_reqs.discard(req_id)
The cleanup of _pending_transfers_by_engine iterates through all engine request sets to remove the completed req_id. This is inefficient, with a complexity of O(N_engines).
You can optimize this to O(1). Inside _pop_done_transfers, you can access self._recving_metadata to get the meta object for the req_id. From the metadata, you can get the engine_id and directly remove the req_id from the specific engine's request set in _pending_transfers_by_engine without iterating over all engines.
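A minimal sketch of that O(1) cleanup as a standalone helper (the ReqMeta stand-in and helper name are hypothetical; the real connector keeps this state on self):

```python
from dataclasses import dataclass

@dataclass
class ReqMeta:
    # Hypothetical stand-in for the per-request metadata in _recving_metadata.
    engine_id: str

def cleanup_pending_transfer(
    pending_by_engine: dict[str, set[str]],
    recving_metadata: dict[str, ReqMeta],
    req_id: str,
) -> None:
    """Remove a finished req_id from its engine's pending set in O(1)."""
    meta = recving_metadata.get(req_id)
    if meta is None:
        return
    engine_reqs = pending_by_engine.get(meta.engine_id)
    if engine_reqs is None:
        return
    engine_reqs.discard(req_id)
    if not engine_reqs:
        # Drop the empty per-engine entry so the dict stays bounded.
        del pending_by_engine[meta.engine_id]
```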
I'd rather keep the tight execution path as lean as possible; there's really no need to clean up in _pop_done_transfers if the data structure is well bounded
But this code is cleaning up in pop_done_transfers()? (Not sure I understand your comment)
The issue here is that you have no information about which engine this request relates to.
It would make sense to me if transfers were req_id -> (engine_id, handle), which would solve this?
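A rough sketch of that alternative keying, with hypothetical helper names, where each in-progress transfer records the engine it belongs to:

```python
# req_id -> (engine_id, list of transfer handles)
recving_transfers: dict[str, tuple[str, list[int]]] = {}

def track_transfer(req_id: str, engine_id: str, handle: int) -> None:
    # All handles for a request come from the same remote engine.
    _, handles = recving_transfers.setdefault(req_id, (engine_id, []))
    handles.append(handle)

def pop_done_transfer(req_id: str) -> str | None:
    """On completion we now know which engine the request belonged to."""
    entry = recving_transfers.pop(req_id, None)
    return entry[0] if entry else None
```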
I have to say, I don't quite understand the design of this lease extension
The key issue we are solving is that requests get stuck in the WAITING queue in the scheduler if the KV cache is overloaded.
IIUC, if the requests are in the WAITING queue, they have not yet been passed to the NIXL connector. So we need a mechanism by which requests that have NOT YET been passed to the NIXL connector can extend their leases
Am I missing something on this?
This pull request has merge conflicts that must be resolved before it can be merged.
markmc left a comment
Overall looks great ... until I saw @robertgshaw2-redhat's comment 🤣
we need a mechanism by which the requests that have NOT YET been passed to the NIXL connector to be able to extend the leases
- `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)
  - Default: 480
  - If a request is aborted and the decoder has not yet read the KV-cache blocks through the nixl channel, the prefill instance will release its KV-cache blocks after this timeout to avoid holding them indefinitely.
Just thinking about ... "is it ok to just drop this, what about compatibility?" ... which reminds me ...
We need P and D to upgrade to this in lockstep, so that means we need to update NIXL_CONNECTOR_VERSION
- `VLLM_NIXL_ABORT_REQUEST_TIMEOUT`: Timeout (in seconds) for automatically releasing the prefiller’s KV cache for a particular request. (Optional)
  - Default: 480
  - If a request is aborted and the decoder has not yet read the KV-cache blocks through the nixl channel, the prefill instance will release its KV-cache blocks after this timeout to avoid holding them indefinitely.
- `VLLM_NIXL_KV_LEASE_DURATION`: Initial lease duration (in seconds) for KV blocks on the prefiller. (Optional)
See #25700 - let's model this as config properly. Probably kv_connector_extra_config is the place
The "lease_extension" and "heartbeat_interval" configs are particularly niche - we would probably be fine with a hard-coded lease_extension = lease_duration * 2 /3 and heartbeat_interval = lease_duration / 6. And if anyone really needs to tweak these, we can add that later 👍
# In progress transfers.
# [req_id -> list[handle]]
# In-progress transfer tracking (D-side / consumer).
# Keyed by req_id to ensure ALL handles complete before marking done.
Not sure I understand this comment
# grow indefinitely when new P remotes are added.
for k in list(self._pending_transfers_by_engine.keys()):
    if not self._pending_transfers_by_engine[k]:
        del self._pending_transfers_by_engine[k]
This seems a bit odd - why not clear out the per-engine entry in _pop_done_transfers() when the set is empty?
continue

# Build batched heartbeat message: "HB:req1,req2,req3,..."
heartbeat_msg = ("HB:" + ",".join(req_ids)).encode()
Is there any limit to the notification message size? This string can be pretty huge?
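For reference, a minimal sketch of how the receiving side could parse such a batched heartbeat notification (format taken from the diff above; the function name is hypothetical):

```python
HEARTBEAT_PREFIX = b"HB:"

def parse_heartbeat(notif: bytes) -> list[str]:
    """Split a batched 'HB:req1,req2,...' notification back into request ids."""
    if not notif.startswith(HEARTBEAT_PREFIX):
        return []
    payload = notif[len(HEARTBEAT_PREFIX):].decode()
    # An empty payload ("HB:") carries no request ids; filter it out.
    return [r for r in payload.split(",") if r]
```

If message size ever becomes a concern, the request ids could also be split across several notifications rather than one string.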
# Build batched heartbeat message: "HB:req1,req2,req3,..."
heartbeat_msg = ("HB:" + ",".join(req_ids)).encode()

# Send to ALL remote agents we handhshaked with for this remote.
Suggested change:
-# Send to ALL remote agents we handhshaked with for this remote.
+# Send to ALL remote agents we handshaked with for this remote.
try:
    self.nixl_wrapper.send_notif(agent_name, notif_msg=heartbeat_msg)
    num_notifs += 1
except Exception as e:
I'm really not a fan of catching Exception like this ... e.g. we shouldn't swallow a simple programming error like TypeError ... but you're just copying an existing pattern here, so nevermind
for handle in handles:
    self.nixl_wrapper.release_xfer_handle(handle)
self._recving_transfers.clear()
self._pending_transfers_by_engine.clear()
This is unnecessary IMO - no need to free memory in shutdown(), that'll be handled by garbage collection. Only need to release other resources here
# Heartbeat/lease management for D-side (consumer).
# Single timestamp suffices - heartbeat interval limits overall send rate,
# not per-engine. New engines get fresh leases on P-side anyway.
Scratched my head on this one ... you're saying "there's no need to track last heartbeat per engine". I think a comment like this would be more helpful in send_lease_heartbeats()
# Check if enough time has passed since last heartbeat.
if now - self._last_heartbeat_time < self._heartbeat_interval:
    return
I think I'd expect to see
    self._last_heartbeat_time = now
here. Why would we want to come back here again immediately if no notifications are sent?
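A small sketch of the throttling pattern with the timestamp recorded up front, as suggested (standalone names, not the connector's actual attributes):

```python
import time

class HeartbeatThrottle:
    def __init__(self, interval: float):
        self.interval = interval
        self.last_heartbeat_time = 0.0

    def should_send(self) -> bool:
        now = time.monotonic()
        if now - self.last_heartbeat_time < self.interval:
            return False
        # Record the attempt immediately so we don't come back here on the
        # very next call, even if nothing ends up being sent this round.
        self.last_heartbeat_time = now
        return True
```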
Closing in favor of #41383
This PR adds heartbeat-based lease renewal for NIXL KV cache blocks in disaggregated P/D deployments.
Problem
The existing timeout/TTL mechanism for ensuring KV cache blocks are eventually cleared on P after D disconnects (or in edge-case abort scenarios) can lead to severe performance degradation due to the retention of "dead" blocks.
With the current default VLLM_NIXL_ABORT_REQUEST_TIMEOUT=480s, when D crashes, P may retain several GBs of cache for in-flight requests for up to 8 minutes. Subsequent requests hitting P will only have a portion of the cache at their disposal.
Approach
We augment the TTL mechanism with lease renewal logic:
- P grants the KV blocks of each transferred request an initial lease (VLLM_NIXL_KV_LEASE_DURATION).
- D periodically sends heartbeats for its in-flight requests; each heartbeat extends the corresponding lease on P (VLLM_NIXL_KV_LEASE_EXTENSION).
- If no heartbeat arrives before the lease expires, P releases the blocks, replacing the fixed VLLM_NIXL_ABORT_REQUEST_TIMEOUT behavior.
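As a rough illustration of the P-side bookkeeping (a sketch with hypothetical names, not the connector's actual implementation):

```python
import time

class PrefillerLeaseTable:
    """Request-level KV block leases on P; extended whenever a heartbeat names the request."""

    def __init__(self, lease_duration: float, lease_extension: float):
        self.lease_duration = lease_duration
        self.lease_extension = lease_extension
        self._expiry: dict[str, float] = {}

    def grant(self, req_id: str) -> None:
        # Called when the request's KV blocks become ready for D to pull.
        self._expiry[req_id] = time.monotonic() + self.lease_duration

    def extend(self, req_id: str) -> None:
        # Called when a heartbeat naming this request arrives from any D worker.
        if req_id in self._expiry:
            self._expiry[req_id] = time.monotonic() + self.lease_extension

    def pop_expired(self) -> list[str]:
        # Requests whose lease lapsed; their KV blocks can be freed on P.
        now = time.monotonic()
        expired = [r for r, t in self._expiry.items() if t < now]
        for r in expired:
            del self._expiry[r]
        return expired
```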
Key Design Decisions:
We lease at the request/transfer level rather than per-D-instance (e.g., "D0 is alive, refresh all its blocks"). This is because P has no notion of which D its KV blocks belong to. If P tracked per-D ownership, we couldn't defer D selection until after prefill completes; we'd have to select both P and D upfront in the load balancer.
In practice, D batches lease extensions toward the same P by grouping requests with the same remote_engine_id.
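A sketch of that D-side batching, grouping in-flight request ids by remote_engine_id and building one heartbeat message per remote engine (function and argument names are hypothetical; the real code sends via the nixl_wrapper send_notif path):

```python
from collections import defaultdict
from typing import Callable, Iterable

def send_lease_heartbeats(
    in_flight: Iterable[tuple[str, str]],       # (req_id, remote_engine_id) pairs
    agents_by_engine: dict[str, list[str]],     # engine_id -> remote agent names
    send_notif: Callable[[str, bytes], None],   # notification send callback
) -> None:
    # Group requests by the remote engine that holds their KV blocks.
    reqs_by_engine: dict[str, list[str]] = defaultdict(list)
    for req_id, engine_id in in_flight:
        reqs_by_engine[engine_id].append(req_id)

    for engine_id, req_ids in reqs_by_engine.items():
        # One batched "HB:req1,req2,..." message per remote engine.
        heartbeat_msg = ("HB:" + ",".join(req_ids)).encode()
        # Heterogeneous TP: every P worker tracks leases independently,
        # so the heartbeat goes to ALL agents of that engine.
        for agent_name in agents_by_engine.get(engine_id, []):
            send_notif(agent_name, heartbeat_msg)
```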
Rather than adding FE API changes or managing additional ZMQ connections, we reuse the existing NIXL notification system (send_notif / get_new_notifs) to send heartbeats.
The notif system is backend-specific, and fallback from IB/RoCE to TCP as the transport medium is already taken care of.
Heartbeat sending/processing happens in start_load_kv / get_finished, not in a separate background thread. This means timing isn't precise: longer model execution delays heartbeats. However:
A consideration on Heterogeneous TP support: When P TP > D TP (e.g., P TP=4, D TP=2), a single D worker pulls KV blocks from multiple P workers. Each P worker independently tracks lease expiration. Therefore, heartbeats must be sent to ALL P workers for a given engine.
Conversely, when D TP > P TP, a single P will receive notifications from multiple Ds. This causes the TTL to be refreshed multiple times, but with no downside.
Default Configuration
Test With
cc @robertgshaw2-redhat @markmc @ZhanqiuHu