[KV Connector]: Support KV push from Prefill to Decode node using Nixl KV Connector#35264
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a significant feature to support KV cache push from a Prefill node to a Decode node using the Nixl KV Connector, aiming to improve TTFT in disaggregated inference setups. The implementation correctly handles two synchronization scenarios between the nodes and includes necessary changes across the connector, engine, and worker components. The logic for handling block registration and the two push scenarios seems well-thought-out. However, I've identified a critical bug in the _write_blocks method that could lead to an UnboundLocalError for models with local attention, which needs to be addressed.
|
Hi @snadampal, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
yewentao256
left a comment
There was a problem hiding this comment.
Please solve the comments, pre-commit issue.
Also, please reduce diff to make it easier for review.
Adding lm_eval metrics to make sure we have the correct acc
|
@yewentao256 , thanks for the review, will take care of the comments. btw, I'm implementing heterogeneous TP and heterogeneous block_size between P and D nodes (the current version supports only homogeneous) and will update the PR soon. |
e9bc2b5 to
a2aa7a4
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
|
Hi @yewentao256 , I have updated the PR with the features I mentioned above and also I've unified the push and pull modes as much as possible to avoid any redundant code. Still it's a large PR, but this is the best I could condense it to. Please review and let me know your comments. I have started this on yesterday's commit and today it already has conflicts :( |
|
I will write a design doc (in RFC format) to help with the reviews. |
|
I have captured design and implementation details in the following design/RDC doc. I hope this helps with reviewing the PR. cc: @yewentao256 , @NickLucche |
a2aa7a4 to
600c32f
Compare
600c32f to
cbbc457
Compare
|
I have updated the PR to split the implementation into 8 commits; the first 4 commits being just the existing code refactor with zero functionality change, and the last 4 commits are for so that it will be much easier for review and also we can take commit by commit into mainline without disturbing anything. I have rebased it to the latest ( so it's now with hybrid kv feature) and run the benchmarks and accuracy tests, the performance numbers are same as those I had posted above. As per Nicolo's suggestion, I have also hosted it as an out-of-tree connector as a plugin for people to try out quickly on top of vllm. |
cbbc457 to
f071ab2
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
NickLucche
left a comment
There was a problem hiding this comment.
thanks for the work @snadampal !
Will try to run the numbers on my side too and then we can iron out a few details with this PR
|
Hi @NickLucche, I have done some more deep dive into Push mode performance gains and got some interesting findings:
Please let me know your observations. |
|
This pull request has merge conflicts that must be resolved before it can be |
NickLucche
left a comment
There was a problem hiding this comment.
Mega job here @snadampal !
Code is clean and the docs diagram goes a long way into explaining the flow.
I also feel pretty good about the pull separation in terms of maintainability.
I left some comments, none of them is major, so feel free to address them as you see fit, we can even defer some of these ideas to a follow up PR (especially the one around thread state separation, if at all).
f706249 to
2958f67
Compare
6ec30d4 to
f253643
Compare
|
Hi @NickLucche , thanks for the great feedback, as always! I have incorporated everything except making a new Writer class. The PR is now has been rebased, all the feedback incorporated, and tested for unit tests, proxy, and benchmarking script. |
f253643 to
d598939
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
Head branch was pushed to by a user without write access
7c0767d to
dc8bdee
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
…NixlPullConnector Refactor the NIXL KV connector into base and pull-specific classes to enable adding push-based KV transfer as a separate subclass. - The monolithic NixlConnector, NixlConnectorScheduler, and NixlConnectorWorker are split into NixlBaseConnector/NixlPullConnector, NixlBaseConnectorScheduler/NixlPullConnectorScheduler, and NixlBaseConnectorWorker/NixlPullConnectorWorker respectively. - Base classes contain shared infrastructure (memory registration, handshakes, TP mapping, transfer completion tracking, heartbeats, post-processing). - Pull classes contain READ-specific logic (start_load_kv, _read_blocks_for_req, _read_blocks, _get_new_notifs, get_num_new_matched_tokens, update_state_after_alloc, request_finished). - Backward-compatible shim files preserve existing imports (NixlConnector, NixlConnectorScheduler, NixlConnectorWorker). - No functional changes. Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Scheduler stages registrations / finished blocks via metadata Worker runs a dedicated nixl-push-writer thread that owns all push NIXL ops (get_new_notifs, send_notif for PUSH_REG, make_prepped_xfer, transfer) off the engine main thread. Writer wakes on engine signals from start_load_kv / get_finished, self-polls only while waiting for unmatched PUSH_REG notifs, and clears state on lease expiration. Adds has_pending_push_work() to keep EngineCore stepping while pushes are in flight, plus a soft per-registration watchdog. D-registers-first and P-finishes-first cases are handled symmetrically; request-id matching falls back to UUID extraction to tolerate retries. Co-authored-by: Sunita Nadampalli <nadampal@amazon.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Add tests/v1/kv_connector/unit/test_nixl_push_connector.py covering the push (WRITE) connector's scheduler and writer-thread mechanics without requiring a real NIXL agent or network. Coverage (30 tests): - TestPushScheduler (6): D-side update_state_after_alloc staging, P-side request_finished staging, build_connector_meta drain on both sides, has_pending_push_work lifecycle, update_connector_output cleanup, registration watchdog expiration. - TestPushWriterMatching (4): scenario 1 (P finished first) and scenario 2 (D registered first), fuzzy base-request-id matching for retried requests, malformed PUSH_REG payload drops. - TestPushWriterStartLoadKv (2): finished-blocks inbox match against stashed D registration, start_load_kv enqueueing to writer queues. - TestPushWriterNotifs (2): forwarded completion notif processing, get_finished eviction enqueue + writer wake. - TestPushSchedulerNegative (6): no kv_transfer_params, invalid remote_block_ids payload, num_external_tokens=0, RUNNING status, empty block groups, unknown-request update_connector_output. - TestPushWriterNegative (10): empty pop helpers, no-fuzzy when base UUIDs differ, non-dict payload, non-string request_id, idempotent duplicate PUSH_REG, eviction cardinality, no-op get_finished still wakes the writer, unknown completion notif, empty-metadata start_load_kv. Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Add examples/disaggregated/disaggregated_serving/disagg_proxy_pushconnector_demo.py, the push-mode counterpart to disagg_proxy_demo.py. Same client-facing API (/v1/completions, /v1/chat/completions, /status, /instances/add, xPyD round-robin scheduling, runtime instance add) but speaks the NixlPushConnector wire protocol on D's side: kv_transfer_params carries P's coordinates (engine_id, host, side-channel port, tp_size). P's prefill leg is issued first (max_tokens=1, do_remote_decode=True) and drained, then the decode leg is streamed to the client while D registers blocks with P and waits for the NIXL WRITE. Required CLI: --prefill-engine-id, --prefill-kv-host, --prefill-side-channel-port, --prefill-tp-size (P coordinates that pull mode would learn from P's response but push mode needs up-front). Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Adds docs/design/nixl_kv_push_connector.md covering the push-mode threading model, wake sources, writer-local matching tables, PUSH_REG wire format, scheduler responsibilities, watchdogs/leases, and failure handling. Includes a Mermaid sequence diagram of the end-to-end P/D flow. Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
dc8bdee to
6f7aa2a
Compare
RFC #36923
Implemented KV push feature where Prefill node pushes
its KV blocks to Decode node as soon as the model executor
completes the forward pass and finishes request.
The implementation supports heterogeneous TP and
heterogeneous block sizes between P and D nodes
And it covers both the scenarios:
Scenario 1: D registers blocks with P before P finishes generating KV
Scenario 2: P has the KV ready before D registers
Purpose
To improve TTFT for P-D disaggregated inference deployment.
Test Plan
Manually tested Inference on P-D disaggregated setup
Test Result
Tested 1p1d configuration for pushing KV from prefill to decode node and validated accuracy of the results.
Benchmarking results
I ran
vllm bench servewithsonnet datasetfor different input and output token lengths. KV push mode (this PR) showed 1.2x - 3.0x improvements over pull mode across different input and out lengths. Following are the performance numbers measured on AWS P5en instance and similar improvements are observed on AWS Trn2 instances as well where the feature was originally developed.TODO:
Phase2:
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.