[P/D] Prefill compute optimizations with bi-directional KV cache transfers between P and D nodes#32553
snadampal wants to merge 2 commits into vllm-project:main
Conversation
Code Review
This pull request introduces significant enhancements for KV cache management, enabling query-only requests and remote KV cache pulling for prefill workers. These features will improve performance for multi-turn conversations and cache eviction recovery. The implementation is well-structured, with logical changes across the scheduler and the NIXL KV connector. I've identified one potential robustness issue that could lead to a crash under specific configurations, and I've provided a suggestion to address it.
vllm/v1/core/sched/scheduler.py
Outdated
```python
kv_transfer_params["remote_block_ids"] = \
    cached_computed_blocks.get_block_ids()[0]
```
There was a problem hiding this comment.
The code cached_computed_blocks.get_block_ids()[0] assumes that get_block_ids() will always return a non-empty tuple, allowing it to be indexed at [0]. However, if a model is configured with a KV connector but has no KV cache groups (e.g., an encoder-only model), get_block_ids() could return an empty tuple (), leading to an IndexError and a crash. While this might be a misconfiguration, the code should be more robust to prevent this.
```python
kv_transfer_params["remote_block_ids"] = (cached_computed_blocks.get_block_ids() or [[]])[0]
```
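To illustrate why the guard matters — a minimal sketch, assuming `get_block_ids()` returns a tuple of per-KV-cache-group block-id lists (empty when there are no KV cache groups):

```python
def first_group_block_ids(block_ids: tuple[list[int], ...]) -> list[int]:
    # block_ids[0] would raise IndexError on an empty tuple;
    # the `or [[]]` fallback yields an empty list instead of crashing.
    return (block_ids or [[]])[0]
```

With zero groups this returns `[]` rather than raising, which is the robustness the suggestion is after.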
I think this feature is a good idea; however, I'm a bit worried about having an implicit implementation that overloads the meaning of the fields. I wonder if we can come up with a design that enables this explicitly. Ping me on Slack if you want to discuss further.
@robertgshaw2-redhat, thanks for the prompt review! Sure, will discuss with you on Slack.
It would be very useful to have a much more detailed description of the problem being solved, background context including which project will use this new API, how this approach solves the problem, and alternatives considered. Perhaps the context is obvious to others, but not to me at least. For inspiration on what I mean by more details, take a look at #24256 and #24520.
Hi @robertgshaw2-redhat, @markmc, I have created the following RFC with more details on use cases, challenges, and the proposed solution.
Hi @snadampal, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Force-pushed d988d3b to e7b030c.
Force-pushed e7b030c to d170329.
During benchmarking, I identified that the additional round-trip to the Decode node to query KV block metadata introduces latency overhead. To address this, I added support for returning kv_transfer_params as part of the streaming response from the Decode node, eliminating the need for a separate query. The KV params query feature is still included, as it will be useful for initial node pairing with vLLM engine-specific details.
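For illustration, the final streaming chunk from the Decode node might look like the sketch below. The field names inside `kv_transfer_params` are assumptions modeled on the NIXL connector's remote-endpoint parameters, not the exact wire format:

```python
# Hypothetical final streaming payload from the Decode node: the KV transfer
# metadata piggybacks on the last chunk, so the Prefill side needs no
# separate metadata query round-trip.
final_chunk = {
    "id": "cmpl-123",
    "object": "text_completion",
    "choices": [{"index": 0, "text": "", "finish_reason": "stop"}],
    "kv_transfer_params": {
        "remote_engine_id": "decode-0",
        "remote_host": "10.0.0.2",
        "remote_port": 5600,
        "remote_block_ids": [0, 1, 2],
    },
}
```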
Force-pushed d170329 to 0754908.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 0754908 to ee248b9.
I've rebased the PR onto the latest commit, which brings in HMA support. So far I have tested full-attention-layer models; next I am testing models with hybrid KV (full attention + sliding window attention).
Prefill worker can pull KV cache from remote engines to eliminate redundant prefill computation.

Benefits:
(1) Multi-turn conversations: Prefill worker loads larger KV cache from decode worker
(2) Cache eviction recovery: Prefill worker retrieves evicted blocks still available on decode worker

Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Force-pushed ee248b9 to a8905ec.
Hi @NickLucche, I have added support for hybrid KV (full + sliding window attention) for bi-directional pull, fixed the pre-commit warnings, and also ran unit tests (reports added to the test section of the PR description above). Please let me know if there are any other prerequisites to get this PR reviewed and merged. Thank you!
NickLucche left a comment
@snadampal One alternative to modifying the public API interface
```diff
 class CompletionStreamResponse(OpenAIBaseModel):
     id: str = Field(default_factory=lambda: f"cmpl-{random_uuid()}")
     object: str = "text_completion"
     created: int = Field(default_factory=lambda: int(time.time()))
     model: str
     choices: list[CompletionResponseStreamChoice]
     usage: UsageInfo | None = Field(default=None)
+    # vLLM-specific: KV transfer params returned with the final chunk
+    kv_transfer_params: dict[str, Any] | None = None
```
could be to use our inner tokens-in-tokens-out interface (/inference/v1/generate with kv_transfer support #38094).
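A minimal stand-in (plain dataclass, not the actual pydantic model) showing how a None-valued `kv_transfer_params` can be dropped during serialization, so ordinary chunks look unchanged on the wire and only the final chunk carries the extra field:

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class StreamChunkSketch:  # simplified stand-in for CompletionStreamResponse
    id: str
    object: str = "text_completion"
    kv_transfer_params: Optional[dict[str, Any]] = None


def to_wire(chunk: StreamChunkSketch) -> str:
    # Drop None fields, mirroring pydantic-style exclude_none serialization,
    # so chunks without KV metadata are byte-identical for existing clients.
    return json.dumps({k: v for k, v in asdict(chunk).items() if v is not None})
```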
tlrmchlsmth left a comment
Pinging a couple of folks on the llm-d and dynamo teams: @vMaroon @alec-flowers FYI this PR has some interface changes to the kv transfer interface to allow for bi-directional transfer, so would be great if you could take a look (and forward to relevant teammates). Thank you!
…eature

Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
I have added a new unit test for this feature. Next, I will update the inner tokens-in-tokens-out interface (/inference/v1/generate) along with the public API.
Thanks for the front-end API suggestion @NickLucche. I see the token generator interface (…). @vMaroon, @alec-flowers, appreciate your review/feedback on this PR. Please find the design details along with the router here: #32733
NickLucche left a comment
Left some comments, thanks @snadampal !
```python
delay_free = any(len(group) > 0 for group in block_ids)
if delay_free:
    # Track in _reqs_in_batch so the worker adds it to
    # _reqs_to_process, which is required for _reqs_to_send
    # to be accepted by the worker.
    self._reqs_in_batch.add(request.request_id)
    self._reqs_need_send[request.request_id] = (
        time.perf_counter() + envs.VLLM_NIXL_ABORT_REQUEST_TIMEOUT
    )
```
I think holding blocks on D complicates the workflow for when P dies, or when a lot of requests end up delaying the execution of the read back from P to D.
Can we make this feature optional for now using `self.kv_transfer_config.get_from_extra_config`?
```python
do_remote_prefill=False,
do_remote_decode=True,
```
I am not sure these do_remote* flags matter to the router at this point, given we're adding remote-num-tokens here to signal a "special" D-to-P request.
```python
logger.info(
    "Skipping remote pull for %s: %d remote tokens < threshold %d",
    request.request_id,
    count,
    self.remote_pull_threshold,
)
```
```python
# Remote pull threshold: minimum number of remote tokens
# before P pulls KV from D instead of recomputing locally.
self.remote_pull_threshold: int = int(
```
nit: what other names could we use for this variable to make it more informative?
```python
elif all(
    p in params
    for p in ("remote_engine_id", "remote_host", "remote_port")
):
    unhashed_local_block_ids_p: BlockIds = (
        blocks.get_unhashed_block_ids_all_groups()
    )
    local_block_ids = self.get_sw_clipped_blocks(
        unhashed_local_block_ids_p
    )

    # Get unhashed blocks to pull into from remote.
    self._reqs_need_recv[request.request_id] = (
        request,
        local_block_ids,
    )
```
I think this branch can be unified with what's already happening for D, so that ideally we have the same code but two conditions: if D, or if P with multi-turn pull mode.
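One way to sketch the suggested unification — a single recv path guarded by two conditions. The parameter names `is_prefill_worker` and `multiturn_pull_enabled` are illustrative, not actual connector fields:

```python
def should_schedule_recv(params: dict, is_prefill_worker: bool,
                         multiturn_pull_enabled: bool) -> bool:
    # Both D and P need complete remote-endpoint metadata to pull.
    has_remote = all(
        p in params for p in ("remote_engine_id", "remote_host", "remote_port")
    )
    if not has_remote:
        return False
    # D always pulls from P; P pulls from D only in multi-turn pull mode.
    return (not is_prefill_worker) or multiturn_pull_enabled
```

Both branches would then share the same block-preparation and `_reqs_need_recv` bookkeeping, differing only in this predicate.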
Prefill worker can pull KV cache from remote engines to eliminate redundant prefill computation
Benefits:
(1) Multi-turn conversations: Prefill worker loads larger KV cache from decode worker
(2) Cache eviction recovery: Prefill worker retrieves evicted blocks still available on decode worker
Purpose
To avoid redundant compute cycles on Prefill nodes and hence improve prefill throughput.
Test Plan
Test Result
Unit test report:
HMA unit tests:
Performance benchmarking report:
Tested the following multi-turn conversation benchmark on AWS Trainium (the platform where the feature was initially developed) and AWS p5en.
https://github.com/vllm-project/vllm/tree/main/benchmarks/multi_turn#benchmark-kv-cache-offloading-with-multi-turn-conversations
Here is the data from p5en; similar gains were observed on AWS Trainium as well.
With prefix caching disabled (`--no-enable-prefix-caching`), there is a >2x improvement for TTFT.
(i) without the PR
(ii) with the PR

- TTFT median improved by 28.6% at p50: 219.61ms (KV pull) vs 307.86ms (recompute)
- TTFT p75 improved by 28.6%: 244.65ms vs 342.86ms
- TTFT min improved by 41%: 143.76ms vs 243.50ms

(i) without the PR
(ii) with the PR
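As a sanity check, the reported percentages follow from the raw latencies, computed as (recompute − pull) / recompute:

```python
# Recompute the reported TTFT improvements from the raw numbers above.
ttft_ms = {
    "p50": (219.61, 307.86),  # (KV pull, recompute)
    "p75": (244.65, 342.86),
    "min": (143.76, 243.50),
}
improvement = {
    name: (recompute - pull) / recompute * 100
    for name, (pull, recompute) in ttft_ms.items()
}
```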