[KV Connector] Implement on_new_request for LMCacheMPConnector #42321
Open
chfeng-cs wants to merge 2 commits into
Conversation
Contributor
Code Review
This pull request implements the on_new_request method in LMCacheMPConnector to trigger early KV cache lookups for non-resumable requests. It also includes unit tests to verify that lookups are submitted correctly, trackers are created, and resumable requests are skipped. I have no feedback to provide.
Fixes vllm-project#41784.

Under high load the scheduler's running queue fills the token budget on every step, so the waiting-queue loop is never entered and get_num_new_matched_tokens is never called. As a result the KV connector learns about waiting requests only when the scheduler finally processes them, causing disk/remote KV fetches to start too late and the GPU to stall waiting for data.

The on_new_request hook (added to KVConnectorBase_V1 in vllm-project#41383) is called by the scheduler the moment a request enters add_request(), before any scheduling loop runs. Implementing it in LMCacheMPConnector lets the connector submit the async lookup immediately via maybe_submit_lookup_request, so disk fetches are already in flight by the time the scheduler processes the request.

Resumable (streaming-input) sessions are skipped because their token IDs are incomplete at add_request time; get_num_new_matched_tokens handles those when the full prompt is available. maybe_submit_lookup_request is idempotent (guarded by lookup_futures), so the subsequent call from get_num_new_matched_tokens is a safe no-op.

The early-prefetch path is opt-in via lmcache.mp.early_prefetch in kv_connector_extra_config (default false), preserving prior behavior for existing deployments. Users who want to hide disk lookup latency behind ongoing GPU work set:

    "kv_connector_extra_config": { "lmcache.mp.early_prefetch": true }

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
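As a concrete illustration of the opt-in flag above, here is a minimal, hedged sketch of enabling the early-prefetch path when constructing the engine. The model name is a placeholder and the connector registry name is an assumption; only the extra-config key comes from this PR:

```python
# Hedged sketch: enabling the opt-in early-prefetch path.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="<your-model>",  # placeholder
    kv_transfer_config=KVTransferConfig(
        kv_connector="LMCacheMPConnector",  # assumed registry name
        kv_role="kv_both",
        # The only key taken from this PR; default is false.
        kv_connector_extra_config={"lmcache.mp.early_prefetch": True},
    ),
)
```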
benchmarks/benchmark_lmcache_prefetch.py measures TTFT improvement under queue-saturated load when lmcache.mp.early_prefetch is enabled.

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
chfeng-cs force-pushed from 5f6558b to 0af100f
Purpose
Fixes #41784.
Under high load the scheduler's running queue fills the token budget on every step, so the waiting-queue loop is never entered and get_num_new_matched_tokens is never called. The KV connector only learns about a waiting request when the scheduler finally processes it, which can be many iterations later. By that point a disk or remote KV fetch that should have started seconds ago is only just kicking off, causing GPU stalls.
#41383 added an on_new_request hook to KVConnectorBase_V1, called by the scheduler the moment a request enters add_request(). This PR implements that hook in LMCacheMPConnector: it submits the async lookup via maybe_submit_lookup_request immediately, so disk fetches are already in flight by the time the scheduler processes the request.
Resumable (streaming-input) sessions are skipped because their token IDs are incomplete at add_request time; get_num_new_matched_tokens handles those when the full prompt is available.
maybe_submit_lookup_request is idempotent (guarded by lookup_futures), so the subsequent call from get_num_new_matched_tokens is a safe no-op.
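A minimal sketch of the flow described above. The public names (on_new_request, maybe_submit_lookup_request, lookup_futures, get_num_new_matched_tokens) come from this description; the request attributes, the resumable flag, and the internal plumbing are assumptions, not the PR's actual diff:

```python
# Hedged sketch, not the PR's actual code.
from concurrent.futures import Future, ThreadPoolExecutor


class LMCacheMPConnectorSketch:
    def __init__(self, early_prefetch_enabled: bool):
        self._early_prefetch_enabled = early_prefetch_enabled
        self.lookup_futures: dict[str, Future] = {}
        self._executor = ThreadPoolExecutor(max_workers=1)

    def on_new_request(self, request) -> None:
        # Called by the scheduler from add_request(), before any
        # scheduling loop runs.
        if not self._early_prefetch_enabled:
            return  # opt-in flag off: preserve prior behavior
        if getattr(request, "resumable", False):
            # Token IDs are incomplete at add_request time;
            # get_num_new_matched_tokens handles these later.
            return
        self.maybe_submit_lookup_request(request.request_id,
                                         request.prompt_token_ids)

    def maybe_submit_lookup_request(self, request_id: str,
                                    token_ids: list[int]) -> None:
        # Idempotent: the lookup_futures guard makes the later call from
        # get_num_new_matched_tokens a safe no-op for the same request.
        if request_id in self.lookup_futures:
            return
        self.lookup_futures[request_id] = self._executor.submit(
            self._lookup, request_id, token_ids)

    def _lookup(self, request_id: str, token_ids: list[int]) -> int:
        # Placeholder for the async KV-cache lookup (hypothetical).
        return 0
```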
Not duplicating an existing PR: #42170 adds a new notify_new_request hook in the scheduler, duplicating what #41383 already landed. This PR instead implements the existing on_new_request hook in LMCacheMPConnector, which is the missing piece.
Design note: why no budget control is needed
The earlier approach (#42086) used a per-step scheduler pass (_early_prefetch_waiting_kv) with a token-budget cap to rate-limit how many waiting requests were hinted per scheduling iteration.
That budget was necessary because the pass iterated over the entire waiting queue on every step — without a cap it could submit O(queue_size) lookups per step repeatedly.
on_new_request makes the budget unnecessary: it fires exactly once per request, at add_request() time, so the number of lookups submitted is bounded by the request arrival rate rather than by the waiting-queue length. The push model (on_new_request) is self-limiting by construction; the poll model (maybe_prefetch_request) required an explicit budget to compensate for repeated iteration. A toy illustration follows.
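The following is hedged pseudocode illustrating the contrast, not the actual scheduler code:

```python
# Toy illustration (not vLLM code) of the self-limiting property.

# Poll model: every scheduler step re-scans the whole waiting queue, so
# without an explicit budget it could submit O(len(waiting)) lookups per
# step, on every step.
def poll_step(waiting: list, submit, budget: int) -> None:
    submitted = 0
    for req in waiting:
        if submitted >= budget:  # explicit cap required
            break
        submit(req)
        submitted += 1

# Push model: the hook fires exactly once, when the request arrives, so
# total submissions equal total arrivals, bounded by the arrival rate
# with no budget needed.
def add_request(req, connector) -> None:
    connector.on_new_request(req)  # one shot per request lifetime
```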
Test plan
No GPU, LMCache server, or NIXL needed — uses a mock connector. Tests are skipped automatically when lmcache is not installed.
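A hedged sketch of what such tests can look like; the helper names (make_mock_connector, make_request) are hypothetical stand-ins, not the PR's actual test utilities:

```python
# Hedged test sketch. pytest.importorskip provides the automatic skip
# when lmcache is absent; the helpers below are hypothetical.
import pytest

pytest.importorskip("lmcache")  # skip this module if lmcache is missing


def test_on_new_request_submits_lookup(monkeypatch):
    connector = make_mock_connector(early_prefetch=True)
    submitted = []
    monkeypatch.setattr(connector, "maybe_submit_lookup_request",
                        lambda rid, toks: submitted.append(rid))
    connector.on_new_request(make_request("r1", resumable=False))
    assert submitted == ["r1"]  # lookup submitted, tracker created


def test_resumable_request_is_skipped(monkeypatch):
    connector = make_mock_connector(early_prefetch=True)
    submitted = []
    monkeypatch.setattr(connector, "maybe_submit_lookup_request",
                        lambda rid, toks: submitted.append(rid))
    connector.on_new_request(make_request("r2", resumable=True))
    assert submitted == []  # resumable sessions wait for the full prompt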
Test results:
Benchmark
Preliminary results under queue-saturated load (benchmarks/benchmark_lmcache_prefetch.py):
Setup: Qwen3.5-0.8B, LMCacheMPConnector with disk L2 backend, --prompt-tokens 200, 48 measurement requests after warmup.

Note: a GTX 1660 Super (Turing, CC 7.5) has no FlashAttention2 and slower PCIe bandwidth than datacenter GPUs, so absolute TTFT numbers are high. Results on A10/A100 will be added once the PAI instance is available; the relative improvement is expected to hold or increase under higher throughput.