[Feat][v1] Simple yet General CPU KV Cache Offloading #37160
njhill merged 58 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a significant and well-designed refactoring of the CPU KV cache offloading mechanism. The new SimpleCPUOffloadConnector simplifies the architecture by reusing existing components like BlockPool and KVCacheCoordinator, and it introduces efficient Triton-based copy operations. The code is generally well-structured and clear. My review identified two potential high-severity issues in the new scheduler manager related to state management during request preemption and CPU cache eviction, which could lead to a memory leak and incorrect behavior respectively. These issues are noted with FIXMEs in the code, but I've provided detailed comments on their potential impact and suggestions for resolution.
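To make the "Triton-based copy operations" mentioned above concrete, here is a minimal sketch — not the PR's actual kernel, all names are hypothetical — of the common pattern: gather the GPU KV blocks selected for offload into a contiguous staging buffer with a Triton kernel, then copy the staging buffer to pinned CPU memory asynchronously.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _gather_blocks(src_ptr, dst_ptr, block_ids_ptr, block_numel,
                   BLOCK: tl.constexpr):
    pid = tl.program_id(0)    # which gathered block this program handles
    chunk = tl.program_id(1)  # which BLOCK-sized chunk within that block
    src_block = tl.load(block_ids_ptr + pid)
    offs = chunk * BLOCK + tl.arange(0, BLOCK)
    mask = offs < block_numel
    vals = tl.load(src_ptr + src_block * block_numel + offs, mask=mask)
    tl.store(dst_ptr + pid * block_numel + offs, vals, mask=mask)

def offload_blocks(kv_cache: torch.Tensor, block_ids: torch.Tensor,
                   cpu_dst: torch.Tensor, BLOCK: int = 1024) -> None:
    """Copy the GPU blocks listed in `block_ids` into pinned `cpu_dst`.

    kv_cache: (num_blocks, block_numel) on GPU; block_ids: int tensor on GPU.
    """
    n = block_ids.numel()
    block_numel = kv_cache.shape[1]
    staging = torch.empty((n, block_numel), dtype=kv_cache.dtype,
                          device=kv_cache.device)
    grid = (n, triton.cdiv(block_numel, BLOCK))
    _gather_blocks[grid](kv_cache, staging, block_ids, block_numel, BLOCK=BLOCK)
    # Async device-to-host copy; cpu_dst must be pinned for non_blocking=True.
    cpu_dst[:n].copy_(staging, non_blocking=True)
```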
Hi @ivanium, the pre-commit checks have failed. Please run:

```bash
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```python
# kv_cache_manager is constructed so block_pool is available.
if self.connector is not None and hasattr(
    self.connector, "bind_gpu_block_pool"
):
    self.connector.bind_gpu_block_pool(self.kv_cache_manager.block_pool)
```
I am not really happy with this sort of API change given the maturity we're trying to reach with the Connector interface contract.
Not to mention this won't work with MultiConnector.
To clarify: editing the Connector interface is OK, but this is a hack.
Yeah, this is intentional: it avoids changing the Connector interface for now and keeps this simple CPU offload backend an experimental feature without confusing the other connector backends. I know we have some ongoing plans for Connector API v2, and I think we can discuss/finalize API changes then.
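For illustration, one way a wrapper like MultiConnector could be made compatible with this duck-typed probe is to forward the optional method to its children. This is only a hedged sketch under that assumption, not code from the PR:

```python
class ForwardingMultiConnector:
    """Hypothetical wrapper that forwards the optional bind call to children."""

    def __init__(self, connectors):
        self._connectors = connectors

    def bind_gpu_block_pool(self, block_pool) -> None:
        # Forward only to children that opted into the optional API,
        # mirroring the hasattr() probe used by the scheduler above.
        for connector in self._connectors:
            if hasattr(connector, "bind_gpu_block_pool"):
                connector.bind_gpu_block_pool(block_pool)
```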
```python
if self.connector is not None and hasattr(
    self.connector, "has_pending_transfers"
):
    return self.connector.has_pending_transfers()
```
```python
)

if hit_length > 0:
    return hit_length, True
```
If we have n tokens, we can hit at most n-1 tokens; we need to recompute the last one to get the first logprob.

That's fine, the scheduler will make the reduction when the load is done.

Maybe I'm missing the code. Can you give me a code pointer?
And why can the scheduler make this reduction? For SWA with window size 100, a cache hit of 1000 tokens means tokens [900, 1000] are cached; it doesn't imply a cache hit of 999 tokens, which needs the KV of tokens [899, 999], because we don't know whether token 899 is cached.
vllm/v1/core/sched/scheduler.py, line 2063 (at 3e802e8)

> for swa with window size 100, a cache hit of 1000 tokens means tokens [900, 1000] are cached, it doesn't indicate cache hit of 999 tokens, which needs kv of token [899, 999], because we don't know whether token 899 is cached.
Good point!
I think this code in the scheduler existed before we supported sliding windows.
Anyhow, I think the correct fix is possibly reducing `max_hit_len` by 1 BEFORE calling `self.cpu_coordinator.find_longest_cache_hit`. WDYT?
Nice catch. I think I can change it to `max_hit_len = request.num_tokens - 1 - num_computed_tokens`?
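A hedged sketch of what that fix could look like, with names taken from the surrounding diff and the call signature assumed (it is truncated in the excerpt below, so argument order here is a guess): cap the lookup one token short of the prompt so the last token is always recomputed, which also avoids the SWA over-reporting discussed above.

```python
# Reserve the last token for recomputation so the first logprob is
# available, and never report a hit longer than the capped length.
max_hit_len = request.num_tokens - 1 - num_computed_tokens
if max_hit_len <= 0:
    return 0, False
max_hit_len = min(max_hit_len, len(remaining_hashes) * self.block_size)
_, hit_length = self.cpu_coordinator.find_longest_cache_hit(
    remaining_hashes, max_hit_len  # argument order is an assumption
)
return (hit_length, True) if hit_length > 0 else (0, False)
```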
```python
return 0, False

max_hit_len = len(remaining_hashes) * self.block_size
_, hit_length = self.cpu_coordinator.find_longest_cache_hit(
```
Are you implementing lazy offloading? If yes, the common prefix of all prompts will never be offloaded and you can get 0 cache hits in CPU.
njhill left a comment:

Thanks @ivanium for the great work, and thanks @heheda12345 and @orozery for the really good reviews.
```python
# FIXME(yifan): local_cache_hit can go negative after preemption.
# num_cached_tokens is a one-time snapshot from first scheduling and
# is never reset on preemption, while num_external_computed_tokens is
# overwritten on re-scheduling. If CPU offload finds more tokens on
# the second pass than the original total, the subtraction underflows.
# A fundamental fix is to track the first-time num_external_computed_tokens
# as a separate metric rather than reusing num_external_computed_tokens
# for the metric directly.
self.local_cache_hit += max(
    0, (num_cached_tokens + recomputed - num_external_computed_tokens)
)
```
I think this temporary hack fix is OK; it will hopefully be addressed properly soon via #37460.
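For concreteness, a minimal sketch of the "fundamental fix" the FIXME describes: record the first-time external hit exactly once instead of reusing the field that is overwritten on re-scheduling. The attribute name below is hypothetical, not from the PR.

```python
# First scheduling pass: remember the original external hit exactly once
# (first_external_computed_tokens is a hypothetical new field on Request).
if getattr(request, "first_external_computed_tokens", None) is None:
    request.first_external_computed_tokens = num_external_computed_tokens

# The metric now uses the immutable snapshot, so a larger external hit on
# a post-preemption pass can no longer drive the subtraction negative.
self.local_cache_hit += max(
    0,
    num_cached_tokens + recomputed - request.first_external_computed_tokens,
)
```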
Purpose
`SimpleCPUOffloadConnector` is another design for vLLM's CPU KV cache offloading path. Instead of maintaining a parallel block management stack, it reuses vLLM's existing `BlockPool` and `KVCacheCoordinator` infrastructure directly. This gives us HMA support, prefix caching, and LRU eviction for free. The new design is simpler (~1,400 lines of code), more general (supporting hybrid models and lazy offloading), and has lower per-step overhead.
Full design doc: https://docs.google.com/document/d/1TDY3eSjv7gsTXAcUjKEu15QTKSZpUpZqmnaKafywpgw/edit?usp=sharing
Note
This PR supports regular models and hybrid models with SWA, but not yet hybrid models with Mamba.
Supporting hybrid models with Mamba needs some scheduler-side fixes, and we will address this in a follow-up PR.
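For readers who want to try it, a hypothetical invocation might look like the following. The connector registration name and config values are assumptions based on this PR's naming, not a verified interface; check the PR diff for the actual setup.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# kv_connector name assumed from this PR; kv_role="kv_both" would let the
# connector both save (offload) and load (reuse) KV blocks.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="SimpleCPUOffloadConnector",
        kv_role="kv_both",
    ),
)
```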
Test Plan
Test Result
Overhead
Workload: random, Input:Output=8k:1k, 2 req/s
Multi-turn with CPU KV cache hits (Llama-3.1-8B):
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.