[Feat][v1] Simple yet General CPU KV Cache Offloading #37160
njhill merged 58 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a significant and well-designed refactoring of the CPU KV cache offloading mechanism. The new SimpleCPUOffloadConnector simplifies the architecture by reusing existing components like BlockPool and KVCacheCoordinator, and it introduces efficient Triton-based copy operations. The code is generally well-structured and clear. My review identified two potential high-severity issues in the new scheduler manager related to state management during request preemption and CPU cache eviction, which could lead to a memory leak and incorrect behavior respectively. These issues are noted with FIXMEs in the code, but I've provided detailed comments on their potential impact and suggestions for resolution.
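To make the "Triton-based copy operations" mentioned above concrete, here is a minimal sketch — not the PR's actual kernel, all names are hypothetical — of the common pattern: gather the GPU KV blocks selected for offload into a contiguous staging buffer with a Triton kernel, then copy the staging buffer to pinned CPU memory asynchronously.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _gather_blocks(src_ptr, dst_ptr, block_ids_ptr, block_numel,
                   BLOCK: tl.constexpr):
    pid = tl.program_id(0)    # which gathered block this program handles
    chunk = tl.program_id(1)  # which BLOCK-sized chunk within that block
    src_block = tl.load(block_ids_ptr + pid)
    offs = chunk * BLOCK + tl.arange(0, BLOCK)
    mask = offs < block_numel
    vals = tl.load(src_ptr + src_block * block_numel + offs, mask=mask)
    tl.store(dst_ptr + pid * block_numel + offs, vals, mask=mask)

def offload_blocks(kv_cache: torch.Tensor, block_ids: torch.Tensor,
                   cpu_dst: torch.Tensor, BLOCK: int = 1024) -> None:
    """Copy the GPU blocks listed in `block_ids` into pinned `cpu_dst`.

    kv_cache: (num_blocks, block_numel) on GPU; block_ids: int tensor on GPU.
    """
    n = block_ids.numel()
    block_numel = kv_cache.shape[1]
    staging = torch.empty((n, block_numel), dtype=kv_cache.dtype,
                          device=kv_cache.device)
    grid = (n, triton.cdiv(block_numel, BLOCK))
    _gather_blocks[grid](kv_cache, staging, block_ids, block_numel, BLOCK=BLOCK)
    # Async device-to-host copy; cpu_dst must be pinned for non_blocking=True.
    cpu_dst[:n].copy_(staging, non_blocking=True)
```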
Hi @ivanium, the pre-commit checks have failed. Please run:

```bash
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```python
# kv_cache_manager is constructed so block_pool is available.
if self.connector is not None and hasattr(
    self.connector, "bind_gpu_block_pool"
):
    self.connector.bind_gpu_block_pool(self.kv_cache_manager.block_pool)
```
I am not really happy with this sort of API change given the maturity we're trying to reach with the Connector interface contract.
Not to mention this won't work with MultiConnector.
To clarify: editing the Connector interface is OK, but this is a hack.
Yeah, this is intentional: it avoids changing the Connector interface for now and keeps this simple CPU offload backend an experimental feature without confusing the other connector backends. I know we have some ongoing plans for Connector API v2, and I think we can discuss/finalize API changes then.
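For illustration, one way a wrapper like MultiConnector could be made compatible with this duck-typed probe is to forward the optional method to its children. This is only a hedged sketch under that assumption, not code from the PR:

```python
class ForwardingMultiConnector:
    """Hypothetical wrapper that forwards the optional bind call to children."""

    def __init__(self, connectors):
        self._connectors = connectors

    def bind_gpu_block_pool(self, block_pool) -> None:
        # Forward only to children that opted into the optional API,
        # mirroring the hasattr() probe used by the scheduler above.
        for connector in self._connectors:
            if hasattr(connector, "bind_gpu_block_pool"):
                connector.bind_gpu_block_pool(block_pool)
```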
```python
if self.connector is not None and hasattr(
    self.connector, "has_pending_transfers"
):
    return self.connector.has_pending_transfers()
```
```python
)

if hit_length > 0:
    return hit_length, True
```
If we have n tokens, we can hit at most n-1 tokens; we need to recompute the last one to get the first logprob.

That's fine, the scheduler will make the reduction when the load is done.

Maybe I'm missing the code. Can you give me a code pointer?
And why can the scheduler make this reduction? For SWA with window size 100, a cache hit of 1000 tokens means tokens [900, 1000] are cached; it doesn't imply a cache hit of 999 tokens, which needs the KV of tokens [899, 999], because we don't know whether token 899 is cached.
vllm/v1/core/sched/scheduler.py, line 2063 (at 3e802e8)

> for swa with window size 100, a cache hit of 1000 tokens means tokens [900, 1000] are cached, it doesn't indicate cache hit of 999 tokens, which needs kv of token [899, 999], because we don't know whether token 899 is cached.
Good point!
I think this code in the scheduler existed before we supported sliding windows.
Anyhow, I think the correct fix is possibly reducing `max_hit_len` by 1 BEFORE calling `self.cpu_coordinator.find_longest_cache_hit`. WDYT?
Nice catch. I think I can change it to `max_hit_len = request.num_tokens - 1 - num_computed_tokens`?
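A hedged sketch of what that fix could look like, with names taken from the surrounding diff and the call signature assumed (it is truncated in the excerpt below, so argument order here is a guess): cap the lookup one token short of the prompt so the last token is always recomputed, which also avoids the SWA over-reporting discussed above.

```python
# Reserve the last token for recomputation so the first logprob is
# available, and never report a hit longer than the capped length.
max_hit_len = request.num_tokens - 1 - num_computed_tokens
if max_hit_len <= 0:
    return 0, False
max_hit_len = min(max_hit_len, len(remaining_hashes) * self.block_size)
_, hit_length = self.cpu_coordinator.find_longest_cache_hit(
    remaining_hashes, max_hit_len  # argument order is an assumption
)
return (hit_length, True) if hit_length > 0 else (0, False)
```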
```python
return 0, False

max_hit_len = len(remaining_hashes) * self.block_size
_, hit_length = self.cpu_coordinator.find_longest_cache_hit(
```
Are you implementing lazy offloading? If yes, the common prefix of all prompts will never be offloaded and you can get 0 cache hits in CPU.
njhill left a comment:

Thanks @ivanium for the great work, and thanks @heheda12345 and @orozery for the really good reviews.
```python
# FIXME(yifan): local_cache_hit can go negative after preemption.
# num_cached_tokens is a one-time snapshot from first scheduling and
# is never reset on preemption, while num_external_computed_tokens is
# overwritten on re-scheduling. If CPU offload finds more tokens on
# the second pass than the original total, the subtraction underflows.
# A fundamental fix is to track the first-time num_external_computed_tokens
# as a separate metric rather than reusing num_external_computed_tokens
# for the metric directly.
self.local_cache_hit += max(
    0, (num_cached_tokens + recomputed - num_external_computed_tokens)
)
```
I think this temporary hack fix is OK; it will hopefully be addressed properly soon via #37460.
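For concreteness, a minimal sketch of the "fundamental fix" the FIXME describes: record the first-time external hit exactly once instead of reusing the field that is overwritten on re-scheduling. The attribute name below is hypothetical, not from the PR.

```python
# First scheduling pass: remember the original external hit exactly once
# (first_external_computed_tokens is a hypothetical new field on Request).
if getattr(request, "first_external_computed_tokens", None) is None:
    request.first_external_computed_tokens = num_external_computed_tokens

# The metric now uses the immutable snapshot, so a larger external hit on
# a post-preemption pass can no longer drive the subtraction negative.
self.local_cache_hit += max(
    0,
    num_cached_tokens + recomputed - request.first_external_computed_tokens,
)
```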
Purpose
`SimpleCPUOffloadConnector` is another design for vLLM's CPU KV cache offloading path. Instead of maintaining a parallel block management stack, it reuses vLLM's existing `BlockPool` and `KVCacheCoordinator` infrastructure directly. This gives us HMA support, prefix caching, and LRU eviction for free. The new design is simpler (~1,400 lines of code), more general (supporting hybrid models and lazy offloading), and has lower per-step overhead.
Full design doc: https://docs.google.com/document/d/1TDY3eSjv7gsTXAcUjKEu15QTKSZpUpZqmnaKafywpgw/edit?usp=sharing
Note
This PR supports regular models and hybrid models with SWA, but not yet hybrid models with Mamba.
Supporting hybrid models with Mamba needs some scheduler-side fixes, and we will address this in a follow-up PR.
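For readers who want to try it, a hypothetical invocation might look like the following. The connector registration name and config values are assumptions based on this PR's naming, not a verified interface; check the PR diff for the actual setup.

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# kv_connector name assumed from this PR; kv_role="kv_both" would let the
# connector both save (offload) and load (reuse) KV blocks.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="SimpleCPUOffloadConnector",
        kv_role="kv_both",
    ),
)
```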
Test Plan
Test Result
Overhead
Workload: random, Input:Output=8k:1k, 2 req/s
Multi-turn with CPU KV cache hits (Llama-3.1-8B):
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.