[kv_offload] Add multi-tier KV cache offloading framework #40020
ronensc wants to merge 23 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a multi-tier KV cache offloading system, featuring a TieringOffloadingManager that orchestrates data movement between a primary CPU tier and multiple secondary tiers. The implementation includes logic for cascading offloads to all tiers and staged promotion of blocks back to the primary tier for GPU access. Feedback identifies potential memory safety risks when using memoryview with PyTorch-backed NumPy arrays and suggests performance optimizations regarding the allocation of key lists during lookup operations.
```python
self._secondary_views: list[memoryview] = []
cpu_tensor = primary_tier.get_primary_kv_tensor()
for tier in self.secondary_tiers:
    view = memoryview(cpu_tensor.numpy())
```
Creating a memoryview from a NumPy array that is a view of a PyTorch tensor can lead to undefined behavior if the underlying PyTorch tensor is reallocated or if its storage is modified in a way that NumPy doesn't track. While _mmap_region._base is expected to be stable, it is safer to ensure the tensor is contiguous and its storage is explicitly kept alive. Additionally, verify that cpu_tensor.numpy() does not create a copy, which would defeat the zero-copy objective.
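As a minimal, hypothetical sketch (not code from this PR) of how to verify the zero-copy property and keep the storage alive:

```python
# Illustrative only: check that .numpy() shares storage with the tensor,
# and keep the tensor referenced for as long as the memoryview lives.
import torch

cpu_tensor = torch.zeros(4, 1024, dtype=torch.uint8)
assert cpu_tensor.is_contiguous()  # ensure a single flat buffer before exporting

arr = cpu_tensor.numpy()
assert arr.ctypes.data == cpu_tensor.data_ptr()  # same base address => no copy

view = memoryview(arr)
# Holding a reference to cpu_tensor alongside the view prevents the
# underlying storage from being freed while secondary tiers still use it.
keepalive = (cpu_tensor, view)
```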
```python
# are finalized and available in the primary tier
self._process_finished_jobs()

keys_list = list(keys)
```
orozery left a comment
Could you please rebase?
Can you extend test_cpu_offloading.py::test_cpu_offloading with a new pytest parameter enable_tiering: bool?
```python
Abstract interface for managing a single non-primary offloading tier.

Secondary tiers cannot directly access GPU memory. All data transfers
must go through the primary tier (implemented as CPU in current version):
```
I don't think we have plans to support a primary tier other than CPU.
Maybe we should adapt the comments accordingly?
```python
transfer job, but does NOT perform the actual data transfer on the
calling thread.

The caller (TieringOffloadingManager) must have already called
```
The caller-type documentation (added throughout these files) is helpful for understanding the E2E flow.
However, from the point of view of someone implementing a secondary tier, I'm not sure it is interesting, and perhaps even confusing.
I think we should document here the minimum necessary from the secondary tier implementor's POV.
| """Result of an async transfer job (successful or failed).""" | ||
|
|
||
| job_id: JobId | ||
| success: bool |
This is good for now.
Later on we will want to add stats as well.
| """Metadata for an in-flight async transfer job.""" | ||
|
|
||
| job_id: JobId | ||
| keys: list[OffloadKey] |
I think we want to change this to Sequence[OffloadKey], to make sure we avoid conversions between data types.
This means also changing OffloadingManager to use it instead of Iterable for lookup, prepare_load, and prepare_store.
Maybe it's better to do it in a prequel PR.
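A rough sketch of the suggested typing change (JobId and OffloadKey are the PR's own types; the class shape is assumed from the quoted diff):

```python
from collections.abc import Sequence
from dataclasses import dataclass

@dataclass
class JobMetadata:
    """Metadata for an in-flight async transfer job."""

    job_id: JobId
    # Sequence accepts lists and tuples as-is, so callers never need to
    # convert between container types before constructing the metadata.
    keys: Sequence[OffloadKey]
```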
Partly done in 675e056.
Will submit a prequel PR and continue once it's merged.
```python
job_id: JobId
keys: list[OffloadKey]
spec: LoadStoreSpec
```
can we replace this with block_ids: np.ndarray?
```python
# Process any completed async jobs first to ensure promoted blocks
# are finalized and available in the primary tier
self._process_finished_jobs()
```
We call the secondary tiers' get_finished on every lookup, as well as on every prepare_load and prepare_store.
I think we should call it once per engine step.
Currently, we can detect that an engine step has finished when OffloadingManager.take_events is called.
So maybe add a _processed_finished_jobs: bool flag that gets reset on take_events?
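A hypothetical sketch of that gating (method and attribute names are illustrative, not from the PR):

```python
class TieringOffloadingManager:
    def __init__(self) -> None:
        self._finished_jobs_processed = False

    def _maybe_process_finished_jobs(self) -> None:
        # Called from lookup(), prepare_load() and prepare_store(); polls
        # the secondary tiers at most once per engine step.
        if not self._finished_jobs_processed:
            self._process_finished_jobs()
            self._finished_jobs_processed = True

    def take_events(self):
        # take_events() marks the end of an engine step, so re-arm the
        # polling for the next step.
        self._finished_jobs_processed = False
        yield from ()

    def _process_finished_jobs(self) -> None:
        ...  # poll each secondary tier's get_finished(), as in the PR
```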
```python
# are finalized and available in the primary tier
self._process_finished_jobs()

keys_list = list(keys)
```
See my previous comment on moving to Sequence[OffloadKey], then we don't need this list conversion.
After rebasing on main (which now includes #36645), OffloadingManager.lookup() accepts a single key instead of a list.
Should we update SecondaryTierManager.lookup() accordingly to take a single key as well?
> Should we update SecondaryTierManager.lookup() accordingly to take a single key as well?
Yep!
Ok, done.
Worth noting that now submit_load() will always receive a JobMetadata with exactly one key, even though it allows a list of keys.
```python
secondary_hits = tier.lookup(remaining_keys)

# Skip if tier is busy (None) or has no hits (0)
if not secondary_hits:
```
If the tier is busy we should return None as well (not immediately though).
```python
if primary_store_result is None:
    # Cannot allocate space in primary tier (full)
    # The next lookup() will retry
    return
```
I think that in this case we want to allow the request to proceed (not return None in lookup).
```python
if primary_hits == len(keys_list):
    # All blocks in primary tier
    return primary_hits
```
I think it may be a good idea to still call lookup on the secondary tiers, to allow their indexes to warm up for the given keys.
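Something like this hypothetical variant of the quoted code (the surrounding names are taken from the quoted diff; the secondary results are deliberately ignored on a full primary hit):

```python
if primary_hits == len(keys_list):
    # All blocks are already in the primary tier, but still probe the
    # secondary tiers so their indexes can warm up for these keys.
    for tier in self.secondary_tiers:
        tier.lookup(keys_list)
    return primary_hits
```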
@orozery Thanks for the thorough review! I'll address all the points and follow up with fixes.
```python
# Track this load job
job_metadata = JobMetadata(
    job_id=job_id,
    keys=keys,
```
Since the keys that already exist on the disk are filtered out, we probably want to pass the filtered keys here instead of the original ones:
keys=primary_store_result.keys_to_store.
Thanks! Good catch. Fixed in the latest commit.
Force-pushed from d2a72d1 to d6e3154
Hi @ronensc, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
ronensc left a comment
I've rebased the PR, resolved conflicts, and addressed part of the feedback. I'll follow up on the remaining comments later.
```python
decrementing ref_cnt."""
self.complete_load(keys)

def get_primary_kv_tensor(self) -> torch.Tensor:
```
No problem. Renamed it to create_kv_memoryview().
```python
self._secondary_views: list[memoryview] = []
cpu_tensor = primary_tier.get_primary_kv_tensor()
for tier in self.secondary_tiers:
    view = memoryview(cpu_tensor.numpy())
```
Changed to a single memoryview shared across all tiers.
I'll follow up on the numpy() issue.
```python
- store_threshold: (optional) How many times a block must appear in lookup()
  before it is eligible for CPU offloading. Values < 2 disable filtering
  (default: 0)
- max_tracker_size: (optional) Maximum number of blocks tracked for
  store_threshold filtering (default: 64000)
```
No problem, removed it.
| """ | ||
| return | ||
|
|
||
| @abstractmethod |
IIUC, the goal here is to remove the need for tier_name in the config: https://github.com/ronensc/vllm/blob/84c6059c55bf43854212609e740075c3717a7c68/vllm/v1/kv_offload/tiering/spec.py#L34
If so, this would prevent defining multiple secondary tiers of the same type.
Is that intentional?
@ronensc regarding file structure. With this, I think we should have everything under
Re-organized the file tree accordingly.
```python
assert isinstance(store_spec, CPULoadStoreSpec)
job_metadata = JobMetadata(
    job_id=job_id,
    keys=primary_store_result.keys_to_store,
```
The length of keys might be 0. In such a case, avoid calling submit_load().
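For example, a hypothetical guard using the names from the quoted diff:

```python
keys_to_store = primary_store_result.keys_to_store
if not keys_to_store:
    # All keys were filtered out (already stored); submitting a transfer
    # job with zero keys would be pointless, so bail out early.
    return
job_metadata = JobMetadata(job_id=job_id, keys=keys_to_store, spec=store_spec)
```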
```python
)
self._load_jobs[job_id] = job_metadata

tier.submit_load(job_metadata)
```
Looks like submit_load() will always be called with at most 1 block. This is suboptimal.
Can we try to batch multiple blocks before calling submit_load()?
That would reduce the number of transfer operations.
Correct. This stems from the change in https://github.com/vllm-project/vllm/pull/36645/changes#diff-d8f2304e5d54e4b60670dcea11b7d5e33d006ec71db6e6cd22526501e915e4a5L94-L98, where lookup() was updated to accept a single key instead of a batch.
Batching multiple blocks before calling submit_load() could indeed reduce the number of transfer operations. The tradeoff is that it would introduce some delay in submitting blocks to the secondary managers, which may add latency.
@orozery WDYT?
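For concreteness, one possible shape for such batching (all names hypothetical, including the submit_load_batch API): buffer keys as they arrive and flush once per engine step, bounding the added latency to a single step:

```python
class BatchingSubmitter:
    """Hypothetical helper that coalesces single-key submissions."""

    def __init__(self, tier, max_batch: int = 16) -> None:
        self.tier = tier
        self.max_batch = max_batch
        self.pending: list[OffloadKey] = []

    def enqueue(self, key: OffloadKey) -> None:
        self.pending.append(key)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        # Invoked at the end of each engine step (e.g. from take_events()),
        # so a key waits at most one step before being transferred.
        if self.pending:
            self.tier.submit_load_batch(self.pending)  # hypothetical API
            self.pending = []
```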
Purpose

Adds a hierarchical (tiered) KV cache offloading framework under vllm/v1/kv_offload/tiering/, extending the existing single-tier CPU offloading with support for chained secondary tiers (e.g., storage, network). Implements the design proposed in #38260 ([RFC]: Multi-tier KV offloading via the vLLM offloading connector).

Key additions:

- SecondaryTierManager ABC (abstract.py): interface for secondary tier backends, defining async store/load/lookup primitives and a JobResult protocol for polling completions
- CPUPrimaryTierOffloadingManager (tiering/manager.py): wraps CPUOffloadingManager and exposes a secondary-facing read/write alias API (prepare_read/complete_read, prepare_write/complete_write) to clarify directionality when called from the cascade/promotion paths
- TieringOffloadingManager (tiering/manager.py): orchestrates GPU↔CPU (primary) and CPU→secondary tier transfers; lookup() returns None while promotion is in flight to signal "retry later"; prepare_read() increments ref_cnt to protect blocks from eviction during async transfers
- TieringOffloadingSpec (tiering/spec.py): entry point for the tiered stack; a CPUOffloadingSpec subclass that reads secondary_tiers from kv_connector_extra_config and assembles the TieringOffloadingManager (see the example config after this list)
- DummySecondaryTier (secondary_tiers/dummy.py): in-memory secondary tier for testing, with optional async simulation
- SharedOffloadRegion integration: CPUPrimaryTierOffloadingManager accepts the existing SharedOffloadRegion so secondary tiers can memoryview primary tier buffers zero-copy
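For illustration, a hedged sketch of how the tiered stack might be wired up through the connector config. The kv_connector/kv_role/kv_connector_extra_config fields are vLLM's existing KVTransferConfig API; the extra-config keys (spec selection, num_cpu_blocks, and the secondary_tiers schema) are assumptions for illustration, not taken verbatim from this PR:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_transfer_config=KVTransferConfig(
        kv_connector="OffloadingConnector",
        kv_role="kv_both",
        kv_connector_extra_config={
            # Hypothetical keys: select the tiered spec, size the CPU
            # (primary) tier, and declare one dummy secondary tier.
            "spec_name": "TieringOffloadingSpec",
            "num_cpu_blocks": 1024,
            "secondary_tiers": [{"type": "dummy"}],
        },
    ),
)
```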