[kv_offload] Add multi-tier KV cache offloading framework #40020

Open
ronensc wants to merge 23 commits into vllm-project:main from ronensc:tier-offload

Conversation

ronensc (Contributor) commented Apr 16, 2026

Purpose

Adds a hierarchical (tiered) KV cache offloading framework under vllm/v1/kv_offload/tiering/, extending the existing single-tier CPU offloading with support for chained secondary tiers (e.g., storage, network).

Implements the design proposed in #38260 [RFC]: Multi-tier KV offloading via the vLLM offloading connector.

Key additions:

  • SecondaryTierManager ABC (abstract.py) — interface for secondary tier backends, defining async store/load/lookup primitives and a JobResult protocol for polling completions
  • CPUPrimaryTierOffloadingManager (tiering/manager.py) — wraps CPUOffloadingManager and exposes a secondary-facing read/write alias API (prepare_read/complete_read, prepare_write/complete_write) to clarify directionality when called from the cascade/promotion paths
  • TieringOffloadingManager (tiering/manager.py) — orchestrates GPU↔CPU (primary) and CPU→secondary tier transfers:
    • Cascade on store: blocks written by GPU are fanned out to all secondary tiers
    • Staged promotion on load: blocks missing from primary are fetched from secondary → primary before the GPU can access them; lookup() returns None while promotion is in flight to signal "retry later"
    • ref_cnt protection: prepare_read() increments ref_cnt to protect blocks from eviction during async transfers
  • TieringOffloadingSpec (tiering/spec.py) — entry point for the tiered stack; a CPUOffloadingSpec subclass that reads secondary_tiers from kv_connector_extra_config and assembles the TieringOffloadingManager
  • DummySecondaryTier (secondary_tiers/dummy.py) — in-memory secondary tier for testing, with optional async simulation
  • SharedOffloadRegion integration — CPUPrimaryTierOffloadingManager accepts the existing SharedOffloadRegion so secondary tiers can memoryview primary tier buffers zero-copy

Test Plan

.venv/bin/python -m pytest tests/v1/kv_offload/test_tiering_offloading.py -v

Test Result


tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_basic_store_and_lookup PASSED                                           [  6%]
tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_in_flight_blocks_return_none PASSED                                     [ 12%]
tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_lru_eviction PASSED                                                     [ 18%]
tests/v1/kv_offload/test_tiering_offloading.py::TestDummySecondaryTier::test_async_simulation PASSED                                                 [ 25%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_basic_store_to_primary PASSED                                     [ 31%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_cascade_to_all_secondary_tiers PASSED                             [ 37%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_ref_cnt_protection_during_cascade PASSED                          [ 43%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_lookup_from_primary PASSED                                        [ 50%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_promotion_from_secondary PASSED                                   [ 56%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_partial_lookup PASSED                                             [ 62%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_eviction_in_primary_tier PASSED                                   [ 68%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_touch_propagates_to_all_tiers PASSED                              [ 75%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_failed_store_no_cascade PASSED                                    [ 81%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_multiple_secondary_tiers_independent_eviction PASSED              [ 87%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingManager::test_prepare_store_processes_finished_jobs_first PASSED                [ 93%]
tests/v1/kv_offload/test_tiering_offloading.py::TestTieringOffloadingWithoutSecondaryTiers::test_works_without_secondary_tiers PASSED                [100%]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@ronensc ronensc requested review from ApostaC and orozery as code owners April 16, 2026 13:50
@mergify mergify Bot added the v1 label Apr 16, 2026
gemini-code-assist Bot left a comment

Code Review

This pull request introduces a multi-tier KV cache offloading system, featuring a TieringOffloadingManager that orchestrates data movement between a primary CPU tier and multiple secondary tiers. The implementation includes logic for cascading offloads to all tiers and staged promotion of blocks back to the primary tier for GPU access. Feedback identifies potential memory safety risks when using memoryview with PyTorch-backed NumPy arrays and suggests performance optimizations regarding the allocation of key lists during lookup operations.

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
self._secondary_views: list[memoryview] = []
cpu_tensor = primary_tier.get_primary_kv_tensor()
for tier in self.secondary_tiers:
view = memoryview(cpu_tensor.numpy())
Severity: high

Creating a memoryview from a NumPy array that is a view of a PyTorch tensor can lead to undefined behavior if the underlying PyTorch tensor is reallocated or if its storage is modified in a way that NumPy doesn't track. While _mmap_region._base is expected to be stable, it is safer to ensure the tensor is contiguous and its storage is explicitly kept alive. Additionally, verify that cpu_tensor.numpy() does not create a copy, which would defeat the zero-copy objective.
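One way to sanity-check the zero-copy concern raised here: for a contiguous CPU tensor, .numpy() shares storage with the tensor, and a memoryview over that array writes through to it. The tensor below is a stand-in for the primary-tier KV buffer, not the PR's actual object.

```python
import torch

# Stand-in for the primary-tier KV buffer
cpu_tensor = torch.zeros(64, dtype=torch.uint8)
arr = cpu_tensor.numpy()      # shares memory for contiguous CPU tensors (no copy)
view = memoryview(arr)

# Same underlying storage => same base address, so no copy was made
assert arr.ctypes.data == cpu_tensor.data_ptr()

# Writes through the memoryview are visible in the tensor
view[0] = 7
assert cpu_tensor[0].item() == 7
```

Keeping a reference to the tensor (or the array) alive alongside the view is what prevents the storage from being freed underneath the memoryview, which is the safety point the review raises.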

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
# are finalized and available in the primary tier
self._process_finished_jobs()

keys_list = list(keys)
Severity: high

Converting keys to a list here might be inefficient if the number of blocks is large, as it triggers an immediate iteration and allocation. Consider keeping it as an iterable or using a more memory-efficient way to handle the keys if high-throughput offloading is expected.

orozery (Collaborator) left a comment

Could you please rebase?
Can you extend test_cpu_offloading.py::test_cpu_offloading with a new pytest parameter enable_tiering: bool?

Comment thread vllm/v1/kv_offload/abstract.py Outdated
Abstract interface for managing a single non-primary offloading tier.

Secondary tiers cannot directly access GPU memory. All data transfers
must go through the primary tier (implemented as CPU in current version):
orozery (Collaborator):

I don't think we have plans to support a primary tier other than CPU.
Maybe we should adapt comments to that?

Comment thread vllm/v1/kv_offload/abstract.py Outdated
transfer job, but does NOT perform the actual data transfer on the
calling thread.

The caller (TieringOffloadingManager) must have already called
orozery (Collaborator):

The caller-type documentation (added throughout these files) is helpful for understanding the E2E flow.
However, from the point of view of someone implementing a secondary tier, I'm not sure it is interesting, and it is perhaps even confusing.
I think we should document here the minimum necessary from the secondary-tier implementor's POV.

Comment thread vllm/v1/kv_offload/abstract.py Outdated
"""Result of an async transfer job (successful or failed)."""

job_id: JobId
success: bool
orozery (Collaborator):

This is good for now.
Later on we will want to add stats as well.

Comment thread vllm/v1/kv_offload/abstract.py Outdated
"""Metadata for an in-flight async transfer job."""

job_id: JobId
keys: list[OffloadKey]
orozery (Collaborator):

I think we want to change this to Sequence[OffloadKey], to make sure we avoid conversions between data types.
This also means changing OffloadingManager to use it instead of Iterable for lookup, prepare_load, and prepare_store.
Maybe it's better to do it in a prequel PR.

ronensc (Author):

Partly done in 675e056
Will submit a prequel PR and continue once merged

ronensc (Author):

Prequel PR: #41200

Comment thread vllm/v1/kv_offload/abstract.py Outdated

job_id: JobId
keys: list[OffloadKey]
spec: LoadStoreSpec
orozery (Collaborator):

Can we replace this with block_ids: np.ndarray?

ronensc (Author):

Done in a9426da

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
Comment on lines +221 to +223
# Process any completed async jobs first to ensure promoted blocks
# are finalized and available in the primary tier
self._process_finished_jobs()
orozery (Collaborator):

We call the secondary tiers' get_finished on every lookup, as well as on every prepare_load and prepare_store.
I think we should call it once per engine step.
Currently, we can detect that an engine step has finished when OffloadingManager.take_events is called.
So maybe add some _processed_finished_jobs: bool that is reset to True on take_events?
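The once-per-step gating suggested here could look roughly like the following. This is a hedged sketch: the class and method names mirror the discussion (TieringOffloadingManager, _process_finished_jobs, take_events), but the flag name and real signatures are illustrative.

```python
class TieringOffloadingManager:
    """Sketch of polling secondary tiers at most once per engine step."""

    def __init__(self):
        self._finished_processed_this_step = False

    def _maybe_process_finished_jobs(self):
        # Called from lookup / prepare_load / prepare_store; polls at most
        # once per engine step instead of on every call.
        if not self._finished_processed_this_step:
            self._process_finished_jobs()
            self._finished_processed_this_step = True

    def take_events(self):
        # take_events marks the end of an engine step; re-arm the poll.
        self._finished_processed_this_step = False
        return []

    def _process_finished_jobs(self):
        # Would iterate secondary tiers and drain tier.get_finished().
        pass
```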

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
# are finalized and available in the primary tier
self._process_finished_jobs()

keys_list = list(keys)
orozery (Collaborator):

See my previous comment on moving to Sequence[OffloadKey], then we don't need this list conversion.

ronensc (Author):

After rebasing on main (which now includes #36645), OffloadingManager.lookup() accepts a single key instead of a list.

Should we update SecondaryTierManager.lookup() accordingly to take a single key as well?

orozery (Collaborator):

Should we update SecondaryTierManager.lookup() accordingly to take a single key as well?

Yep!

ronensc (Author):

Ok, done.
Worth noting that now submit_load() will always receive a JobMetadata with exactly one key, even though it allows a list of keys.

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
secondary_hits = tier.lookup(remaining_keys)

# Skip if tier is busy (None) or has no hits (0)
if not secondary_hits:
orozery (Collaborator):

If tier is busy we should return None as well (not immediately though).

Comment on lines +286 to +289
if primary_store_result is None:
# Cannot allocate space in primary tier (full)
# The next lookup() will retry
return
orozery (Collaborator):

I think that in this case we want to allow the request to proceed (not return None in lookup).

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
Comment on lines +234 to +236
if primary_hits == len(keys_list):
# All blocks in primary tier
return primary_hits
orozery (Collaborator):

I think it's maybe a good idea to still call lookup on the secondary tiers, to allow their index to warm up for the given keys.

ronensc (Author) commented Apr 20, 2026

@orozery Thanks for the thorough review! I’ll address all the points and follow up with fixes.

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
# Track this load job
job_metadata = JobMetadata(
job_id=job_id,
keys=keys,

Since the keys that already exist on the disk are filtered out, we probably want to pass the filtered keys here instead of the original ones:
keys=primary_store_result.keys_to_store.

ronensc (Author):

Thanks! Good catch. Fixed in the latest commit.

@ronensc ronensc force-pushed the tier-offload branch 2 times, most recently from d2a72d1 to d6e3154 Compare April 20, 2026 14:31
mergify Bot commented Apr 20, 2026

Hi @ronensc, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

ronensc added 4 commits April 20, 2026 17:43
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
ronensc added 3 commits April 21, 2026 09:01
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
…iew()

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
ronensc (Author) left a comment

I've rebased the PR, resolved conflicts, and addressed part of the feedback. I'll follow up on the remaining comments later.

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
decrementing ref_cnt."""
self.complete_load(keys)

def get_primary_kv_tensor(self) -> torch.Tensor:
ronensc (Author):

No problem. Renamed it to create_kv_memoryview().

Comment thread vllm/v1/kv_offload/tiering/manager.py Outdated
self._secondary_views: list[memoryview] = []
cpu_tensor = primary_tier.get_primary_kv_tensor()
for tier in self.secondary_tiers:
view = memoryview(cpu_tensor.numpy())
ronensc (Author):

Changed to a single memoryview shared by all tiers.
I'll follow up on the numpy() issue.

Comment thread vllm/v1/kv_offload/tiering/spec.py Outdated
Comment on lines +14 to +18
- store_threshold: (optional) How many times a block must appear in lookup()
before it is eligible for CPU offloading. Values < 2 disable filtering
(default: 0)
- max_tracker_size: (optional) Maximum number of blocks tracked for
store_threshold filtering (default: 64000)
ronensc (Author):

No problem, removed it.

Comment thread vllm/v1/kv_offload/abstract.py Outdated
"""
return

@abstractmethod
ronensc (Author):

IIUC, the goal here is to remove the need for tier_name in the config https://github.com/ronensc/vllm/blob/84c6059c55bf43854212609e740075c3717a7c68/vllm/v1/kv_offload/tiering/spec.py#L34
If so, this would prevent defining multiple secondary tiers of the same type.
Is that intentional?

orozery (Collaborator) commented Apr 21, 2026

@ronensc regarding file structure.
I have general ideas for re-organizing files.
See here:
#33689

With this, I think we should have everything under tiering/ as follows:

tiering/
  manager.py
  spec.py
  base.py (move your changes from abstract.py in here)
  factory.py (of secondary tiers)
  dummy/
  <future secondary tier>/

ronensc added 4 commits April 23, 2026 11:22
…multiple keys

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
ronensc (Author) commented Apr 23, 2026

@ronensc regarding file structure. I have general ideas for re-organizing files. See here: #33689

With this, I think we should have everything under tiering/ as follows:

tiering/
  manager.py
  spec.py
  base.py (move your changes from abstract.py in here)
  factory.py (of secondary tiers)
  dummy/
  <future secondary tier>/

Re-organized the file tree accordingly.

ronensc added 2 commits April 28, 2026 10:20
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
ronensc added 2 commits April 28, 2026 10:46
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
mergify Bot commented Apr 29, 2026

(Same pre-commit failure notice as above.)

mergify Bot commented Apr 29, 2026

(Same pre-commit failure notice as above.)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
assert isinstance(store_spec, CPULoadStoreSpec)
job_metadata = JobMetadata(
job_id=job_id,
keys=primary_store_result.keys_to_store,

The length of keys might be 0. In that case, avoid calling submit_load().
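The suggested guard could be sketched as follows. JobMetadata and the tier API here are simplified stand-ins for the PR's types; only the "skip when empty" logic is the point.

```python
from dataclasses import dataclass

@dataclass
class JobMetadata:
    """Simplified stand-in for the PR's JobMetadata."""
    job_id: int
    keys: list

def maybe_submit_load(tier, job_id, keys, load_jobs: dict) -> bool:
    """Submit a load job only if there is something left to promote."""
    if not keys:  # everything already in the primary tier; skip the no-op job
        return False
    meta = JobMetadata(job_id, list(keys))
    load_jobs[job_id] = meta
    tier.submit_load(meta)
    return True
```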

)
self._load_jobs[job_id] = job_metadata

tier.submit_load(job_metadata)

Looks like submit_load() will always be called with at most 1 block. This is suboptimal.
Can we try to batch multiple blocks before calling submit_load()?
That would reduce the number of transfer operations.

ronensc (Author):

Correct. This stems from the change in https://github.com/vllm-project/vllm/pull/36645/changes#diff-d8f2304e5d54e4b60670dcea11b7d5e33d006ec71db6e6cd22526501e915e4a5L94-L98, where lookup() was updated to accept a single key instead of a batch.

Batching multiple blocks before calling submit_load() could indeed reduce the number of transfer operations.
The tradeoff is that it would introduce some delay in submitting blocks to the secondary managers, which may add latency.
@orozery WDYT?
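The batching idea under discussion could be sketched like this: accumulate promotion keys during a step and submit them as one job at the step boundary. All names are illustrative, not the PR's actual API; the latency tradeoff mentioned above is exactly the wait until flush().

```python
class LoadBatcher:
    """Accumulate keys and submit them as a single load job."""

    def __init__(self, submit):
        self._submit = submit    # e.g. a tier's submit_load callable
        self._pending = []

    def request(self, key):
        # Defer instead of submitting a one-key job immediately
        self._pending.append(key)

    def flush(self):
        # Called once per engine step (e.g. alongside take_events);
        # submits all accumulated keys as one transfer operation.
        if self._pending:
            batch, self._pending = self._pending, []
            self._submit(batch)
```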

ronensc added 3 commits April 30, 2026 12:42
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
mergify Bot commented May 1, 2026

(Same pre-commit failure notice as above.)
