[Demo] Add Support for File-Based KV Cache Offloading#40330
chaunceyjiang wants to merge 2 commits into vllm-project:main from
Conversation
Code Review
This pull request implements file-based KV cache offloading for vLLM, introducing a manager for disk storage and a handler for asynchronous data transfers between GPU memory and files. The review identifies a critical bug in the block eviction logic where keys are removed from the tracking dictionary before their block IDs are retrieved, which would lead to data corruption. Additionally, feedback highlights performance bottlenecks in the GPU-to-CPU transfer path, redundant directory creation calls within I/O loops, and a logic error in file opening modes that would cause unintended data truncation when using block offsets.
```python
for key, _ in candidates:
    self._blocks.pop(key)
    self._free_block(self._blocks.get(key) or FileBlockStatus(-1))
```
There is a critical logic error in the eviction process. The code pops the key from self._blocks and then immediately tries to retrieve it again using self._blocks.get(key), which will always return None. This results in a new FileBlockStatus with block_id=-1 being passed to _free_block, causing the _free_list to be populated with -1. This will lead to data corruption as multiple offloaded blocks will eventually be assigned to the same physical index -1 in the GPU cache.
```diff
-for key, _ in candidates:
-    self._blocks.pop(key)
-    self._free_block(self._blocks.get(key) or FileBlockStatus(-1))
+for key, block in candidates:
+    self._blocks.pop(key)
+    self._free_block(block)
```
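A minimal illustration of the root cause: `dict.pop(key)` removes the entry and returns the stored value, so a follow-up `get(key)` on the same key can only return `None` (the hypothetical key and value below are for illustration only):

```python
blocks = {"k1": 101}

value = blocks.pop("k1")   # pop removes the entry AND returns its value
print(value)               # 101
print(blocks.get("k1"))    # None — the key is already gone
```

This also means the suggested fix could be written even more compactly as `self._free_block(self._blocks.pop(key))`, since `pop` already hands back the block.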
```python
gpu_slice = gpu_tensor[int(block_id)].cpu()
src_bytes = gpu_slice.numpy().tobytes()
```
The current implementation performs a synchronous GPU-to-CPU copy using .cpu() and then creates multiple copies of the data via numpy().tobytes(). This is inefficient and blocks the worker thread. You should use the pre-allocated pinned memory buffers (self._cpu_buffers) and torch.Tensor.copy_ with non_blocking=True to allow for overlapped computation and I/O, which is the standard practice in vLLM for offloading.
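A sketch of what the pinned-buffer pattern could look like. All names here (`src`, `block_id`, the buffer shape) are illustrative assumptions, not vLLM's actual internals, and the sketch falls back to plain CPU tensors when CUDA is unavailable so it remains runnable:

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Stand-in for one layer's KV cache: 4 blocks of 8 elements each.
src = torch.arange(32, dtype=torch.float32, device=device).reshape(4, 8)
block_id = 2

# Pre-allocated pinned (page-locked) host buffer; pinned memory is what
# allows the DMA transfer to proceed asynchronously.
cpu_buffer = torch.empty(8, dtype=torch.float32, pin_memory=use_cuda)

# non_blocking=True lets the copy overlap with other GPU work; the host
# must synchronize before it is safe to read the bytes.
cpu_buffer.copy_(src[block_id], non_blocking=True)
if use_cuda:
    torch.cuda.synchronize()

src_bytes = cpu_buffer.numpy().tobytes()  # single host-side serialization
```

Compared with `.cpu().numpy().tobytes()`, this avoids allocating a fresh host tensor per block and keeps the worker thread free while the DMA copy is in flight.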
```python
# Write to file (create if not exists)
os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
with open(file_path, "wb") as f:
    f.seek(offset)
    f.write(src_bytes)
```
Opening the file with "wb" mode will truncate the file. If the FileLoadStoreSpec specifies a non-zero offset (as suggested by the docstring for mmap-style access), any existing data in the file will be lost. To support multiple blocks per file, you should use "rb+" (creating the file first if necessary) or handle the truncation logic more carefully.
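A self-contained sketch of the suggested fix, using only the standard library (the helper name `write_at_offset` is hypothetical): create the file once if missing, then open it with `"r+b"` so writes at an offset leave earlier blocks intact:

```python
import os
import tempfile

def write_at_offset(path: str, offset: int, data: bytes) -> None:
    # Create the file if it does not exist yet, without truncating it.
    if not os.path.exists(path):
        open(path, "wb").close()
    # "r+b" allows positioned writes; "wb" would wipe existing contents.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)

path = os.path.join(tempfile.mkdtemp(), "kv.bin")
write_at_offset(path, 0, b"AAAA")   # first block at offset 0
write_at_offset(path, 4, b"BBBB")   # second block at a non-zero offset
with open(path, "rb") as f:
    content = f.read()
print(content)  # b'AAAABBBB' — the first block survives the second write
```

With the original `"wb"` mode, the second call would have truncated the file and the first block would be lost.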
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Force-pushed from bbd8ce3 to d2d0604
Hi @chaunceyjiang, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
@chaunceyjiang We actually plan to support file-based offloading on top of CPU offloading, instead of directly from GPU. I think that for GPU->FILE we want to implement a GDS-based backend.
I think a NIXL-based solution should be interesting, as it supports both GDS (file) and direct S3 (object store).
@chaunceyjiang You should be aware that we have a NIXL-based object store backend PR for llm-d, and we plan to make it available as a secondary tier for vLLM multi-tier offloading (#40020).
Purpose
Add Support for File-Based KV Cache Offloading
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.