
[Demo] Add Support for File-Based KV Cache Offloading #40330

Closed
chaunceyjiang wants to merge 2 commits into vllm-project:main from chaunceyjiang:file-offloading-connector

Conversation


@chaunceyjiang chaunceyjiang commented Apr 20, 2026

Purpose

Add Support for File-Based KV Cache Offloading

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


@claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 label Apr 20, 2026
@chaunceyjiang chaunceyjiang changed the title [Demo] add File kv_cache offloading supports [Demo] Add Support for File-Based KV Cache Offloading Apr 20, 2026

@gemini-code-assist Bot left a comment


Code Review

This pull request implements file-based KV cache offloading for vLLM, introducing a manager for disk storage and a handler for asynchronous data transfers between GPU memory and files. The review identifies a critical bug in the block eviction logic where keys are removed from the tracking dictionary before their block IDs are retrieved, which would lead to data corruption. Additionally, feedback highlights performance bottlenecks in the GPU-to-CPU transfer path, redundant directory creation calls within I/O loops, and a logic error in file opening modes that would cause unintended data truncation when using block offsets.

Comment on lines +172 to +174

```python
for key, _ in candidates:
    self._blocks.pop(key)
    self._free_block(self._blocks.get(key) or FileBlockStatus(-1))
```
Contributor

critical

There is a critical logic error in the eviction process. The code pops the key from self._blocks and then immediately tries to retrieve it again using self._blocks.get(key), which will always return None. This results in a new FileBlockStatus with block_id=-1 being passed to _free_block, causing the _free_list to be populated with -1. This will lead to data corruption as multiple offloaded blocks will eventually be assigned to the same physical index -1 in the GPU cache.

Suggested change

```python
# before
for key, _ in candidates:
    self._blocks.pop(key)
    self._free_block(self._blocks.get(key) or FileBlockStatus(-1))

# after
for key, block in candidates:
    self._blocks.pop(key)
    self._free_block(block)
```
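The pop-then-get failure mode is easy to reproduce in isolation. The snippet below uses a minimal two-line stand-in for `FileBlockStatus`, not the PR's actual class:

```python
# Minimal repro of the pop-then-get bug (FileBlockStatus is a stand-in).
class FileBlockStatus:
    def __init__(self, block_id: int):
        self.block_id = block_id

blocks = {"k1": FileBlockStatus(7)}
key = "k1"
status = blocks.pop(key)           # entry is removed here...
assert blocks.get(key) is None     # ...so a second lookup always misses

# The buggy pattern therefore frees a sentinel instead of block 7:
freed = blocks.get(key) or FileBlockStatus(-1)
assert freed.block_id == -1        # every evicted key "frees" index -1

# Iterating (key, block) pairs directly keeps the real block in hand:
assert status.block_id == 7
```

This is why the suggested change iterates `(key, block)` pairs rather than looking the value up again after popping it.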

Comment on lines +151 to +152

```python
gpu_slice = gpu_tensor[int(block_id)].cpu()
src_bytes = gpu_slice.numpy().tobytes()
```
Contributor

high

The current implementation performs a synchronous GPU-to-CPU copy using .cpu() and then creates multiple copies of the data via numpy().tobytes(). This is inefficient and blocks the worker thread. You should use the pre-allocated pinned memory buffers (self._cpu_buffers) and torch.Tensor.copy_ with non_blocking=True to allow for overlapped computation and I/O, which is the standard practice in vLLM for offloading.
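The pinned-buffer path the reviewer suggests could be sketched as follows. The `offload_block` helper, its signature, and the single pre-allocated buffer are illustrative assumptions, not code from this PR:

```python
import torch

# Sketch of the suggested pinned-buffer transfer (names are assumptions).
def offload_block(gpu_tensor: torch.Tensor, block_id: int,
                  cpu_buffer: torch.Tensor) -> bytes:
    # cpu_buffer should be allocated once with pin_memory=True so the DMA
    # engine can copy without staging through pageable memory.
    cpu_buffer.copy_(gpu_tensor[block_id], non_blocking=True)
    if gpu_tensor.is_cuda:
        # The copy is asynchronous; wait before reading the bytes. In a real
        # pipeline this would be an event wait on the transfer stream, which
        # lets compute and I/O overlap instead of a hard synchronize.
        torch.cuda.synchronize()
    return cpu_buffer.numpy().tobytes()
```

With `non_blocking=True` and a pinned destination, the host-to-device path stays off the default stream's critical path, which is the overlap the review comment is asking for.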

```python
src_bytes = gpu_slice.numpy().tobytes()

# Write to file (create if not exists)
os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
```
Contributor

high

Calling os.makedirs inside the block loop is redundant and inefficient. The storage_dir is already created in the FileOffloadingManager constructor. Since all file paths are derived from this directory, this check adds unnecessary overhead to every block transfer.
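Hoisting the directory creation into the constructor might look like the following sketch; the `FileOffloadingManager` attribute and method names here are assumptions, not the PR's actual code:

```python
import os

# Sketch: create the storage directory once, not per block write.
class FileOffloadingManager:
    def __init__(self, storage_dir: str):
        self.storage_dir = storage_dir
        os.makedirs(storage_dir, exist_ok=True)  # created once, up front

    def block_path(self, block_hash: str) -> str:
        # Every path derives from storage_dir, so no makedirs is needed
        # inside the per-block transfer loop.
        return os.path.join(self.storage_dir, f"{block_hash}.bin")
```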

Comment on lines +156 to +158

```python
with open(file_path, "wb") as f:
    f.seek(offset)
    f.write(src_bytes)
```
Contributor

high

Opening the file with "wb" mode will truncate the file. If the FileLoadStoreSpec specifies a non-zero offset (as suggested by the docstring for mmap-style access), any existing data in the file will be lost. To support multiple blocks per file, you should use "rb+" (creating the file first if necessary) or handle the truncation logic more carefully.
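A minimal sketch of the offset-preserving alternative the reviewer describes; the `write_block` helper name and the create-then-`"r+b"` strategy are our assumptions, not the PR's code:

```python
import os

# "wb" truncates, so any blocks already stored earlier in the file would be
# lost. Creating the file once, then opening in "r+b", preserves existing
# contents while still allowing writes at arbitrary offsets.
def write_block(file_path: str, offset: int, data: bytes) -> None:
    if not os.path.exists(file_path):
        with open(file_path, "xb"):  # "x" fails if the file already exists
            pass
    with open(file_path, "r+b") as f:
        f.seek(offset)  # seeking past EOF is fine; the gap reads as zeros
        f.write(data)
```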

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang force-pushed the file-offloading-connector branch from bbd8ce3 to d2d0604 on April 20, 2026 at 08:39

mergify Bot commented Apr 20, 2026

Hi @chaunceyjiang, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, use the following command to run it locally:

```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
```


orozery commented Apr 20, 2026

@chaunceyjiang We actually plan to support file-based offloading on top of CPU offloading, instead of offloading directly from the GPU.
It will be based on #40020 and should be both performant and much simpler.

I think that for GPU->FILE we want to implement a GDS-based backend.


orozery commented Apr 20, 2026

> I think that for GPU->FILE we want to implement a GDS-based backend.

I think a Nixl-based solution should be interesting as it supports both GDS (file) and Direct S3 (object store).

@effi-ofer commented

> I think that for GPU->FILE we want to implement a GDS-based backend.
>
> I think a Nixl-based solution should be interesting as it supports both GDS (file) and Direct S3 (object store).

@chaunceyjiang You should be aware that we have a Nixl-based object store backend PR for llm-d and plan to make it available as a secondary tier for vLLM multi-tier offloading, #40020.

3 participants