[Demo] Add Support for File-Based KV Cache Offloading#40330
chaunceyjiang wants to merge 2 commits into vllm-project:main from
Conversation
Code Review
This pull request implements file-based KV cache offloading for vLLM, introducing a manager for disk storage and a handler for asynchronous data transfers between GPU memory and files. The review identifies a critical bug in the block eviction logic where keys are removed from the tracking dictionary before their block IDs are retrieved, which would lead to data corruption. Additionally, feedback highlights performance bottlenecks in the GPU-to-CPU transfer path, redundant directory creation calls within I/O loops, and a logic error in file opening modes that would cause unintended data truncation when using block offsets.
```python
for key, _ in candidates:
    self._blocks.pop(key)
    self._free_block(self._blocks.get(key) or FileBlockStatus(-1))
```
There is a critical logic error in the eviction process. The code pops the key from self._blocks and then immediately tries to retrieve it again using self._blocks.get(key), which will always return None. This results in a new FileBlockStatus with block_id=-1 being passed to _free_block, causing the _free_list to be populated with -1. This will lead to data corruption as multiple offloaded blocks will eventually be assigned to the same physical index -1 in the GPU cache.
```diff
-for key, _ in candidates:
-    self._blocks.pop(key)
-    self._free_block(self._blocks.get(key) or FileBlockStatus(-1))
+for key, block in candidates:
+    self._blocks.pop(key)
+    self._free_block(block)
```
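A minimal illustration of the root cause: `dict.pop(key)` removes the entry and returns the stored value, so a follow-up `get(key)` on the same key can only return `None` (the hypothetical key and value below are for illustration only):

```python
blocks = {"k1": 101}

value = blocks.pop("k1")   # pop removes the entry AND returns its value
print(value)               # 101
print(blocks.get("k1"))    # None — the key is already gone
```

This also means the suggested fix could be written even more compactly as `self._free_block(self._blocks.pop(key))`, since `pop` already hands back the block.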
```python
gpu_slice = gpu_tensor[int(block_id)].cpu()
src_bytes = gpu_slice.numpy().tobytes()
```
The current implementation performs a synchronous GPU-to-CPU copy using .cpu() and then creates multiple copies of the data via numpy().tobytes(). This is inefficient and blocks the worker thread. You should use the pre-allocated pinned memory buffers (self._cpu_buffers) and torch.Tensor.copy_ with non_blocking=True to allow for overlapped computation and I/O, which is the standard practice in vLLM for offloading.
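A sketch of what the pinned-buffer pattern could look like. All names here (`src`, `block_id`, the buffer shape) are illustrative assumptions, not vLLM's actual internals, and the sketch falls back to plain CPU tensors when CUDA is unavailable so it remains runnable:

```python
import torch

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

# Stand-in for one layer's KV cache: 4 blocks of 8 elements each.
src = torch.arange(32, dtype=torch.float32, device=device).reshape(4, 8)
block_id = 2

# Pre-allocated pinned (page-locked) host buffer; pinned memory is what
# allows the DMA transfer to proceed asynchronously.
cpu_buffer = torch.empty(8, dtype=torch.float32, pin_memory=use_cuda)

# non_blocking=True lets the copy overlap with other GPU work; the host
# must synchronize before it is safe to read the bytes.
cpu_buffer.copy_(src[block_id], non_blocking=True)
if use_cuda:
    torch.cuda.synchronize()

src_bytes = cpu_buffer.numpy().tobytes()  # single host-side serialization
```

Compared with `.cpu().numpy().tobytes()`, this avoids allocating a fresh host tensor per block and keeps the worker thread free while the DMA copy is in flight.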
```python
# Write to file (create if not exists)
os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
with open(file_path, "wb") as f:
    f.seek(offset)
    f.write(src_bytes)
```
Opening the file with "wb" mode will truncate the file. If the FileLoadStoreSpec specifies a non-zero offset (as suggested by the docstring for mmap-style access), any existing data in the file will be lost. To support multiple blocks per file, you should use "rb+" (creating the file first if necessary) or handle the truncation logic more carefully.
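A self-contained sketch of the suggested fix, using only the standard library (the helper name `write_at_offset` is hypothetical): create the file once if missing, then open it with `"r+b"` so writes at an offset leave earlier blocks intact:

```python
import os
import tempfile

def write_at_offset(path: str, offset: int, data: bytes) -> None:
    # Create the file if it does not exist yet, without truncating it.
    if not os.path.exists(path):
        open(path, "wb").close()
    # "r+b" allows positioned writes; "wb" would wipe existing contents.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)

path = os.path.join(tempfile.mkdtemp(), "kv.bin")
write_at_offset(path, 0, b"AAAA")   # first block at offset 0
write_at_offset(path, 4, b"BBBB")   # second block at a non-zero offset
with open(path, "rb") as f:
    content = f.read()
print(content)  # b'AAAABBBB' — the first block survives the second write
```

With the original `"wb"` mode, the second call would have truncated the file and the first block would be lost.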
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Force-pushed from bbd8ce3 to d2d0604
Hi @chaunceyjiang, the pre-commit checks have failed. Please run:

```shell
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
@chaunceyjiang We actually plan to support file-based offloading on top of CPU offloading, instead of directly from GPU. I think that for GPU->FILE we want to implement a GDS-based backend.
I think a NIXL-based solution should be interesting, as it supports both GDS (file) and direct S3 (object store).
@chaunceyjiang You should be aware that we have a NIXL-based object store backend PR for llm-d, and we plan to make it available as a secondary tier for vLLM multi-tier offloading (#40020).
Purpose
Add Support for File-Based KV Cache Offloading
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.