[BugFix] kv_offloading: Fix bug in loading of partial cpu blocks#28951
Conversation
There was a problem hiding this comment.
Code Review
This pull request addresses a bug in loading partial CPU blocks for KV cache offloading. The core logic is adjusted to correctly skip sub-blocks from the source (CPU) rather than the destination, and the test suite is updated to validate this scenario. However, this change introduces a critical issue where the block mapping array src_to_dst is allocated with an incorrect size. This can lead to reading uninitialized memory and subsequent incorrect data transfers. I have provided a specific code suggestion to rectify this allocation bug.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
b7ad215 to
ee67691
Compare
This commit fixes a bug when trying to load from the middle of a CPU block. This can happen if cpu_block_size > gpu_block_size, and there's both a cpu and gpu (prefix cache) hit, where the gpu hit ends in the middle of a cpu block. Before this commit, the code tried to wrongfully address the other direction, storing to the middle of a cpu block. But this is impossible since the offloading connector always stores full CPU blocks. Signed-off-by: Or Ozeri <oro@il.ibm.com>
ee67691 to
b5a0482
Compare
|
@ApostaC can you please have a look? |
ApostaC
left a comment
There was a problem hiding this comment.
LGTM.
One quick question: is there any chance that the CPU block size is smaller than the GPU block size? In this case, we also need to skip dst (GPU) sub blocks when doing CPU -> GPU transfer.
It's impossible. See here: vllm/vllm/v1/kv_offload/spec.py Line 38 in 4c23690 |
…m-project#28951) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
…m-project#28951) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
…m-project#28951) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
…m-project#28951) Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Fixes #28950
This PR fixes a bug when trying to load from the middle of a CPU block. This can happen if cpu_block_size > gpu_block_size, and there's both a cpu and gpu (prefix cache) hit, where the gpu hit ends in the middle of a cpu block. Before this commit, the code tried to wrongfully address the other direction, storing to the middle of a cpu block. But this is impossible since the offloading connector always stores full CPU blocks.