[KV offload][3/N] Add worker-side CPU support #21448
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs don't trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces worker-side support for CPU offloading, a significant feature enhancement. The changes include new abstractions for offloading, a worker queue manager for handling asynchronous transfers, and the necessary CPU-specific logic for tensor creation and data movement. The implementation is accompanied by a comprehensive set of tests that cover both the transfer logic and the asynchronous worker management.
My review has identified a couple of high-severity issues. One is a misleading docstring in the OffloadingManager abstract class that could lead to incorrect implementations. The other is the use of __del__ for resource cleanup in OffloadingQueueManager, which is unreliable and could lead to resource leaks. Addressing these points will improve the robustness and maintainability of the new offloading framework. Overall, this is a well-structured contribution.
Force-pushed 4319481 to 1b7c0a0
Force-pushed 3540bd7 to 1b569c5
Force-pushed 2bb5c5a to 670b67a
@njhill As you can see, in this PR I reused the existing swap_blocks kernel. The other option is to create another kernel.
Force-pushed 670b67a to a63b78f
Force-pushed a15e2b4 to 0084b78
ApostaC left a comment:
Got some questions regarding the skip_blocks semantics and the compatibility with other attention backends.
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
def block_ids(specs_list: list[LoadStoreSpec],
              block_size_factor: int,
              skip_count: int = 0) -> Iterator[int]:
```
In what case will we have skip_count > 0?
Based on the code, if spec_list = [0, 1, 3], block_size_factor = 4, and skip_count = 1, the output will be: "[1, 2, 3, 5, 6, 7, 13, 14, 15]", which looks a bit weird.
As you noted below, skip_count > 0 is for the case where the CPU block size is larger than the GPU block size, AND vLLM's scheduler requests a load starting from the middle of a CPU block.
The swap_blocks function assumes block_ids are all given in GPU block size.
It is not actually aware of the CPU block size.
Assuming for example that the GPU block size is 16, and CPU block size is 64 (block_size_factor=4).
So this function (block_ids) translates the given CPU blocks [0,1,3] to blocks of size 16.
The first CPU block 0 matches sub-blocks [0, 1, 2, 3].
The second CPU block 1 matches sub-blocks [4, 5, 6, 7]
The third CPU block 3 matches sub-blocks [12, 13, 14, 15]
Summing up, we get the following sub-blocks:
[0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
Now since skip_count=1, the first sub-block is omitted and we get:
[1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
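The translation described above can be sketched as a small generator. This is an illustrative re-derivation of the semantics from the worked example, not the PR's exact implementation:

```python
from collections.abc import Iterator

def block_ids(blocks: list[int], block_size_factor: int,
              skip_count: int = 0) -> Iterator[int]:
    """Translate CPU block ids to GPU-sized sub-block ids, skipping the
    first skip_count sub-blocks of the overall sequence."""
    for block_id in blocks:
        base = block_id * block_size_factor
        for i in range(skip_count, block_size_factor):
            yield base + i
        skip_count = 0  # only the first block may be partially skipped

# GPU block size 16, CPU block size 64 -> block_size_factor = 4
print(list(block_ids([0, 1, 3], 4, skip_count=1)))
# [1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
```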
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
dst_sub_blocks_to_skip = (-len(src_specs) % dst_block_size_factor)

assert (
    len(src_specs) *
    src_block_size_factor == len(dst_specs) * dst_block_size_factor -
    dst_sub_blocks_to_skip)
```
I understand that in some cases there will be a "sub block" that needs to be skipped, since CPU blocks can be larger than GPU blocks and the transfer can start in the middle of a CPU block.
However, since CPU could be either src or dst, why do we only have dst_sub_blocks_to_skip but no "src_sub_blocks_to_skip"?
When storing GPU->CPU, the offloading connector always aligns writes to the CPU block size.
Thus we don't need to support skip_count > 0 for the source specs.
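For intuition, here is the arithmetic behind the assertion quoted above, with made-up values and assuming the source uses GPU-sized blocks (src_block_size_factor == 1):

```python
# Illustrative values, not taken from the PR.
src_specs = list(range(10))    # 10 GPU blocks to transfer
src_block_size_factor = 1
dst_block_size_factor = 4      # one CPU block spans 4 GPU blocks

# GPU-sized sub-blocks of the destination that carry no data:
dst_sub_blocks_to_skip = -len(src_specs) % dst_block_size_factor

# 3 CPU blocks (12 GPU-sized sub-blocks) cover the 10 GPU blocks:
dst_specs = list(range(3))
assert (len(src_specs) * src_block_size_factor
        == len(dst_specs) * dst_block_size_factor - dst_sub_blocks_to_skip)
print(dst_sub_blocks_to_skip)  # 2
```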
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
block_ids(dst_specs, dst_block_size_factor,
          dst_sub_blocks_to_skip)))
```
Regarding the skip_count, should we only skip the sub block in the first block?
Also, will there be cases that we need to skip the end of a block? Let's say we have 10 GPU blocks, and the block_size_factor = 4, then we probably need to skip the last 2 blocks.
The only skip scenario is up to block_size_factor GPU blocks from the destination specs.
Skipping the last blocks is not needed, since the lookup function (get_num_new_matched_tokens) will not look up partial CPU blocks.
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
else torch.cuda.Event()
with torch.cuda.stream(stream):
    for src_cache, dst_cache in zip(src_caches, dst_caches):
        self.attn_backend.swap_blocks(src_cache, dst_cache, src_to_dst)
```
Do other attention backends also have this swap_blocks implemented? (such as FlashInfer and FlashMLA)
In this PR I only targeted the default backend which is FlashAttention.
I think that extending support for CPU offloading to other backends, and more generally to other accelerators (TPU) should be done in follow-up PRs.
swap_blocks was only used in V0 and isn't implemented for the V1 attention back-ends. TBD if we want to add this to the V1 attention implementations just for this. cc @WoosukKwon
I think it's as important to support other backends like MLA and FlashInfer ... that is the default for Blackwell.
Perhaps for now we can just call the kernel directly, i.e. `_custom_ops.swap_blocks()`. However, whether we need to do this for k and v separately depends on the backend. It could be determined via something like: `attn_backend.get_kv_cache_shape(1, 16, 1, 1)[0] == 2`
we could also compare performance of NIXL for this
Changed to using _custom_ops directly, so this should work with any attention backend with shape (2, num_blocks, ...)
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
    src_block_size_factor == len(dst_specs) * dst_block_size_factor -
    dst_sub_blocks_to_skip)

src_to_dst_list: list[tuple[int, int]] = list(
```
Instead of constructing this list, it would probably be more efficient to allocate and populate a num_blocks x 2 numpy array which can then be viewed as the cpu mapping tensor directly.
I switched to using tensors directly
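For reference, the suggested approach can be sketched as follows: allocate the (num_blocks, 2) mapping as one NumPy array instead of a Python list of tuples; a CPU torch tensor can then view it zero-copy via `torch.from_numpy(mapping)`. The block ids below are made-up example data:

```python
import numpy as np

num_blocks = 4
mapping = np.empty((num_blocks, 2), dtype=np.int64)
mapping[:, 0] = np.arange(num_blocks)         # example src block ids
mapping[:, 1] = np.arange(num_blocks) + 100   # example dst block ids
# torch.from_numpy(mapping) would give a tensor sharing this memory.
print(mapping.tolist())
# [[0, 100], [1, 101], [2, 102], [3, 103]]
```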
Force-pushed 0084b78 to c219067
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
src_key_cache = src_cache[0]
dst_key_cache = dst_cache[0]
ops.swap_blocks(src_key_cache, dst_key_cache, src_to_dst)
src_value_cache = src_cache[1]
dst_value_cache = dst_cache[1]
ops.swap_blocks(src_value_cache, dst_value_cache, src_to_dst)
```
We should handle the case where the block dimension is the first dimension too (this includes FlashInfer). Per my earlier comment, we can detect that from the attention backend and have a flag to control the behavior here.
Thanks!
I pushed a new implementation which I think should work for both FlashInfer and MLA (also added a test).
Also, now supporting multiple attention backends (for hybrid memory allocator).
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
src_to_dst = torch.column_stack(
    (expand_block_ids(src_blocks, src_block_size_factor),
     expand_block_ids(dst_blocks,
                      dst_block_size_factor)[dst_sub_blocks_to_skip:]))
```
This looks good!
We could potentially optimize further by creating the empty final tensor first and having the expand_block_ids methods write into that rather than each creating an intermediate concatenated tensor.
I implemented something like this now.
Force-pushed c219067 to 16a5ac9
`vllm/v1/offloading/worker/cpu_gpu.py` (Outdated)

```python
for block_id in block_ids:
    base_block_id = block_id * block_size_factor
    for i in range(skip_count, block_size_factor):
        output[output_idx] = base_block_id + i
        output_idx += 1

    # finished skipping
    skip_count = 0
```
Updating a tensor element-wise is actually very slow. I was thinking more like what you had before, but passing an `out` arg to `torch.cat`.
Alternatively, this could be kept as a numpy array (which I think block_ids should be already, per the other comment) ... numpy arrays are much more efficient to manipulate.
By the way I don't think these specific optimizations should hold up getting the functionality in, can always be follow-on updates.
I switched to using numpy
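A vectorized numpy version of this expansion, writing into a preallocated output, could look like the sketch below. The function name mirrors the PR, but the body is an illustrative re-implementation, not the merged code:

```python
import numpy as np

def expand_block_ids(block_ids: np.ndarray, block_size_factor: int,
                     output: np.ndarray, skip_first: int = 0) -> None:
    """Expand each block id b into its GPU-sized sub-block ids
    [b*f, ..., b*f + f - 1], dropping the first skip_first entries,
    and write the result into the preallocated output array."""
    expanded = (block_ids[:, None] * block_size_factor +
                np.arange(block_size_factor)).reshape(-1)
    output[:] = expanded[skip_first:]

ids = np.array([0, 1, 3])
out = np.empty(3 * 4 - 1, dtype=ids.dtype)
expand_block_ids(ids, 4, out, skip_first=1)
print(out.tolist())  # [1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
```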
Force-pushed 16a5ac9 to a427e75
```python
for src_tensor, dst_tensor, kv_dim in zip(
        src_tensors, dst_tensors, self.kv_dim_before_num_blocks):
    if kv_dim:
        src_key_cache = src_tensor[0]
        dst_key_cache = dst_tensor[0]
        ops.swap_blocks(src_key_cache, dst_key_cache,
                        src_to_dst_tensor)
        src_value_cache = src_tensor[1]
        dst_value_cache = dst_tensor[1]
        ops.swap_blocks(src_value_cache, dst_value_cache,
                        src_to_dst_tensor)
    else:
        ops.swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor)
```
Not for this PR, but I think we could potentially abstract this in a follow-on with #24690.
This commit adds worker-side support for CPU offloading. It uses the swap_blocks function to perform the actual copying between GPU and CPU. Supports any CPU block size which is divisible by the GPU block size. Signed-off-by: Or Ozeri <oro@il.ibm.com>
a427e75 to
a9ceb96
Compare
Signed-off-by: charlifu <charlifu@amd.com>
This PR adds worker-side support for CPU offloading.
It uses the swap_blocks function to perform the actual copying between CPU and GPU.
Supports any cpu_block_size which is divisible by gpu_block_size.
Part of the work described in RFC #19854.
Depends on #19848.