
[KV offload][3/N] Add worker-side CPU support #21448

Merged
njhill merged 1 commit into vllm-project:main from orozery:cpu-offloading-worker
Sep 19, 2025

Conversation

Collaborator

@orozery orozery commented Jul 23, 2025

This PR adds worker-side support for CPU offloading.
It uses the swap_blocks function to perform the actual copying between CPU and GPU.
Supports any cpu_block_size that is divisible by gpu_block_size.

Part of the work described in RFC #19854
Depends on #19848

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces worker-side support for CPU offloading, a significant feature enhancement. The changes include new abstractions for offloading, a worker queue manager for handling asynchronous transfers, and the necessary CPU-specific logic for tensor creation and data movement. The implementation is accompanied by a comprehensive set of tests that cover both the transfer logic and the asynchronous worker management.

My review has identified a couple of high-severity issues. One is a misleading docstring in the OffloadingManager abstract class that could lead to incorrect implementations. The other is the use of __del__ for resource cleanup in OffloadingQueueManager, which is unreliable and could lead to resource leaks. Addressing these points will improve the robustness and maintainability of the new offloading framework. Overall, this is a well-structured contribution.

@orozery orozery force-pushed the cpu-offloading-worker branch 4 times, most recently from 4319481 to 1b7c0a0 Compare July 27, 2025 11:14
@orozery orozery force-pushed the cpu-offloading-worker branch 3 times, most recently from 3540bd7 to 1b569c5 Compare August 6, 2025 14:25
@orozery orozery force-pushed the cpu-offloading-worker branch 3 times, most recently from 2bb5c5a to 670b67a Compare August 11, 2025 19:13
Collaborator Author

orozery commented Aug 14, 2025

@njhill As you can see, in this PR I reused the swap_blocks from V0.
However, looking into the actual CUDA code in cache_kernels.cu, I see that it uses cudaMemcpyAsync.
This means that when the call returns, I cannot be sure that the transfer actually completed.
This violates my design assumption that my TransferFunction should be synchronous.
I was thinking of modifying the current kernel to use cudaMemcpy instead.
This will change it for V0 as well, so maybe it's not a good idea?

The other option is to create another kernel.
I would actually prefer a kernel that handles the swap for all layers at once.
This can improve performance since it will reduce the number of Python-to-CUDA context switches by a factor of num_layers.
The question then is how should I expose this new kernel to the code? Should I add another API function to AttentionBackend?

@orozery orozery force-pushed the cpu-offloading-worker branch from 670b67a to a63b78f Compare August 14, 2025 12:38
@orozery orozery force-pushed the cpu-offloading-worker branch 2 times, most recently from a15e2b4 to 0084b78 Compare September 4, 2025 13:05
Collaborator

@ApostaC ApostaC left a comment

Got some questions regarding the skip_blocks semantics and the compatibility with other attention backends.


def block_ids(specs_list: list[LoadStoreSpec],
block_size_factor: int,
skip_count: int = 0) -> Iterator[int]:
Collaborator

In what case will we have skip_count > 0?

Based on the code, if spec_list = [0, 1, 3], block_size_factor = 4, and skip_count = 1, the output will be: "[1, 2, 3, 5, 6, 7, 13, 14, 15]", which looks a bit weird.

Collaborator Author

As you noted below, skip_count > 0 is for the case where the CPU block size is larger than the GPU block size, AND vLLM's scheduler requests a load starting from the middle of a CPU block.
The swap_blocks function assumes block_ids are all given in GPU block size.
It is actually not aware of the CPU block size.

Assuming for example that the GPU block size is 16, and CPU block size is 64 (block_size_factor=4).
So this function (block_ids) translates the given CPU blocks [0,1,3] to blocks of size 16.
The first CPU block 0 matches sub-blocks [0, 1, 2, 3].
The second CPU block 1 matches sub-blocks [4, 5, 6, 7].
The third CPU block 3 matches sub-blocks [12, 13, 14, 15].
Summing up, we get the following sub-blocks:
[0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
Now since skip_count=1, the first sub-block is omitted and we get:
[1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
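The translation described above can be sketched as a small generator. This is only an illustration of the logic from the example, not the PR's exact code — the real function takes a list of LoadStoreSpec objects rather than plain ints:

```python
from collections.abc import Iterator

def block_ids(cpu_block_ids: list[int],
              block_size_factor: int,
              skip_count: int = 0) -> Iterator[int]:
    """Translate block IDs given in CPU block size into GPU-sized
    sub-block IDs, skipping the first skip_count sub-blocks."""
    for block_id in cpu_block_ids:
        base_block_id = block_id * block_size_factor
        for i in range(skip_count, block_size_factor):
            yield base_block_id + i
        # only the first block can be partially skipped
        skip_count = 0

# CPU blocks [0, 1, 3] with block_size_factor=4 and skip_count=1,
# as in the example above
print(list(block_ids([0, 1, 3], 4, skip_count=1)))
# → [1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
```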

Comment on lines +105 to +110
dst_sub_blocks_to_skip = (-len(src_specs) % dst_block_size_factor)

assert (
len(src_specs) *
src_block_size_factor == len(dst_specs) * dst_block_size_factor -
dst_sub_blocks_to_skip)
Collaborator

I understand in some cases there will be a "sub block" that needs to be skipped since CPU blocks could be larger than GPU blocks, and the transfer can start in the middle of a CPU block.

However, since CPU could be both src and dst, why we only have dst_sub_blocks_to_skip but no "src_sub_blocks_to_skip"?

Collaborator Author

@orozery orozery Sep 12, 2025

When storing GPU->CPU, the offloading connector always aligns writes to the CPU block size.
Thus we don't need to support skip_count > 0 for the source specs.
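The alignment arithmetic in the quoted assertion can be checked with hypothetical counts (illustrative numbers only, matching the 16/64 block-size example):

```python
# Hypothetical counts: 11 source sub-blocks against 3 destination blocks
# that each cover 4 sub-blocks (12 total), so 1 leading sub-block is skipped.
src_specs = list(range(11))
dst_specs = [0, 1, 3]
src_block_size_factor = 1
dst_block_size_factor = 4

# In Python, -n % k is the shortfall of n to the next multiple of k.
dst_sub_blocks_to_skip = -len(src_specs) % dst_block_size_factor
print(dst_sub_blocks_to_skip)  # → 1

assert (len(src_specs) * src_block_size_factor
        == len(dst_specs) * dst_block_size_factor - dst_sub_blocks_to_skip)
```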

Comment on lines +115 to +116
block_ids(dst_specs, dst_block_size_factor,
dst_sub_blocks_to_skip)))
Collaborator

Regarding the skip_count, should we only skip the sub block in the first block?

Also, will there be cases that we need to skip the end of a block? Let's say we have 10 GPU blocks, and the block_size_factor = 4, then we probably need to skip the last 2 blocks.

Collaborator Author

The only skip scenario is up to block_size_factor GPU blocks from the destination specs.
Skipping the last blocks is not needed since the lookup function (get_num_new_matched_tokens) will not lookup partial CPU blocks.

else torch.cuda.Event()
with torch.cuda.stream(stream):
for src_cache, dst_cache in zip(src_caches, dst_caches):
self.attn_backend.swap_blocks(src_cache, dst_cache, src_to_dst)
Collaborator

Do other attention backends also have this swap_blocks implemented? (such as FlashInfer and FlashMLA)

Collaborator Author

In this PR I only targeted the default backend which is FlashAttention.
I think that extending support for CPU offloading to other backends, and more generally to other accelerators (TPU) should be done in follow-up PRs.

Member

swap_blocks was only used in V0 and isn't implemented for the V1 attention backends. TBD if we want to add this to the V1 attention implementations just for this. cc @WoosukKwon

I think it's as important to support other backends like MLA and FlashInfer ... that is the default for Blackwell.

Perhaps for now we can just call the kernel directly i.e. _custom_ops.swap_blocks(). However whether we need to do this for k and v separately depends on the backend. It could be determined via something like:

    attn_backend.get_kv_cache_shape(1, 16, 1, 1)[0] == 2

we could also compare performance of NIXL for this
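The suggested layout probe can be illustrated with stand-in backend classes. These classes are hypothetical, assumed only for this sketch; what mirrors the suggestion above is the `get_kv_cache_shape(1, 16, 1, 1)[0] == 2` check:

```python
# Hypothetical stand-ins for attention backends; the shapes mimic the two
# layouts discussed (K/V dimension first vs. block dimension first).
class FlashAttentionLike:
    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        return (2, num_blocks, block_size, num_kv_heads, head_size)

class FlashInferLike:
    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        return (num_blocks, 2, block_size, num_kv_heads, head_size)

def kv_dim_first(backend) -> bool:
    # Probe with dummy sizes: a leading dimension of 2 means the cache is
    # laid out as (2, num_blocks, ...), so K and V are swapped separately.
    return backend.get_kv_cache_shape(1, 16, 1, 1)[0] == 2

print(kv_dim_first(FlashAttentionLike), kv_dim_first(FlashInferLike))
# → True False
```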

Collaborator Author

Changed to using _custom_ops directly, so this should work with any attention backend with shape (2, num_blocks, ...)

Member

@njhill njhill left a comment

Thanks @orozery for all of the hard work on this


src_block_size_factor == len(dst_specs) * dst_block_size_factor -
dst_sub_blocks_to_skip)

src_to_dst_list: list[tuple[int, int]] = list(
Member

Instead of constructing this list, it would probably be more efficient to allocate and populate a num_blocks x 2 numpy array which can then be viewed as the cpu mapping tensor directly.

Collaborator Author

I switched to using tensors directly

@orozery orozery force-pushed the cpu-offloading-worker branch from 0084b78 to c219067 Compare September 15, 2025 15:32
@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Comment on lines +120 to +125
src_key_cache = src_cache[0]
dst_key_cache = dst_cache[0]
ops.swap_blocks(src_key_cache, dst_key_cache, src_to_dst)
src_value_cache = src_cache[1]
dst_value_cache = dst_cache[1]
ops.swap_blocks(src_value_cache, dst_value_cache, src_to_dst)
Member

We should handle the case where the block dimension is the first dimension too (includes FlashInfer). Per my earlier comment we can detect that from the attention backend and have a flag to control the behavior here.

Collaborator Author

Thanks!
I pushed a new implementation which I think should work for both FlashInfer and MLA (also added a test).
Also, now supporting multiple attention backends (for hybrid memory allocator).

Comment on lines +111 to +114
src_to_dst = torch.column_stack(
(expand_block_ids(src_blocks, src_block_size_factor),
expand_block_ids(dst_blocks,
dst_block_size_factor)[dst_sub_blocks_to_skip:]))
Member

This looks good!

We could potentially optimize further by creating the empty final tensor first and having the expand_block_ids methods write into that rather than each creating an intermediate concatenated tensor.

Collaborator Author

I implemented something like this now.

@orozery orozery force-pushed the cpu-offloading-worker branch from c219067 to 16a5ac9 Compare September 16, 2025 09:27
Member

@njhill njhill left a comment

Thanks @orozery, looks great

Comment on lines +37 to +44
for block_id in block_ids:
base_block_id = block_id * block_size_factor
for i in range(skip_count, block_size_factor):
output[output_idx] = base_block_id + i
output_idx += 1

# finished skipping
skip_count = 0
Member

Updating a tensor element-wise is actually very slow; I was thinking more like what you had before, but passing an out arg to torch.cat.

Alternatively could keep as numpy array (which I think block_ids should be already per other comment) ... which are much more efficient to manipulate.

By the way I don't think these specific optimizations should hold up getting the functionality in, can always be follow-on updates.

Collaborator Author

I switched to using numpy
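A vectorized version in the spirit of this change might look like the following. This is a sketch under assumptions: `expand_block_ids` as written here is an assumed helper shape, not the PR's exact code:

```python
import numpy as np

def expand_block_ids(block_ids: np.ndarray, block_size_factor: int,
                     out: np.ndarray, skip_first: int = 0) -> None:
    # Each block id b expands to [b*f, b*f + 1, ..., b*f + f - 1]; use
    # broadcasting instead of element-wise writes, then drop the skipped
    # prefix and write into the caller-provided output slice.
    expanded = (block_ids[:, None] * block_size_factor
                + np.arange(block_size_factor)).reshape(-1)
    out[:] = expanded[skip_first:]

# Preallocate the final (num_sub_blocks, 2) mapping and fill both columns,
# avoiding intermediate concatenated tensors.
src_to_dst = np.empty((11, 2), dtype=np.int64)
expand_block_ids(np.arange(11), 1, src_to_dst[:, 0])
expand_block_ids(np.array([0, 1, 3]), 4, src_to_dst[:, 1], skip_first=1)
print(src_to_dst[:, 1].tolist())
# → [1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
```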

@orozery orozery force-pushed the cpu-offloading-worker branch from 16a5ac9 to a427e75 Compare September 17, 2025 09:33
Comment on lines +143 to +155
for src_tensor, dst_tensor, kv_dim in zip(
src_tensors, dst_tensors, self.kv_dim_before_num_blocks):
if kv_dim:
src_key_cache = src_tensor[0]
dst_key_cache = dst_tensor[0]
ops.swap_blocks(src_key_cache, dst_key_cache,
src_to_dst_tensor)
src_value_cache = src_tensor[1]
dst_value_cache = dst_tensor[1]
ops.swap_blocks(src_value_cache, dst_value_cache,
src_to_dst_tensor)
else:
ops.swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor)
Member

@njhill njhill Sep 17, 2025

Not for this PR but I think we could potentially abstract this in a follow-on via

def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):

with #24690

@njhill njhill changed the title v1/offloading: Add worker-side CPU support [KV offload][3/N] Add worker-side CPU support Sep 18, 2025
This commit adds worker-side support for CPU offloading.
It uses the swap_blocks function to perform the actual copying between GPU and CPU.
Supports any CPU block size that is divisible by the GPU block size.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the cpu-offloading-worker branch from a427e75 to a9ceb96 Compare September 18, 2025 21:07
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 19, 2025
@njhill njhill enabled auto-merge (squash) September 19, 2025 02:18
Collaborator

@ApostaC ApostaC left a comment

LGTM!

@njhill njhill merged commit 7ac67ea into vllm-project:main Sep 19, 2025
51 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: charlifu <charlifu@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed v1


3 participants