
[KV offload][3/N] Add worker-side CPU support #21448

Merged
njhill merged 1 commit into vllm-project:main from orozery:cpu-offloading-worker
Sep 19, 2025

Conversation

Collaborator

@orozery orozery commented Jul 23, 2025

This PR adds worker-side support for CPU offloading.
It uses the swap_blocks function to perform the actual copying between CPU and GPU.
Supports any cpu_block_size that is divisible by gpu_block_size.

Part of the work described in RFC #19854
Depends on #19848

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces worker-side support for CPU offloading, a significant feature enhancement. The changes include new abstractions for offloading, a worker queue manager for handling asynchronous transfers, and the necessary CPU-specific logic for tensor creation and data movement. The implementation is accompanied by a comprehensive set of tests that cover both the transfer logic and the asynchronous worker management.

My review has identified a couple of high-severity issues. One is a misleading docstring in the OffloadingManager abstract class that could lead to incorrect implementations. The other is the use of __del__ for resource cleanup in OffloadingQueueManager, which is unreliable and could lead to resource leaks. Addressing these points will improve the robustness and maintainability of the new offloading framework. Overall, this is a well-structured contribution.

@orozery orozery force-pushed the cpu-offloading-worker branch 4 times, most recently from 4319481 to 1b7c0a0 Compare July 27, 2025 11:14
@orozery orozery force-pushed the cpu-offloading-worker branch 3 times, most recently from 3540bd7 to 1b569c5 Compare August 6, 2025 14:25
@orozery orozery force-pushed the cpu-offloading-worker branch 3 times, most recently from 2bb5c5a to 670b67a Compare August 11, 2025 19:13
Collaborator Author

orozery commented Aug 14, 2025

@njhill As you can see, in this PR I reused the swap_blocks from V0.
However, looking into the actual CUDA code in cache_kernels.cu, I see that it uses cudaMemcpyAsync.
This means that when the call returns, I cannot be sure that the transfer actually completed.
This violates my design assumption that my TransferFunction should be synchronous.
I was thinking of modifying the current kernel to use cudaMemcpy instead.
This will change it for V0 as well, so maybe it's not a good idea?

The other option is to create another kernel.
I would actually prefer a kernel that handles the swap for all layers at once.
This can improve performance since it will reduce the number of Python-to-CUDA context switches by a factor of num_layers.
The question then is how should I expose this new kernel to the code? Should I add another API function to AttentionBackend?

@orozery orozery force-pushed the cpu-offloading-worker branch from 670b67a to a63b78f Compare August 14, 2025 12:38
@orozery orozery force-pushed the cpu-offloading-worker branch 2 times, most recently from a15e2b4 to 0084b78 Compare September 4, 2025 13:05
Collaborator

@ApostaC ApostaC left a comment

Got some questions regarding the skip_blocks semantics and the compatibility with other attention backends.


def block_ids(specs_list: list[LoadStoreSpec],
block_size_factor: int,
skip_count: int = 0) -> Iterator[int]:
Collaborator

In what case will we have skip_count > 0?

Based on the code, if spec_list = [0, 1, 3], block_size_factor = 4, and skip_count = 1, the output will be: "[1, 2, 3, 5, 6, 7, 13, 14, 15]", which looks a bit weird.

Collaborator Author

As you noted below, skip_count > 0 is for the case where the CPU block size is larger than the GPU block size, AND vLLM's scheduler requests a load starting from the middle of a CPU block.
The swap_blocks function assumes block_ids are all given in GPU block size.
It is actually not aware of the CPU block size.

Assuming for example that the GPU block size is 16, and CPU block size is 64 (block_size_factor=4).
So this function (block_ids) translates the given CPU blocks [0,1,3] to blocks of size 16.
The first CPU block 0 matches sub-blocks [0, 1, 2, 3].
The second CPU block 1 matches sub-blocks [4, 5, 6, 7].
The third CPU block 3 matches sub-blocks [12, 13, 14, 15].
Summing up, we get the following sub-blocks:
[0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
Now since skip_count=1, the first sub-block is omitted and we get:
[1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
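The translation described above can be sketched as a small generator. This is only an illustration of the logic from the example, not the PR's exact code — the real function takes a list of LoadStoreSpec objects rather than plain ints:

```python
from collections.abc import Iterator

def block_ids(cpu_block_ids: list[int],
              block_size_factor: int,
              skip_count: int = 0) -> Iterator[int]:
    """Translate block IDs given in CPU block size into GPU-sized
    sub-block IDs, skipping the first skip_count sub-blocks."""
    for block_id in cpu_block_ids:
        base_block_id = block_id * block_size_factor
        for i in range(skip_count, block_size_factor):
            yield base_block_id + i
        # only the first block can be partially skipped
        skip_count = 0

# CPU blocks [0, 1, 3] with block_size_factor=4 and skip_count=1,
# as in the example above
print(list(block_ids([0, 1, 3], 4, skip_count=1)))
# → [1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
```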

Comment on lines +105 to +110
dst_sub_blocks_to_skip = (-len(src_specs) % dst_block_size_factor)

assert (
len(src_specs) *
src_block_size_factor == len(dst_specs) * dst_block_size_factor -
dst_sub_blocks_to_skip)
Collaborator

I understand in some cases there will be a "sub block" that needs to be skipped since CPU blocks could be larger than GPU blocks, and the transfer can start in the middle of a CPU block.

However, since CPU could be both src and dst, why we only have dst_sub_blocks_to_skip but no "src_sub_blocks_to_skip"?

Collaborator Author

@orozery orozery Sep 12, 2025

When storing GPU->CPU, the offloading connector always aligns writes to the CPU block size.
Thus we don't need to support skip_count > 0 for the source specs.
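The alignment arithmetic in the quoted assertion can be checked with hypothetical counts (illustrative numbers only, matching the 16/64 block-size example):

```python
# Hypothetical counts: 11 source sub-blocks against 3 destination blocks
# that each cover 4 sub-blocks (12 total), so 1 leading sub-block is skipped.
src_specs = list(range(11))
dst_specs = [0, 1, 3]
src_block_size_factor = 1
dst_block_size_factor = 4

# In Python, -n % k is the shortfall of n to the next multiple of k.
dst_sub_blocks_to_skip = -len(src_specs) % dst_block_size_factor
print(dst_sub_blocks_to_skip)  # → 1

assert (len(src_specs) * src_block_size_factor
        == len(dst_specs) * dst_block_size_factor - dst_sub_blocks_to_skip)
```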

Comment on lines +115 to +116
block_ids(dst_specs, dst_block_size_factor,
dst_sub_blocks_to_skip)))
Collaborator

Regarding the skip_count, should we only skip the sub block in the first block?

Also, will there be cases that we need to skip the end of a block? Let's say we have 10 GPU blocks, and the block_size_factor = 4, then we probably need to skip the last 2 blocks.

Collaborator Author

The only skip scenario is up to block_size_factor GPU blocks from the destination specs.
Skipping the last blocks is not needed since the lookup function (get_num_new_matched_tokens) will not lookup partial CPU blocks.

else torch.cuda.Event()
with torch.cuda.stream(stream):
for src_cache, dst_cache in zip(src_caches, dst_caches):
self.attn_backend.swap_blocks(src_cache, dst_cache, src_to_dst)
Collaborator

Do other attention backends also have this swap_blocks implemented? (such as FlashInfer and FlashMLA)

Collaborator Author

In this PR I only targeted the default backend which is FlashAttention.
I think that extending support for CPU offloading to other backends, and more generally to other accelerators (TPU) should be done in follow-up PRs.

Member

swap_blocks was only used in V0 and isn't implemented for the V1 attention backends. TBD if we want to add this to the V1 attention implementations just for this. cc @WoosukKwon

I think it's as important to support other backends like MLA and FlashInfer ... that is the default for Blackwell.

Perhaps for now we can just call the kernel directly i.e. _custom_ops.swap_blocks(). However whether we need to do this for k and v separately depends on the backend. It could be determined via something like:

    attn_backend.get_kv_cache_shape(1, 16, 1, 1)[0] == 2

we could also compare performance of NIXL for this
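The suggested layout probe can be illustrated with stand-in backend classes. These classes are hypothetical, assumed only for this sketch; what mirrors the suggestion above is the `get_kv_cache_shape(1, 16, 1, 1)[0] == 2` check:

```python
# Hypothetical stand-ins for attention backends; the shapes mimic the two
# layouts discussed (K/V dimension first vs. block dimension first).
class FlashAttentionLike:
    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        return (2, num_blocks, block_size, num_kv_heads, head_size)

class FlashInferLike:
    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        return (num_blocks, 2, block_size, num_kv_heads, head_size)

def kv_dim_first(backend) -> bool:
    # Probe with dummy sizes: a leading dimension of 2 means the cache is
    # laid out as (2, num_blocks, ...), so K and V are swapped separately.
    return backend.get_kv_cache_shape(1, 16, 1, 1)[0] == 2

print(kv_dim_first(FlashAttentionLike), kv_dim_first(FlashInferLike))
# → True False
```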

Collaborator Author

Changed to using _custom_ops directly, so this should work with any attention backend with shape (2, num_blocks, ...)

Member

@njhill njhill left a comment

Thanks @orozery for all of the hard work on this


src_block_size_factor == len(dst_specs) * dst_block_size_factor -
dst_sub_blocks_to_skip)

src_to_dst_list: list[tuple[int, int]] = list(
Member

Instead of constructing this list, it would probably be more efficient to allocate and populate a num_blocks x 2 numpy array which can then be viewed as the cpu mapping tensor directly.

Collaborator Author

I switched to using tensors directly

@orozery orozery force-pushed the cpu-offloading-worker branch from 0084b78 to c219067 Compare September 15, 2025 15:32
@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

Comment on lines +120 to +125
src_key_cache = src_cache[0]
dst_key_cache = dst_cache[0]
ops.swap_blocks(src_key_cache, dst_key_cache, src_to_dst)
src_value_cache = src_cache[1]
dst_value_cache = dst_cache[1]
ops.swap_blocks(src_value_cache, dst_value_cache, src_to_dst)
Member

We should handle the case where the block dimension is the first dimension too (includes FlashInfer). Per my earlier comment we can detect that from the attention backend and have a flag to control the behavior here.

Collaborator Author

Thanks!
I pushed a new implementation which I think should work for both FlashInfer and MLA (also added a test).
Also, now supporting multiple attention backends (for hybrid memory allocator).

Comment on lines +111 to +114
src_to_dst = torch.column_stack(
(expand_block_ids(src_blocks, src_block_size_factor),
expand_block_ids(dst_blocks,
dst_block_size_factor)[dst_sub_blocks_to_skip:]))
Member

This looks good!

We could potentially optimize further by creating the empty final tensor first and having the expand_block_ids methods write into that rather than each creating an intermediate concatenated tensor.

Collaborator Author

I implemented something like this now.

@orozery orozery force-pushed the cpu-offloading-worker branch from c219067 to 16a5ac9 Compare September 16, 2025 09:27
Member

@njhill njhill left a comment

Thanks @orozery, looks great

Comment on lines +37 to +44
for block_id in block_ids:
base_block_id = block_id * block_size_factor
for i in range(skip_count, block_size_factor):
output[output_idx] = base_block_id + i
output_idx += 1

# finished skipping
skip_count = 0
Member

Updating a tensor element-wise is actually very slow; I was thinking more like what you had before, but passing an out arg to torch.cat.

Alternatively could keep as numpy array (which I think block_ids should be already per other comment) ... which are much more efficient to manipulate.

By the way I don't think these specific optimizations should hold up getting the functionality in, can always be follow-on updates.

Collaborator Author

I switched to using numpy
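A vectorized version in the spirit of this change might look like the following. This is a sketch under assumptions: `expand_block_ids` as written here is an assumed helper shape, not the PR's exact code:

```python
import numpy as np

def expand_block_ids(block_ids: np.ndarray, block_size_factor: int,
                     out: np.ndarray, skip_first: int = 0) -> None:
    # Each block id b expands to [b*f, b*f + 1, ..., b*f + f - 1]; use
    # broadcasting instead of element-wise writes, then drop the skipped
    # prefix and write into the caller-provided output slice.
    expanded = (block_ids[:, None] * block_size_factor
                + np.arange(block_size_factor)).reshape(-1)
    out[:] = expanded[skip_first:]

# Preallocate the final (num_sub_blocks, 2) mapping and fill both columns,
# avoiding intermediate concatenated tensors.
src_to_dst = np.empty((11, 2), dtype=np.int64)
expand_block_ids(np.arange(11), 1, src_to_dst[:, 0])
expand_block_ids(np.array([0, 1, 3]), 4, src_to_dst[:, 1], skip_first=1)
print(src_to_dst[:, 1].tolist())
# → [1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
```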

@orozery orozery force-pushed the cpu-offloading-worker branch from 16a5ac9 to a427e75 Compare September 17, 2025 09:33
Comment on lines +143 to +155
for src_tensor, dst_tensor, kv_dim in zip(
src_tensors, dst_tensors, self.kv_dim_before_num_blocks):
if kv_dim:
src_key_cache = src_tensor[0]
dst_key_cache = dst_tensor[0]
ops.swap_blocks(src_key_cache, dst_key_cache,
src_to_dst_tensor)
src_value_cache = src_tensor[1]
dst_value_cache = dst_tensor[1]
ops.swap_blocks(src_value_cache, dst_value_cache,
src_to_dst_tensor)
else:
ops.swap_blocks(src_tensor, dst_tensor, src_to_dst_tensor)
Member

@njhill njhill Sep 17, 2025

Not for this PR but I think we could potentially abstract this in a follow-on via

def set_host_xfer_buffer_ops(self, copy_operation: CopyBlocksOp):

with #24690

@njhill njhill changed the title v1/offloading: Add worker-side CPU support [KV offload][3/N] Add worker-side CPU support Sep 18, 2025
This commit adds worker-side support for CPU offloading.
It uses the swap_blocks function to perform the actual copying between GPU and CPU.
Supports any CPU block size that is divisible by the GPU block size.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the cpu-offloading-worker branch from a427e75 to a9ceb96 Compare September 18, 2025 21:07
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 19, 2025
@njhill njhill enabled auto-merge (squash) September 19, 2025 02:18
Collaborator

@ApostaC ApostaC left a comment

LGTM!

@njhill njhill merged commit 7ac67ea into vllm-project:main Sep 19, 2025
51 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: charlifu <charlifu@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed v1


3 participants