[KVConnector][Core] Support cross-layer KV blocks #27743

NickLucche merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces an optimization for KV cache transfers by allowing contiguous allocation of KV data across layers. The changes are well-structured, introducing a new allocation path in GPUModelRunner that is conditionally enabled for uniform models and specific connectors. The interface changes in attention backends are appropriate. The implementation includes a graceful fallback to the existing allocation mechanism, ensuring backward compatibility. I've identified a minor issue with incorrect comments in vllm/v1/attention/backends/flashinfer.py which could be misleading for future maintenance.
Quick question:

My initial thought is to check on
Connectors don't need to support the new layout if they don't want to.
ApostaC left a comment:
I like the high-level idea, but I think we need to discuss the design further.
Also, the correctness of this PR is not tested. Will the new code have any correctness or performance impact on the underlying attention modules?
vllm/v1/worker/gpu_model_runner.py (outdated)

    buffer = (
        torch.zeros(total_size, dtype=torch.int8, device=self.device)
        .view(kv_cache_spec.dtype)
        .view(kv_cache_shape)
    ).permute(*inv_order)
By calling view + permute, we basically create a non-contiguous view of the raw KV cache buffer. This may have an unexpected impact (both performance and correctness) on other modules.
For example, calling reshape on a non-contiguous tensor may introduce an extra memory copy. Therefore, this code adds implicit pitfalls for all the other logic (either first-party or third-party) that may need to do a reshape.
Quick code snippet:

    import torch
    x = torch.randn((2, 3, 5))
    y = x.permute(2, 0, 1)
    z = y.reshape(30)
    print("X -- is contiguous:", x.is_contiguous(), "\tdata_ptr:", x.data_ptr())
    print("Y -- is contiguous:", y.is_contiguous(), "\tdata_ptr:", y.data_ptr())
    print("Z -- is contiguous:", z.is_contiguous(), "\tdata_ptr:", z.data_ptr())

And the output:

    X -- is contiguous: True     data_ptr: 1076219648
    Y -- is contiguous: False    data_ptr: 1076219648
    Z -- is contiguous: True     data_ptr: 986565184    # Y is implicitly copied into Z.
Correct.
This is already done (making a non-contiguous view) when using the HND layout; note that a permute flow already exists today in _reshape_kv_cache_tensors.
Note that this layout is only applied when three parties agree:
- The KV cache spec (no HMA)
- The KV connector (prefer_cross_layer_blocks)
- The attention backend (get_kv_cache_stride_order)
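To make the cross-layer layout concrete, here is a minimal sketch (the dimensions and variable names are illustrative, not the actual vLLM config values) of how a single buffer can keep each block contiguous across all layers while still presenting per-layer permuted views to attention backends:

```python
import torch

# Illustrative dimensions (not the real vLLM values).
num_blocks, num_layers, block_size, num_kv_heads, head_size = 4, 2, 16, 8, 64

# Cross-layer layout: all layers' K/V for a given block are contiguous.
buffer = torch.zeros(
    (num_blocks, num_layers, 2, block_size, num_kv_heads, head_size),
    dtype=torch.float16,
)

# Per-layer view in the shape an attention backend might expect:
# (2, num_blocks, block_size, num_kv_heads, head_size) for each layer.
per_layer = [buffer[:, i].permute(1, 0, 2, 3, 4) for i in range(num_layers)]

# The views alias the same storage: no data is copied.
assert per_layer[0].data_ptr() == buffer.data_ptr()

# One block's KV across all layers is a single contiguous slice,
# which is what makes a single-copy block transfer possible.
assert buffer[0].is_contiguous()
```

Writing through a per-layer view lands in the shared buffer, which is exactly the aliasing (and the non-contiguity) the discussion above is about.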
vllm/v1/worker/gpu_model_runner.py (outdated)

    kv_caches = {}
    for i, kv_cache_tensor in enumerate(kv_cache_config.kv_cache_tensors):
        tensor = buffer[i]
        for layer_name in kv_cache_tensor.shared_by:
            kv_caches[layer_name] = tensor

    return kv_caches
Although I like the high-level idea, I strongly feel we should expose the raw KV cache tensor directly to the connector rather than the permuted KV caches dictionary.
Otherwise, it's very hard for the connector to know what the layout was before the permutation, and this may introduce significant debugging pain, especially when we need to write C/CUDA code that directly uses the raw pointers.
The current API exposes dict[str, torch.Tensor] to the connector (via register_kv_caches).
The tensors themselves could be permuted.
The physical layout of the tensors can be self-determined using PyTorch APIs.
But I agree this is somewhat "hidden".
If we want to be more explicit, we can add an auxiliary variable to register_kv_caches that will hold the physical layout spec for the tensors. What do you think?
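As a sketch of what "self-determined using PyTorch APIs" could look like (the helper name infer_stride_order is illustrative, not part of the vLLM API), a connector can recover the physical layout of a registered tensor from its strides:

```python
import torch

def infer_stride_order(t: torch.Tensor) -> tuple[int, ...]:
    """Return logical dims ordered outermost-to-innermost in memory.

    Illustrative helper: sorting dimensions by stride (descending)
    tells a connector how the tensor is physically laid out,
    regardless of any permute applied to the view.
    """
    return tuple(sorted(range(t.dim()), key=lambda d: t.stride(d), reverse=True))

# A contiguous tensor has the identity order...
x = torch.zeros(2, 3, 5)
print(infer_stride_order(x))   # (0, 1, 2)

# ...while a permuted view reveals the underlying physical layout.
y = x.permute(2, 0, 1)         # logical shape (5, 2, 3), same storage
print(infer_stride_order(y))   # (1, 2, 0)
```

An explicit layout argument to register_kv_caches would make the same information available without this kind of stride inspection.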
I like the idea of adding the auxiliary variable to the register_kv_caches function!

Good point.
I think lm_eval can be used for correctness benchmarks.
Signed-off-by: Or Ozeri <oro@il.ibm.com>
…che support

Enables KVBM to correctly detect and configure FullyContiguous layout when vLLM uses cross-layer KV cache blocks (vllm-project/vllm#27743).

Changes:
- Add LayoutType::auto_detect() to detect FullyContiguous vs LayerSeparate based on tensor count and shape pattern
- Update worker auto-detection to use the new auto_detect() function
- Export the PyLayoutType enum to Python for explicit layout configuration
- Add layout detection in the Python wrapper to pass the layout type explicitly

Previously, KVBM always auto-detected LayerSeparate even when vLLM provided FullyContiguous tensors, causing incorrect memory access patterns during block transfers. This fix ensures proper layout configuration for accurate performance benchmarking of the 4x transfer speedup improvement.

Related: vllm-project/vllm#27742, vllm-project/vllm#27743
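The detection heuristic described in the commit message above (FullyContiguous vs LayerSeparate, "based on tensor count and shape pattern") could be sketched as follows; the names LayoutType and detect_layout and the shape convention are illustrative, not the actual KVBM (Rust) implementation:

```python
from enum import Enum, auto

class LayoutType(Enum):
    FULLY_CONTIGUOUS = auto()  # one buffer; each block contiguous across layers
    LAYER_SEPARATE = auto()    # one KV tensor per layer (classic layout)

def detect_layout(num_tensors: int,
                  first_shape: tuple[int, ...],
                  num_layers: int) -> LayoutType:
    """Illustrative heuristic: a single registered tensor whose second
    dimension equals the model's layer count suggests the cross-layer
    (fully contiguous) layout; otherwise assume layer-separate."""
    if num_tensors == 1 and len(first_shape) >= 2 and first_shape[1] == num_layers:
        return LayoutType.FULLY_CONTIGUOUS
    return LayoutType.LAYER_SEPARATE

# Cross-layer: one tensor shaped (num_blocks, num_layers, 2, ...).
assert detect_layout(1, (1024, 32, 2, 16, 8, 128), 32) is LayoutType.FULLY_CONTIGUOUS
# Classic: one tensor per layer, shaped (2, num_blocks, ...).
assert detect_layout(32, (2, 1024, 16, 8, 128), 32) is LayoutType.LAYER_SEPARATE
```

An explicit layout flag (as the commit's exported PyLayoutType provides) is more robust than shape-based guessing, since the heuristic can misfire when dimensions coincide.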
### What this PR does / why we need it?
Last month the interface of `OffloadingSpec` changed (vllm-project/vllm#27743). This PR fixes this bug and adds an e2e test for CPU offloading.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with the newly added test.
- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: whx-sjtu <2952154980@qq.com>
@orozery Could I ask what performance impact the contiguous KV block (layer-wise and K&V-wise) will have on the main GPU computation (e.g., attention) when KV offload is not used? I was wondering why this isn't the default in the first place; was it purely coding convenience?

From my experiments, it does not affect compute.
I also asked myself the same question :)
Core of RFC #27742.
Following this PR, connectors can opt in to and adapt to the new layout.
This PR enables the GPU model runner to allocate the KV cache tensors so that the KV data for all layers is contiguous per block. This can yield a significant speedup in KV transfer time (e.g., 4x), such as when using NixlConnector or OffloadingConnector. Currently, the new layout is disabled by default, and will only be enabled when using a connector that explicitly prefers it. Also, the new layout is currently only supported for uniform (non-HMA) models.
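A sketch of why the cross-layer layout speeds up transfers (sizes and names are illustrative): with the classic per-layer layout, offloading one block takes one copy per layer, while the cross-layer layout moves the same data in a single contiguous copy:

```python
import torch

num_layers, block_numel = 32, 16 * 8 * 128  # illustrative sizes

# Classic layout: one KV tensor per layer; one block's data is scattered.
per_layer = [torch.randn(block_numel) for _ in range(num_layers)]

# Cross-layer layout: the same block spans all layers contiguously.
cross_layer = torch.cat(per_layer)

# Offloading one block, classic layout: num_layers separate copies.
dst_scattered = [torch.empty(block_numel) for _ in range(num_layers)]
for i in range(num_layers):
    dst_scattered[i].copy_(per_layer[i])

# Offloading one block, cross-layer layout: a single large copy.
dst_contig = torch.empty(num_layers * block_numel)
dst_contig.copy_(cross_layer)

# Both paths move identical data; issuing one transfer instead of
# num_layers smaller ones is where the reported speedup comes from.
assert torch.equal(dst_contig, torch.cat(dst_scattered))
```

Fewer, larger transfers amortize per-transfer overhead, which matters most for RDMA/NIXL-style paths where each descriptor has a fixed cost.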