[KVConnector][Core] Support cross-layer KV blocks #27743

NickLucche merged 1 commit into vllm-project:main
Conversation
Code Review
This pull request introduces an optimization for KV cache transfers by allowing contiguous allocation of KV data across layers. The changes are well-structured, introducing a new allocation path in GPUModelRunner that is conditionally enabled for uniform models and specific connectors. The interface changes in attention backends are appropriate. The implementation includes a graceful fallback to the existing allocation mechanism, ensuring backward compatibility. I've identified a minor issue with incorrect comments in vllm/v1/attention/backends/flashinfer.py which could be misleading for future maintenance.
Quick question:

My initial thought is to check on
Connectors don't need to support the new layout if they don't want to.
ApostaC left a comment:
I like the high-level idea, but I think we need to discuss the design further.
Also, the correctness of this PR is not tested. Will the new code have any correctness or performance impact on the underlying attention modules?
vllm/v1/worker/gpu_model_runner.py (outdated)

    buffer = (
        torch.zeros(total_size, dtype=torch.int8, device=self.device)
        .view(kv_cache_spec.dtype)
        .view(kv_cache_shape)
    ).permute(*inv_order)
By calling view + permute, we basically create a non-contiguous view of the raw KV cache buffer. This may have an unexpected impact (both performance and correctness) on other modules.
For example, calling reshape on a non-contiguous tensor may introduce an extra memory copy. Therefore, this code adds implicit pitfalls for all the other logic (either first-party or third-party) that may need to do a reshape.
Quick code snippet:

    import torch
    x = torch.randn((2, 3, 5))
    y = x.permute(2, 0, 1)
    z = y.reshape(30)
    print("X -- is contiguous:", x.is_contiguous(), "\tdata_ptr:", x.data_ptr())
    print("Y -- is contiguous:", y.is_contiguous(), "\tdata_ptr:", y.data_ptr())
    print("Z -- is contiguous:", z.is_contiguous(), "\tdata_ptr:", z.data_ptr())

And the output:

    X -- is contiguous: True     data_ptr: 1076219648
    Y -- is contiguous: False    data_ptr: 1076219648
    Z -- is contiguous: True     data_ptr: 986565184    # Y is implicitly copied into Z.
Correct.
This is already done (making a non-contiguous view) when using the HND layout; note that a permute flow already exists today in _reshape_kv_cache_tensors.
Note that this layout is only applied when three parties agree:
- The KV cache spec (no HMA)
- The KV connector (prefer_cross_layer_blocks)
- The attention backend (get_kv_cache_stride_order)
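To make the cross-layer layout concrete, here is a minimal sketch (the dimensions and variable names are illustrative, not the actual vLLM config values) of how a single buffer can keep each block contiguous across all layers while still presenting per-layer permuted views to attention backends:

```python
import torch

# Illustrative dimensions (not the real vLLM values).
num_blocks, num_layers, block_size, num_kv_heads, head_size = 4, 2, 16, 8, 64

# Cross-layer layout: all layers' K/V for a given block are contiguous.
buffer = torch.zeros(
    (num_blocks, num_layers, 2, block_size, num_kv_heads, head_size),
    dtype=torch.float16,
)

# Per-layer view in the shape an attention backend might expect:
# (2, num_blocks, block_size, num_kv_heads, head_size) for each layer.
per_layer = [buffer[:, i].permute(1, 0, 2, 3, 4) for i in range(num_layers)]

# The views alias the same storage: no data is copied.
assert per_layer[0].data_ptr() == buffer.data_ptr()

# One block's KV across all layers is a single contiguous slice,
# which is what makes a single-copy block transfer possible.
assert buffer[0].is_contiguous()
```

Writing through a per-layer view lands in the shared buffer, which is exactly the aliasing (and the non-contiguity) the discussion above is about.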
vllm/v1/worker/gpu_model_runner.py (outdated)

    kv_caches = {}
    for i, kv_cache_tensor in enumerate(kv_cache_config.kv_cache_tensors):
        tensor = buffer[i]
        for layer_name in kv_cache_tensor.shared_by:
            kv_caches[layer_name] = tensor

    return kv_caches
Although I like the high-level idea, I strongly feel we should expose the raw KV cache tensor directly to the connector rather than the permuted KV caches dictionary.
Otherwise, it's very hard for the connector to know what the layout was before the permutation, and this may introduce significant debugging pain, especially when we need to write C/CUDA code that directly uses the raw pointers.
The current API exposes dict[str, torch.Tensor] to the connector (via register_kv_caches).
The tensors themselves could be permuted.
The physical layout of the tensors can be self-determined using PyTorch APIs.
But I agree this is somewhat "hidden".
If we want to be more explicit, we can add an auxiliary variable to register_kv_caches that will hold the physical layout spec for the tensors. What do you think?
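As a sketch of what "self-determined using PyTorch APIs" could look like (the helper name infer_stride_order is illustrative, not part of the vLLM API), a connector can recover the physical layout of a registered tensor from its strides:

```python
import torch

def infer_stride_order(t: torch.Tensor) -> tuple[int, ...]:
    """Return logical dims ordered outermost-to-innermost in memory.

    Illustrative helper: sorting dimensions by stride (descending)
    tells a connector how the tensor is physically laid out,
    regardless of any permute applied to the view.
    """
    return tuple(sorted(range(t.dim()), key=lambda d: t.stride(d), reverse=True))

# A contiguous tensor has the identity order...
x = torch.zeros(2, 3, 5)
print(infer_stride_order(x))   # (0, 1, 2)

# ...while a permuted view reveals the underlying physical layout.
y = x.permute(2, 0, 1)         # logical shape (5, 2, 3), same storage
print(infer_stride_order(y))   # (1, 2, 0)
```

An explicit layout argument to register_kv_caches would make the same information available without this kind of stride inspection.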
I like the idea of adding the auxiliary variable to the register_kv_caches function!

Good point.
I think lm_eval can be used for correctness benchmarks.
Signed-off-by: Or Ozeri <oro@il.ibm.com>
…che support

Enables KVBM to correctly detect and configure FullyContiguous layout when vLLM uses cross-layer KV cache blocks (vllm-project/vllm#27743).

Changes:
- Add LayoutType::auto_detect() to detect FullyContiguous vs LayerSeparate based on tensor count and shape pattern
- Update worker auto-detection to use the new auto_detect() function
- Export the PyLayoutType enum to Python for explicit layout configuration
- Add layout detection in the Python wrapper to pass the layout type explicitly

Previously, KVBM always auto-detected LayerSeparate even when vLLM provided FullyContiguous tensors, causing incorrect memory access patterns during block transfers. This fix ensures proper layout configuration for accurate performance benchmarking of the 4x transfer speedup improvement.

Related: vllm-project/vllm#27742, vllm-project/vllm#27743
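The detection heuristic described in the commit message above (FullyContiguous vs LayerSeparate, "based on tensor count and shape pattern") could be sketched as follows; the names LayoutType and detect_layout and the shape convention are illustrative, not the actual KVBM (Rust) implementation:

```python
from enum import Enum, auto

class LayoutType(Enum):
    FULLY_CONTIGUOUS = auto()  # one buffer; each block contiguous across layers
    LAYER_SEPARATE = auto()    # one KV tensor per layer (classic layout)

def detect_layout(num_tensors: int,
                  first_shape: tuple[int, ...],
                  num_layers: int) -> LayoutType:
    """Illustrative heuristic: a single registered tensor whose second
    dimension equals the model's layer count suggests the cross-layer
    (fully contiguous) layout; otherwise assume layer-separate."""
    if num_tensors == 1 and len(first_shape) >= 2 and first_shape[1] == num_layers:
        return LayoutType.FULLY_CONTIGUOUS
    return LayoutType.LAYER_SEPARATE

# Cross-layer: one tensor shaped (num_blocks, num_layers, 2, ...).
assert detect_layout(1, (1024, 32, 2, 16, 8, 128), 32) is LayoutType.FULLY_CONTIGUOUS
# Classic: one tensor per layer, shaped (2, num_blocks, ...).
assert detect_layout(32, (2, 1024, 16, 8, 128), 32) is LayoutType.LAYER_SEPARATE
```

An explicit layout flag (as the commit's exported PyLayoutType provides) is more robust than shape-based guessing, since the heuristic can misfire when dimensions coincide.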
### What this PR does / why we need it?
Last month the interface of `OffloadingSpec` changed (vllm-project/vllm#27743). This PR fixes this bug and adds an e2e test for CPU offloading.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI passed with the newly added test.
- vLLM version: release/v0.13.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: whx-sjtu <2952154980@qq.com>
@orozery Could I ask what performance impact the contiguous KV block (layer-wise and K&V-wise) will have on the main GPU computation (e.g., attention) when KV offload is not used? I was wondering why this isn't the default in the first place; was it purely coding convenience?

From my experiments, it does not affect compute.
I also asked myself the same question :)
Core of RFC #27742.
Following this PR, connectors can opt in to and adapt to the new layout.
This PR enables the GPU model runner to allocate the KV cache tensors so that the KV data for all layers is contiguous per block. This can yield a significant speedup in KV transfer time (e.g., 4x), such as when using NixlConnector or OffloadingConnector. Currently, the new layout is disabled by default, and will only be enabled when using a connector that explicitly prefers it. Also, the new layout is currently only supported for uniform (non-HMA) models.
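A sketch of why the cross-layer layout speeds up transfers (sizes and names are illustrative): with the classic per-layer layout, offloading one block takes one copy per layer, while the cross-layer layout moves the same data in a single contiguous copy:

```python
import torch

num_layers, block_numel = 32, 16 * 8 * 128  # illustrative sizes

# Classic layout: one KV tensor per layer; one block's data is scattered.
per_layer = [torch.randn(block_numel) for _ in range(num_layers)]

# Cross-layer layout: the same block spans all layers contiguously.
cross_layer = torch.cat(per_layer)

# Offloading one block, classic layout: num_layers separate copies.
dst_scattered = [torch.empty(block_numel) for _ in range(num_layers)]
for i in range(num_layers):
    dst_scattered[i].copy_(per_layer[i])

# Offloading one block, cross-layer layout: a single large copy.
dst_contig = torch.empty(num_layers * block_numel)
dst_contig.copy_(cross_layer)

# Both paths move identical data; issuing one transfer instead of
# num_layers smaller ones is where the reported speedup comes from.
assert torch.equal(dst_contig, torch.cat(dst_scattered))
```

Fewer, larger transfers amortize per-transfer overhead, which matters most for RDMA/NIXL-style paths where each descriptor has a fixed cost.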