
Canonical KV cache allocation + offloading integration #1

Open

Etelis wants to merge 17 commits into orozery:kv-offload-hma from Etelis:canonical-on-offload-hma

Conversation


@Etelis Etelis commented Apr 12, 2026

Summary

  • Cherry-picks all commits from vllm-project/vllm#37885 (canonical KV cache allocation for HMA models)
  • Adds initialize_worker_connector to OffloadingConnector and OffloadingConnectorWorker to bridge canonical KV caches with the offloading pipeline (sketched below)
  • Converts base.py CanonicalKVCaches to spec.py format and registers handlers
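
A minimal sketch of what such a bridge could look like. Only the class and method names come from this PR; the CanonicalKVCaches fields, the spec.py conversion, and the handler registration are simplified stand-ins, not the actual vLLM API:

```python
from dataclasses import dataclass

import torch


@dataclass
class CanonicalKVCaches:  # simplified stand-in for the base.py type
    tensors: dict[str, torch.Tensor]  # per-layer canonical cache tensors


class OffloadingConnectorWorker:
    def __init__(self) -> None:
        self._handlers: dict[str, torch.Tensor] = {}

    def initialize_worker_connector(self, caches: CanonicalKVCaches) -> None:
        # Convert each base.py entry into a spec.py-style per-layer view
        # and register a handler for it (registration is stubbed here).
        for layer_name, tensor in caches.tensors.items():
            self._handlers[layer_name] = tensor


class OffloadingConnector:
    def __init__(self) -> None:
        self.connector_worker = OffloadingConnectorWorker()

    def initialize_worker_connector(self, caches: CanonicalKVCaches) -> None:
        # Thin delegation: the connector exposes the hook, the worker
        # performs the conversion and registration.
        self.connector_worker.initialize_worker_connector(caches)
```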

…nector

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Combine the kv_caches population, block tensor splitting, and
layer-to-position mapping into a single pass over positions.
Remove the unique kernel block size assertion.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
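
A sketch of the single-pass shape this commit describes; the buffer layout and all names (contiguous_buffer, layer_names, the mapping dict) are assumptions for illustration:

```python
import torch

layer_names = ["model.layers.0.attn", "model.layers.1.attn"]  # illustrative
num_blocks, page_size = 1024, 4096                            # illustrative
contiguous_buffer = torch.empty(
    len(layer_names), num_blocks, page_size, dtype=torch.int8)

# One pass over positions does all three jobs that were previously
# separate steps: populate kv_caches, slice out the per-position block
# tensor, and record the layer-to-position mapping.
kv_caches: dict[str, torch.Tensor] = {}
layer_to_position: dict[str, int] = {}
for pos, layer_name in enumerate(layer_names):
    kv_caches[layer_name] = contiguous_buffer[pos]  # per-position block tensor
    layer_to_position[layer_name] = pos
```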
Call initialize_worker_connector unconditionally so connectors like
CacheBlend can use it regardless of the allocation path taken.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
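
The control-flow change might look like the following sketch. use_canonical_kv_caches and allocate_canonical_kv_caches are named in this PR's commits; the uniform-path helper and all signatures are assumptions:

```python
def bind_kv_caches(kv_cache_config, connector):
    if use_canonical_kv_caches(kv_cache_config):
        kv_caches = allocate_canonical_kv_caches(kv_cache_config)
    else:
        kv_caches = allocate_uniform_kv_caches(kv_cache_config)  # assumed helper
    # Called on both paths, not just the canonical one, so connectors
    # such as CacheBlend always receive the caches.
    connector.initialize_worker_connector(kv_caches)
    return kv_caches
```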
Move the uniform tensor size check into use_canonical_kv_caches
so the precondition is validated before entering the allocation
path, keeping the assert in allocate_canonical_kv_caches as a
safety net.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
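
In sketch form, the precondition gates entry to the path while the assert stays behind as a safety net; the config field name here is an assumption:

```python
def use_canonical_kv_caches(kv_cache_config) -> bool:
    # Precondition validated before entering the allocation path.
    sizes = {t.size for t in kv_cache_config.kv_cache_tensors}  # assumed field
    return len(sizes) == 1


def allocate_canonical_kv_caches(kv_cache_config):
    # Safety net: re-assert the invariant the caller already checked.
    sizes = {t.size for t in kv_cache_config.kv_cache_tensors}
    assert len(sizes) == 1, "canonical allocation requires a uniform tensor size"
    ...
```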
Use contiguous_buffer.select(group_dim, i) to obtain per-position
canonical block tensors where num_blocks is always the leading
dimension. This eliminates the block_dim splitting loop and
multi-dimensional index arithmetic.

Also strengthen use_canonical_kv_caches to explicitly verify
num_blocks is the leading physical dimension, and restore the
single-group rejection (< 2) so single-group models correctly
use the uniform path.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
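
torch.Tensor.select(dim, index) returns a view along a single dimension, which is what makes this work; the buffer shape below is an illustrative assumption:

```python
import torch

num_positions, num_blocks, page_size = 4, 1024, 4096  # illustrative
group_dim = 0
contiguous_buffer = torch.empty(
    num_positions, num_blocks, page_size, dtype=torch.int8)

for i in range(num_positions):
    # A view, not a copy: one call replaces the block_dim splitting loop
    # and the multi-dimensional index arithmetic.
    block_tensor = contiguous_buffer.select(group_dim, i)
    # num_blocks stays the leading physical dimension, which the
    # strengthened use_canonical_kv_caches check verifies.
    assert block_tensor.shape[0] == num_blocks
```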
Replace per-position KVCacheBlockTensor objects with a single
(num_blocks, cross_layer_page_size) int8 tensor. This avoids
recomputing block tensors per position and matches the pattern
used by the offloading connector's register_cross_layers_kv_cache.

Also use per-group kernel_block_sizes[gid] inside the loop instead
of hardcoded kernel_block_sizes[0].

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
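
A sketch of the replacement layout (sizes illustrative): one shared int8 tensor instead of per-position objects, with each group reading its own kernel block size:

```python
import torch

num_blocks, cross_layer_page_size = 1024, 8192  # illustrative
# One tensor shared across positions, matching the shape this commit says
# register_cross_layers_kv_cache expects.
cross_layer_kv_cache = torch.empty(
    num_blocks, cross_layer_page_size, dtype=torch.int8)

kernel_block_sizes = [16, 16, 32]  # illustrative per-group values
for gid in range(len(kernel_block_sizes)):
    block_size = kernel_block_sizes[gid]  # per-group, not hardcoded [0]
    print(f"group {gid}: kernel block size {block_size}")
```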
Add initialize_worker_connector to OffloadingConnector and
OffloadingConnectorWorker to bridge the canonical KV cache
allocation with the offloading pipeline. Converts base.py
CanonicalKVCaches to spec.py format and registers handlers.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis requested a review from orozery as a code owner April 12, 2026 15:31
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀
