Canonical KV cache allocation + offloading integration #1
Etelis wants to merge 17 commits into orozery:kv-offload-hma
Conversation
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
…nector Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Combine the kv_caches population, block tensor splitting, and layer-to-position mapping into a single pass over positions. Remove the unique kernel block size assertion. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
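The single-pass idea in this commit can be sketched as follows. All names and shapes below are illustrative stand-ins, not the actual vLLM code: one loop over positions fills kv_caches, splits the shared buffer into per-position block tensors, and records the layer-to-position mapping, where separate passes did each job before.

```python
# Hypothetical stand-ins: two layers and a flat shared buffer split
# into equal per-position chunks. The real code operates on tensors.
layers = ["model.layers.0", "model.layers.1"]
shared_buffer = list(range(8))
chunk_size = len(shared_buffer) // len(layers)

kv_caches = {}          # layer name -> its slice of the buffer
block_tensors = []      # per-position block tensors
layer_to_position = {}  # layer name -> position index

# A single pass over positions does all three jobs at once.
for pos, layer_name in enumerate(layers):
    chunk = shared_buffer[pos * chunk_size:(pos + 1) * chunk_size]
    block_tensors.append(chunk)
    kv_caches[layer_name] = chunk
    layer_to_position[layer_name] = pos
```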
Call initialize_worker_connector unconditionally so connectors like CacheBlend can use it regardless of the allocation path taken. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Move the uniform tensor size check into use_canonical_kv_caches so the precondition is validated before entering the allocation path, keeping the assert in allocate_canonical_kv_caches as a safety net. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
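The validate-early, assert-late pattern described in this commit looks roughly like the sketch below. The signatures and bodies are invented for illustration; the real functions take vLLM config objects.

```python
def use_canonical_kv_caches(tensor_sizes: list) -> bool:
    """Gate the canonical allocation path on its preconditions."""
    # Precondition validated before entering the allocation path:
    # every group's tensor size must be identical.
    return len(tensor_sizes) > 0 and len(set(tensor_sizes)) == 1


def allocate_canonical_kv_caches(tensor_sizes: list) -> int:
    # Safety-net assert: callers should already have passed through
    # use_canonical_kv_caches, so this should never fire in practice.
    assert len(set(tensor_sizes)) == 1, "non-uniform KV cache tensor sizes"
    return tensor_sizes[0]
```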
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Use contiguous_buffer.select(group_dim, i) to obtain per-position canonical block tensors where num_blocks is always the leading dimension. This eliminates the block_dim splitting loop and multi-dimensional index arithmetic. Also strengthen use_canonical_kv_caches to explicitly verify num_blocks is the leading physical dimension, and restore the single-group rejection (< 2) so single-group models correctly use the uniform path. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
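For reference, torch.Tensor.select(dim, index) returns a view with the indexed dimension removed, which is why selecting along the group dimension leaves num_blocks as the leading dimension of each per-position tensor. A toy illustration (the sizes and variable names are made up):

```python
import torch

# Made-up sizes: 4 blocks, 3 positions in the group, 8-byte pages.
num_blocks, num_positions, page_size = 4, 3, 8
contiguous_buffer = torch.zeros(
    num_blocks, num_positions, page_size, dtype=torch.int8
)

group_dim = 1  # the per-position dimension of the shared buffer
per_position = [
    contiguous_buffer.select(group_dim, i) for i in range(num_positions)
]

# select() drops the indexed dimension, so num_blocks stays leading,
# and each result is a view sharing storage with the buffer.
assert all(t.shape == (num_blocks, page_size) for t in per_position)
per_position[0].fill_(1)
assert contiguous_buffer[:, 0, :].eq(1).all()
```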
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Replace per-position KVCacheBlockTensor objects with a single (num_blocks, cross_layer_page_size) int8 tensor. This avoids recomputing block tensors per position and matches the pattern used by the offloading connector's register_cross_layers_kv_cache. Also use per-group kernel_block_sizes[gid] inside the loop instead of hardcoded kernel_block_sizes[0]. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
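Roughly what the flat layout looks like, with invented sizes (cross_layer_page_size standing for the byte footprint of one block across all layers):

```python
import torch

num_blocks, cross_layer_page_size = 4, 64

# One (num_blocks, cross_layer_page_size) int8 tensor replaces the
# per-position KVCacheBlockTensor objects: each row holds one block's
# bytes across all layers.
cross_layer_kv_cache = torch.zeros(
    num_blocks, cross_layer_page_size, dtype=torch.int8
)

# Per-group kernel block sizes are indexed by group id inside the
# loop rather than hardcoded to the first group.
kernel_block_sizes = [16, 32]  # hypothetical per-group values
selected = [kernel_block_sizes[gid] for gid in range(len(kernel_block_sizes))]
```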
Add initialize_worker_connector to OffloadingConnector and OffloadingConnectorWorker to bridge the canonical KV cache allocation with the offloading pipeline. Converts base.py CanonicalKVCaches to spec.py format and registers handlers. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Summary
- Add initialize_worker_connector to OffloadingConnector and OffloadingConnectorWorker to bridge canonical KV caches with the offloading pipeline
- Converts base.py CanonicalKVCaches to spec.py format and registers handlers