[kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec #36610
Conversation
Code Review
This pull request extends OffloadingSpec to support multiple KV cache groups, which is a valuable enhancement for flexible memory management. The changes primarily adapt the offloading mechanism to handle a tuple of GPU block sizes instead of a single one. However, I've identified a critical logical error in vllm/v1/kv_offload/spec.py concerning the calculation of block_size_factor when a custom offloaded block size is specified. This bug could lead to an incorrect offloaded block size being used. My review includes a suggested fix for this issue. I also noted a design limitation that currently prevents using custom offloaded block sizes when KV groups have different GPU block sizes.
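For context, here is a minimal sketch of the relationship between a custom offloaded block size and `block_size_factor` that the review refers to; the names and arithmetic are illustrative assumptions, not the actual `vllm/v1/kv_offload/spec.py` code:

```python
# Illustrative sketch only; names are assumptions, not the actual
# vllm/v1/kv_offload/spec.py implementation.
gpu_block_size = 16        # tokens per GPU KV cache block
offloaded_block_size = 64  # custom size from kv_connector_extra_config

# An offloaded block must cover a whole number of GPU blocks, so the
# factor is the ratio of the two sizes, not a fixed default.
assert offloaded_block_size % gpu_block_size == 0
block_size_factor = offloaded_block_size // gpu_block_size  # -> 4
```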
Force-pushed from d2fdae9 to 8aee797 (Compare)
/gemini review
Code Review
This pull request extends OffloadingSpec to support multiple KV cache groups, which is a good step towards more flexible KV cache management. The changes primarily involve refactoring OffloadingSpec and updating its consumers to handle the new structure. Overall, the changes are well-structured for this refactoring. However, I've identified a critical inconsistency in vllm/v1/kv_offload/cpu.py where the offloaded block size is calculated differently in get_manager and get_handlers. This could lead to serious issues if hash_block_size and gpu_block_size are not identical. Please see the detailed comment.
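To illustrate the kind of drift the review warns about, here is a hypothetical sketch of the same quantity being derived two different ways; this is not the real `get_manager`/`get_handlers` code, just the pattern:

```python
# Hypothetical illustration of the inconsistency pattern flagged above;
# these are not the real get_manager/get_handlers implementations.
hash_block_size = 16
gpu_block_size = 32   # differs from hash_block_size
block_size_factor = 4

manager_block_size = hash_block_size * block_size_factor   # 64
handlers_block_size = gpu_block_size * block_size_factor   # 128

# If the two code paths compute the size independently, they silently
# disagree whenever hash_block_size != gpu_block_size.
if manager_block_size != handlers_block_size:
    print("mismatch:", manager_block_size, "vs", handlers_block_size)
```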
Force-pushed from 8aee797 to e21470b (Compare)
/gemini review
Code Review
This pull request extends OffloadingSpec to support multiple KV cache groups, each with its own block size. This is a preparatory step for more advanced offloading strategies. The changes introduce a distinction between the per-group gpu_block_size and a global hash_block_size. For now, functionality is limited to a single GPU block size across all groups via assertions, which is a reasonable approach for a multi-part feature. My review focuses on ensuring robustness against edge cases introduced by these changes. I've identified a potential issue with error handling when no KV cache groups are present, which could lead to confusing assertion failures.
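As a sketch of the restriction described above (illustrative, mirroring the assertion style rather than the exact vLLM code):

```python
# Illustrative sketch of the single-block-size restriction; not the
# exact assertions from the PR.
gpu_block_sizes = (16, 16, 16)  # one entry per KV cache group

# Guarding against an empty tuple avoids the confusing assertion
# failure the review mentions (e.g. a model with no KV cache).
assert len(gpu_block_sizes) > 0, "no KV cache groups"
assert all(s == gpu_block_sizes[0] for s in gpu_block_sizes), (
    "custom offloaded block sizes currently require all groups "
    "to share one GPU block size"
)
```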
Force-pushed from e21470b to a007a63 (Compare)
| "If 'block_size' is specified in kv_connector_extra_config, " | ||
| "there must be at least one KV cache group, " | ||
| "and all groups must have the same block size." |
nit:
| "If 'block_size' is specified in kv_connector_extra_config, " | |
| "there must be at least one KV cache group, " | |
| "and all groups must have the same block size." | |
| "If 'block_size' is specified in kv_connector_extra_config " | |
| "all groups must have the same block size." |
it should never be empty regardless
you're basically reverting gemini's suggestion :)
But in theory it can be empty if we're running a model without KV cache (encoder model?)
This commit extends OffloadingSpec to support multiple KV cache groups, each with its own possible block size.

We now distinguish between:
1. The block size of each KV cache group (determined by the KVCacheConfig).
2. The block size used by vLLM for hashing request tokens (determined by cache_config.block_size).

This will allow the offloading connector to correctly map request tokens to:
1. KVCacheBlocks (using the per-group block sizes from KVCacheConfig)
2. Request.block_hashes (using the hash block size cache_config.block_size)

For now, we keep the offloading connector using hash_block_size as the block size. Later on, we will modify the offloading connector to use the group-specific block sizes.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
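A minimal sketch of the two token-to-block mappings described in the commit message; the helper names are illustrative, not the connector's actual API:

```python
# Illustrative helpers only; the real connector code differs.
def gpu_block_index(token_idx: int, gpu_block_size: int) -> int:
    """Index of the KVCacheBlock holding this token, per KV cache group."""
    return token_idx // gpu_block_size

def hash_block_index(token_idx: int, hash_block_size: int) -> int:
    """Index into Request.block_hashes for this token."""
    return token_idx // hash_block_size

# Once groups may use block sizes different from cache_config.block_size,
# the two indices for the same token no longer coincide in general:
print(gpu_block_index(100, gpu_block_size=16))    # -> 6
print(hash_block_index(100, hash_block_size=32))  # -> 3
```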
Force-pushed from 96c433a to 6af4887 (Compare)
…llm-project#36610) Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: whycoming <120623296@qq.com>
…llm-project#36610) Signed-off-by: Or Ozeri <oro@il.ibm.com>
### What this PR does / why we need it?
1. fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](vllm-project/vllm#35122)
2. fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](vllm-project/vllm#36027)
3. fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory related torch.cuda APIs](vllm-project/vllm#37031)
4. fix [Support multiple KV groups in OffloadingSpec](vllm-project/vllm#36610): removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor.
5. fix [Consolidate SupportsEagle](vllm-project/vllm#36063): renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
E2E
- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8a68046

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
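A hedged sketch of the adaptation described in item 4, assuming the downstream spec mirrors the upstream field names (illustrative, not the actual vllm-ascend code):

```python
# Illustrative shim showing the scalar -> tuple migration from item 4;
# attribute names are assumptions based on the description above.
class OffloadingSpecShim:
    def __init__(self, gpu_block_sizes: tuple[int, ...],
                 offloaded_block_size: int):
        # Before: self.gpu_block_size was a single int and
        # self.offloaded_block_size was stored directly.
        # After: a tuple of per-group sizes plus a derived factor.
        self.gpu_block_size = gpu_block_sizes
        assert all(s == gpu_block_sizes[0] for s in gpu_block_sizes)
        self.block_size_factor = offloaded_block_size // gpu_block_sizes[0]
```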
This PR extends OffloadingSpec to support multiple KV cache groups, each with its own possible block size.

We now distinguish between:
1. The block size of each KV cache group (determined by the KVCacheConfig).
2. The block size used by vLLM for hashing request tokens (determined by cache_config.block_size).

This will allow the offloading connector to correctly map request tokens to:
1. KVCacheBlocks (using the per-group block sizes from KVCacheConfig)
2. Request.block_hashes (using the hash block size cache_config.block_size)

For now, we keep the offloading connector using hash_block_size as the block size. Later on, we will modify the offloading connector to use the group-specific block sizes.
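For completeness, an illustrative example of where the 'block_size' option discussed in the review would be supplied; the connector name and config shape are assumptions, not verified vLLM API:

```python
# Illustrative configuration only; field values are assumptions.
kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_connector_extra_config": {
        # Custom offloaded block size; per the review discussion, all
        # KV cache groups must currently share one GPU block size.
        "block_size": 64,
    },
}
```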