[kv_offload+HMA][1/N]: Support multiple KV groups in OffloadingSpec #36610
Conversation
Code Review
This pull request extends OffloadingSpec to support multiple KV cache groups, which is a valuable enhancement for flexible memory management. The changes primarily adapt the offloading mechanism to handle a tuple of GPU block sizes instead of a single one. However, I've identified a critical logical error in vllm/v1/kv_offload/spec.py concerning the calculation of block_size_factor when a custom offloaded block size is specified. This bug could lead to an incorrect offloaded block size being used. My review includes a suggested fix for this issue. I also noted a design limitation that currently prevents using custom offloaded block sizes when KV groups have different GPU block sizes.
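For context, here is a minimal sketch of the relationship between a custom offloaded block size and `block_size_factor` that the review refers to; the names and arithmetic are illustrative assumptions, not the actual `vllm/v1/kv_offload/spec.py` code:

```python
# Illustrative sketch only; names are assumptions, not the actual
# vllm/v1/kv_offload/spec.py implementation.
gpu_block_size = 16        # tokens per GPU KV cache block
offloaded_block_size = 64  # custom size from kv_connector_extra_config

# An offloaded block must cover a whole number of GPU blocks, so the
# factor is the ratio of the two sizes, not a fixed default.
assert offloaded_block_size % gpu_block_size == 0
block_size_factor = offloaded_block_size // gpu_block_size  # -> 4
```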
Force-pushed from d2fdae9 to 8aee797 (Compare)
/gemini review
Code Review
This pull request extends OffloadingSpec to support multiple KV cache groups, which is a good step towards more flexible KV cache management. The changes primarily involve refactoring OffloadingSpec and updating its consumers to handle the new structure. Overall, the changes are well-structured for this refactoring. However, I've identified a critical inconsistency in vllm/v1/kv_offload/cpu.py where the offloaded block size is calculated differently in get_manager and get_handlers. This could lead to serious issues if hash_block_size and gpu_block_size are not identical. Please see the detailed comment.
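To illustrate the kind of drift the review warns about, here is a hypothetical sketch of the same quantity being derived two different ways; this is not the real `get_manager`/`get_handlers` code, just the pattern:

```python
# Hypothetical illustration of the inconsistency pattern flagged above;
# these are not the real get_manager/get_handlers implementations.
hash_block_size = 16
gpu_block_size = 32   # differs from hash_block_size
block_size_factor = 4

manager_block_size = hash_block_size * block_size_factor   # 64
handlers_block_size = gpu_block_size * block_size_factor   # 128

# If the two code paths compute the size independently, they silently
# disagree whenever hash_block_size != gpu_block_size.
if manager_block_size != handlers_block_size:
    print("mismatch:", manager_block_size, "vs", handlers_block_size)
```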
Force-pushed from 8aee797 to e21470b (Compare)
/gemini review
Code Review
This pull request extends OffloadingSpec to support multiple KV cache groups, each with its own block size. This is a preparatory step for more advanced offloading strategies. The changes introduce a distinction between the per-group gpu_block_size and a global hash_block_size. For now, functionality is limited to a single GPU block size across all groups via assertions, which is a reasonable approach for a multi-part feature. My review focuses on ensuring robustness against edge cases introduced by these changes. I've identified a potential issue with error handling when no KV cache groups are present, which could lead to confusing assertion failures.
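As a sketch of the restriction described above (illustrative, mirroring the assertion style rather than the exact vLLM code):

```python
# Illustrative sketch of the single-block-size restriction; not the
# exact assertions from the PR.
gpu_block_sizes = (16, 16, 16)  # one entry per KV cache group

# Guarding against an empty tuple avoids the confusing assertion
# failure the review mentions (e.g. a model with no KV cache).
assert len(gpu_block_sizes) > 0, "no KV cache groups"
assert all(s == gpu_block_sizes[0] for s in gpu_block_sizes), (
    "custom offloaded block sizes currently require all groups "
    "to share one GPU block size"
)
```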
Force-pushed from e21470b to a007a63 (Compare)
| "If 'block_size' is specified in kv_connector_extra_config, " | ||
| "there must be at least one KV cache group, " | ||
| "and all groups must have the same block size." |
nit:
| "If 'block_size' is specified in kv_connector_extra_config, " | |
| "there must be at least one KV cache group, " | |
| "and all groups must have the same block size." | |
| "If 'block_size' is specified in kv_connector_extra_config " | |
| "all groups must have the same block size." |
it should never be empty regardless
you're basically reverting gemini's suggestion :)
But in theory it can be empty if we're running a model without KV cache (encoder model?)
This commit extends OffloadingSpec to support multiple KV cache groups, each with its own possible block size.

We now distinguish between:
1. The block size of each KV cache group (determined by the KVCacheConfig).
2. The block size used by vLLM for hashing request tokens (determined by cache_config.block_size).

This will allow the offloading connector to correctly map request tokens to:
1. KVCacheBlocks (using the per-group block sizes from KVCacheConfig)
2. Request.block_hashes (using the hash block size cache_config.block_size)

For now, we keep the offloading connector using hash_block_size as the block size. Later on, we will modify the offloading connector to use the group-specific block sizes.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
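A minimal sketch of the two token-to-block mappings described in the commit message; the helper names are illustrative, not the connector's actual API:

```python
# Illustrative helpers only; the real connector code differs.
def gpu_block_index(token_idx: int, gpu_block_size: int) -> int:
    """Index of the KVCacheBlock holding this token, per KV cache group."""
    return token_idx // gpu_block_size

def hash_block_index(token_idx: int, hash_block_size: int) -> int:
    """Index into Request.block_hashes for this token."""
    return token_idx // hash_block_size

# Once groups may use block sizes different from cache_config.block_size,
# the two indices for the same token no longer coincide in general:
print(gpu_block_index(100, gpu_block_size=16))    # -> 6
print(hash_block_index(100, hash_block_size=32))  # -> 3
```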
Force-pushed from 96c433a to 6af4887 (Compare)
…llm-project#36610) Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: whycoming <120623296@qq.com>
…llm-project#36610) Signed-off-by: Or Ozeri <oro@il.ibm.com>
### What this PR does / why we need it?
1. fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](vllm-project/vllm#35122)
2. fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](vllm-project/vllm#36027)
3. fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory related torch.cuda APIs](vllm-project/vllm#37031)
4. fix [Support multiple KV groups in OffloadingSpec](vllm-project/vllm#36610): removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor.
5. fix [Consolidate SupportsEagle](vllm-project/vllm#36063): renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
E2E
- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8a68046

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
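A hedged sketch of the adaptation described in item 4, assuming the downstream spec mirrors the upstream field names (illustrative, not the actual vllm-ascend code):

```python
# Illustrative shim showing the scalar -> tuple migration from item 4;
# attribute names are assumptions based on the description above.
class OffloadingSpecShim:
    def __init__(self, gpu_block_sizes: tuple[int, ...],
                 offloaded_block_size: int):
        # Before: self.gpu_block_size was a single int and
        # self.offloaded_block_size was stored directly.
        # After: a tuple of per-group sizes plus a derived factor.
        self.gpu_block_size = gpu_block_sizes
        assert all(s == gpu_block_sizes[0] for s in gpu_block_sizes)
        self.block_size_factor = offloaded_block_size // gpu_block_sizes[0]
```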
This PR extends OffloadingSpec to support multiple KV cache groups, each with its own possible block size.

We now distinguish between:
1. The block size of each KV cache group (determined by the KVCacheConfig).
2. The block size used by vLLM for hashing request tokens (determined by cache_config.block_size).

This will allow the offloading connector to correctly map request tokens to:
1. KVCacheBlocks (using the per-group block sizes from KVCacheConfig)
2. Request.block_hashes (using the hash block size cache_config.block_size)

For now, we keep the offloading connector using hash_block_size as the block size. Later on, we will modify the offloading connector to use the group-specific block sizes.
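For completeness, an illustrative example of where the 'block_size' option discussed in the review would be supplied; the connector name and config shape are assumptions, not verified vLLM API:

```python
# Illustrative configuration only; field values are assumptions.
kv_transfer_config = {
    "kv_connector": "OffloadingConnector",
    "kv_connector_extra_config": {
        # Custom offloaded block size; per the review discussion, all
        # KV cache groups must currently share one GPU block size.
        "block_size": 64,
    },
}
```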