[kv_offload+HMA][2/N]: Support multiple KV groups in GPULoadStoreSpec #36642
Code Review
This pull request extends GPULoadStoreSpec to support multiple KV cache groups by adding group_sizes and block_indices parameters. The changes are well-implemented across the codebase, including updates to tests and connector logic. My main feedback is to improve the robustness of input validation in the GPULoadStoreSpec constructor by using ValueError instead of assert.
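The review's suggestion can be illustrated with a minimal sketch. `GroupedBlockSpec` below is a hypothetical stand-in, not vLLM's actual `GPULoadStoreSpec`; the specific checks (group sizes summing to the number of block IDs, one `block_indices` entry per group) are assumptions drawn from the PR description rather than the real constructor.

```python
from collections.abc import Sequence


class GroupedBlockSpec:
    """Hypothetical stand-in for a multi-group spec; not vLLM's GPULoadStoreSpec."""

    def __init__(
        self,
        block_ids: Sequence[int],
        group_sizes: Sequence[int],
        block_indices: Sequence[int],
    ) -> None:
        # Explicit checks still run under `python -O`, which strips assert statements.
        if any(size < 0 for size in group_sizes):
            raise ValueError("group_sizes must be non-negative")
        if sum(group_sizes) != len(block_ids):
            raise ValueError(
                f"sum(group_sizes)={sum(group_sizes)} does not match "
                f"len(block_ids)={len(block_ids)}"
            )
        # Assumption: one logical start index per group.
        if len(block_indices) != len(group_sizes):
            raise ValueError(
                f"expected {len(group_sizes)} block_indices, got {len(block_indices)}"
            )
        self.block_ids = list(block_ids)
        self.group_sizes = list(group_sizes)
        self.block_indices = list(block_indices)
```

With `assert`, a malformed spec would pass silently when Python runs with optimizations enabled; raising `ValueError` keeps the failure visible in every mode.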
NickLucche left a comment
@orozery Just curious about the choice of using a group_sizes + flattened_block_ids list instead of aligning with the KV manager's tuple(list[int])?
Good question :)
NickLucche left a comment
So we prefer to use a NumPy array instead of list[int] because it is more compact.
I'd say that's fine given it's a very internal interface. LGTM
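The two layouts discussed in this thread carry the same information. As a rough sketch, the helpers below convert between a tuple of per-group lists and the flattened IDs plus `group_sizes` form. `flatten_groups` and `split_groups` are hypothetical names, not functions from vLLM, and plain lists are used to keep the example dependency-free even though the thread mentions a NumPy array for the flat IDs.

```python
from itertools import accumulate


def flatten_groups(groups):
    """Concatenate per-group block IDs and record each group's length."""
    flat = [block_id for group in groups for block_id in group]
    sizes = [len(group) for group in groups]
    return flat, sizes


def split_groups(flat, sizes):
    """Recover the per-group lists from the flat layout."""
    ends = list(accumulate(sizes))   # exclusive end offset of each group
    starts = [0] + ends[:-1]         # start offset of each group
    return tuple(flat[start:end] for start, end in zip(starts, ends))


groups = ([10, 11, 12], [7], [])
flat, sizes = flatten_groups(groups)
print(flat, sizes)                   # [10, 11, 12, 7] [3, 1, 0]
print(split_groups(flat, sizes))     # ([10, 11, 12], [7], [])
```

The flat form stores all IDs in one contiguous sequence, which is what makes a single compact array possible; the per-group view can be rebuilt on demand from the running sums of `group_sizes`.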
This commit extends GPULoadStoreSpec to support multiple KV cache groups. Specifically, the block IDs of all groups are concatenated, and an auxiliary group_sizes list gives the number of blocks in each group.

Additionally, we add a block_indices parameter that encodes the logical block index of the first block in each group. This information is required to support loading from offloaded blocks that are larger than GPU blocks. In such cases, the first GPU block of each group may not be aligned to the offloaded block size, so knowing block_indices[i] allows the worker to correctly skip part of the first matching offloaded block.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
d720518 to 7d05848
…vllm-project#36642) Signed-off-by: Or Ozeri <oro@il.ibm.com>
…vllm-project#36642) Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…vllm-project#36642) Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…vllm-project#36642) Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
This PR extends GPULoadStoreSpec to support multiple KV cache groups.
Specifically, the block IDs of all groups are concatenated, and an auxiliary group_sizes list gives the number of blocks in each group.
Additionally, we add a block_indices parameter that encodes the logical block index of the first block in each group.
This information is required to support loading from offloaded blocks that are larger than GPU blocks.
In such cases, the first GPU block of each group may not be aligned to the offloaded block size, so knowing block_indices[i] allows the worker to correctly skip part of the first matching offloaded block.
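The alignment arithmetic can be sketched as follows. This assumes each offloaded block covers a fixed whole number of GPU blocks; `first_block_skip` and its parameter `gpu_blocks_per_offloaded_block` are hypothetical names for illustration and are not taken from the PR's code.

```python
def first_block_skip(block_indices, gpu_blocks_per_offloaded_block):
    """For each group, return (offloaded block index, GPU-sized chunks to skip in it)."""
    if gpu_blocks_per_offloaded_block <= 0:
        raise ValueError("gpu_blocks_per_offloaded_block must be positive")
    # A logical GPU block index b falls in offloaded block b // k,
    # at offset b % k inside that block (k = GPU blocks per offloaded block).
    return [divmod(index, gpu_blocks_per_offloaded_block) for index in block_indices]


# Three groups whose first GPU blocks have logical indices 5, 8 and 0,
# with one offloaded block spanning 4 GPU blocks.
print(first_block_skip([5, 8, 0], 4))  # [(1, 1), (2, 0), (0, 0)]
```

In this example the first group starts one GPU block into offloaded block 1, so the worker would skip one GPU-block-sized chunk there, while the other two groups start on an offloaded block boundary and skip nothing.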