Disk cache and CPU cache feature #997
Conversation
Summary of Changes
Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a comprehensive multi-level caching feature for KV cache, designed to optimize memory utilization and improve inference performance for large language models. This system enables the dynamic offloading of KV cache data from GPU to CPU memory, and potentially to disk, freeing up valuable GPU resources. It leverages high-performance Triton kernels for data transfers and a robust shared memory infrastructure for efficient metadata management. This allows for intelligent reuse of cached KV data across requests, significantly enhancing the overall efficiency of the inference pipeline.
Highlights
- Multi-Level KV Cache System: I've introduced a multi-level caching system for KV cache, allowing it to be stored in CPU memory and potentially on disk. This is crucial for managing memory efficiently, especially with large language models, by offloading less frequently used KV cache data from the GPU.
- GPU-CPU KV Cache Offloading with Triton: I've implemented Triton kernels to enable high-performance, asynchronous transfers of KV cache data directly between GPU and CPU memory. This ensures that offloading and loading operations are as fast as possible, minimizing performance overhead.
- Configurable Caching Options: I've added new command-line arguments (--enable_cpu_cache, --cpu_cache_storage_size, etc.) to provide flexible configuration for the CPU and disk caching features, allowing users to control cache behavior and capacity.
- Shared Memory Management for Cache Metadata: I've integrated a sophisticated shared memory management system using ctypes and multiprocessing.shared_memory. This includes shared linked lists, dictionaries, and atomic locks to efficiently manage KV cache pages and their metadata across different processes.
- Intelligent KV Cache Reuse: I've enhanced the request handling logic to incorporate token hashing and CPU cache lookup, so incoming requests can leverage previously cached KV data from CPU, reducing redundant computation and improving inference speed for similar prompts (see the sketch after this list).
- Improved Radix Cache Reference Counting: I've updated the radix cache to include more robust reference counting for shared KV nodes. This ensures that shared memory is correctly managed and freed only when no longer referenced, preventing memory leaks and improving stability.
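The KV-reuse flow above hinges on hashing fixed-size chunks of a prompt's token ids and using each chunk hash as a CPU-cache key. Below is a minimal sketch of that idea; the chained SHA-256 scheme and chunk size are illustrative assumptions, not the PR's actual choices.

    # Sketch: hash each full chunk of token ids, chaining in the previous
    # chunk's digest so a chunk's key identifies the whole prefix ending there.
    import hashlib
    from typing import List

    def chunked_token_hashes(token_ids: List[int], chunk_size: int = 256) -> List[int]:
        hashes = []
        prev_digest = b""
        full_len = len(token_ids) - len(token_ids) % chunk_size  # only full chunks
        for start in range(0, full_len, chunk_size):
            chunk = token_ids[start : start + chunk_size]
            digest = hashlib.sha256(prev_digest + str(chunk).encode("utf-8")).digest()
            prev_digest = digest
            hashes.append(int.from_bytes(digest[:8], "little"))
        return hashes

Because each hash covers the entire prefix up to its chunk, a lookup hit on chunk N implies the whole prefix of N chunks can be reused.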
Code Review
This pull request introduces a significant new feature for CPU and disk-based KV caching. The implementation is extensive, touching many parts of the system from low-level Triton kernels to high-level server logic. While the overall direction is good, there are several critical issues that need to be addressed, including incorrect kernel calls, logical errors in loops, and improper shared memory allocation. There are also numerous typos (e.g., 'chuncked' instead of 'chunked') that should be corrected for code clarity and consistency.
lightllm/utils/kv_cache_utils.py (Outdated)

    # Create the shared memory
    key = args.cpu_kv_cache_shm_id  # key for the shared-memory segment
    size = 1024  # size of the shared memory
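For comparison, here is a minimal sketch of sizing the segment from an explicit ctypes layout rather than a hard-coded constant. PageMeta, its fields, and page_num are hypothetical names, not the PR's actual metadata layout.

    # Sketch: derive the shared-memory size from the metadata record layout.
    import ctypes
    from multiprocessing import shared_memory

    class PageMeta(ctypes.Structure):
        # hypothetical per-page metadata record
        _fields_ = [
            ("hash_key", ctypes.c_uint64),
            ("ref_count", ctypes.c_int32),
            ("status", ctypes.c_int32),
        ]

    page_num = 4096  # cache capacity; would normally come from the CLI args
    size = ctypes.sizeof(PageMeta) * page_num  # derived, not hard-coded
    shm = shared_memory.SharedMemory(name="cpu_kv_cache_meta", create=True, size=size)
    pages = (PageMeta * page_num).from_buffer(shm.buf)  # typed view over the segment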
            self.multi_level_cache_manager.cpu_cache_handle_queue.append(trans_task)
        else:
            true_finished_reqs.append(req)
        return true_finished_reqs
The return true_finished_reqs statement is inside the for loop, which will cause the function to exit after processing only the first finished request. This is incorrect and will lead to other finished requests not being processed for CPU cache offloading. The return statement should be moved outside the loop.
Suggested change:

    -        return true_finished_reqs
    +    return true_finished_reqs
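For context, a minimal sketch of the corrected control flow; the helper and queue arguments are illustrative stand-ins for the PR's actual objects.

    # Sketch: the return sits after the loop, so every finished request is
    # considered for CPU-cache offload before the survivors are returned.
    def handle_finished_reqs(finished_reqs, offload_queue, try_make_offload_task):
        true_finished_reqs = []
        for req in finished_reqs:
            trans_task = try_make_offload_task(req)
            if trans_task is not None:
                offload_queue.append(trans_task)
            else:
                true_finished_reqs.append(req)
        return true_finished_reqs  # outside the loop: all requests processed

    # tiny check: with no offloadable requests, everything passes through
    assert handle_finished_reqs([1, 2, 3], [], lambda r: None) == [1, 2, 3]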
    while cur_link_item is not None:
        if cur_link_item.key == key:
            # remove item
            pre_item = cur_link_item.get_pre_item()
            pre_item.next_index = cur_link_item.next_index
            if cur_link_item.next_index != -1:
                next_item = cur_link_item.get_next_item()
                next_item.pre_index = pre_item.self_index
            self.link_items.add_item_to_tail(index=cur_link_item.self_index)
        else:
            cur_link_item = cur_link_item.get_next_item()
The remove method in ShmDict has a bug. After finding and removing an item, the while loop continues without updating cur_link_item, leading to an infinite loop. Assuming keys are unique, the loop should terminate after removal.
Suggested change:

    while cur_link_item is not None:
        if cur_link_item.key == key:
            # remove item
            pre_item = cur_link_item.get_pre_item()
            pre_item.next_index = cur_link_item.next_index
            if cur_link_item.next_index != -1:
                next_item = cur_link_item.get_next_item()
                next_item.pre_index = pre_item.self_index
            self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            return
        else:
            cur_link_item = cur_link_item.get_next_item()
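To make the failure mode concrete, here is a plain-Python analogue of the fixed loop; Node and remove are stand-ins for the shared-memory structures, not the PR's API.

    # Analogue of the fixed removal: without the early return, the walk keeps
    # testing the unlinked node (cur is never advanced) and loops forever.
    class Node:
        def __init__(self, key):
            self.key, self.pre, self.next = key, None, None

    def remove(head, key):
        cur = head.next
        while cur is not None:
            if cur.key == key:
                cur.pre.next = cur.next
                if cur.next is not None:
                    cur.next.pre = cur.pre
                return  # keys are unique: stop instead of re-testing the node
            cur = cur.next

    # build sentinel -> a -> b, then remove "a"
    head, a, b = Node(None), Node("a"), Node("b")
    head.next, a.pre, a.next, b.pre = a, head, b, a
    remove(head, "a")
    assert head.next is b and b.pre is head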
    _offload_gpu_kv_to_cpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )
This function incorrectly calls the _offload_gpu_kv_to_cpu kernel instead of _load_cpu_cache_to_gpu. This is a critical error that will cause data to be moved in the wrong direction. The arguments passed are also incorrect for the intended operation, and the chuncked_size parameter is missing.
Suggested change:

    _load_cpu_cache_to_gpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        chuncked_size=token_num,
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )
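Independent of which kernel is called, a practical prerequisite for either transfer direction is that the CPU-side tensor generally has to live in pinned (page-locked) host memory for a GPU kernel or asynchronous copy to access it. A hedged sketch of allocating such a buffer with PyTorch; the shape is illustrative, not the PR's actual layout.

    # Sketch: pinned CPU KV-cache buffer with an assumed layout of
    # (page_num, layer_num, tokens_per_page, head_num, head_dim).
    import torch

    cpu_kv_cache = torch.empty(
        (16, 4, 64, 8, 128),
        dtype=torch.float16,
        pin_memory=True,  # page-locked so device-side access / async copies work
    )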
    req: Req = req
    finded_page_indexes = []
    for token_chuncked_hash_value in req.token_hash_list.get_all():
        page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)
The method query_one_page returns a tuple (page_index, is_ready), but the result is assigned to a single variable page_index. This will cause incorrect behavior, as page_index will be a tuple; it should be unpacked. Also, the variable token_chuncked_hash_value has a typo.
Suggested change:

    -    page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)
    +    page_index, _ = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)

(The one-line suggestion keeps the existing loop variable name; renaming it to fix the typo would have to be done in the loop header as well.)
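Downstream of this fix, a typical pattern (an assumption about intent, not code from the PR) is to reuse only the longest contiguous matched prefix, stopping at the first miss. The (page_index, is_ready) contract comes from the review above; -1 as the miss sentinel matches the kernel-side check shown later.

    # Sketch: collect CPU-cache pages for the longest contiguous matched prefix.
    def match_prefix_pages(chunk_hashes, query_one_page):
        matched = []
        for chunk_hash in chunk_hashes:
            page_index, is_ready = query_one_page(chunk_hash)
            if page_index == -1 or not is_ready:
                break  # the reusable prefix ends at the first missing/unready page
            matched.append(page_index)
        return matched

    # e.g. pages 7 and 9 are cached and ready, the third chunk misses
    assert match_prefix_pages(
        [111, 222, 333],
        {111: (7, True), 222: (9, True), 333: (-1, False)}.get,
    ) == [7, 9]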
    ) -> Optional["TransTask"]:
        with torch.cuda.stream(cpu_kv_cache_stream):
            all_token_hash_list = req.shm_req.token_hash_list.get_all()
            block_size = req.cur_kv_len // self.args.cpu_cache_token_chuncked_size
    def fill_cpu_cache_to_reqs(self, reqs: List[InferReq]):
        idle_token_num = g_infer_context.get_can_alloc_token_num()
        token_chuncked_size = self.args.cpu_cache_token_chuncked_size
    for req in reqs:
        if req.shm_req.group_req_id == req.shm_req.request_id:
            page_list = req.shm_req.cpu_cache_match_page_indexes.get_all()
            match_tokens = len(page_list) * token_chuncked_size
    if cpu_page_index == -1:
        return

    first_block_start_index = chuncked_size * tl.num_programs(0) - all_move_token_num
    layer_num,
    head_all_dim,
    all_move_token_num,
    chuncked_size,
Force-pushed from 8a77535 to ac80a73
Force-pushed from 87dc04b to 0beb3b3
Force-pushed from 0beb3b3 to 1cc43d8
No description provided.