
Conversation

hiworldwzj (Collaborator)

No description provided.

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented a comprehensive multi-level caching feature for KV cache, designed to optimize memory utilization and improve inference performance for large language models. This system enables the dynamic offloading of KV cache data from GPU to CPU memory, and potentially to disk, freeing up valuable GPU resources. It leverages high-performance Triton kernels for data transfers and a robust shared memory infrastructure for efficient metadata management. This allows for intelligent reuse of cached KV data across requests, significantly enhancing the overall efficiency of the inference pipeline.

Highlights

  • Multi-Level KV Cache System: I've introduced a multi-level caching system for KV cache, allowing it to be stored on CPU memory and potentially disk. This is crucial for managing memory efficiently, especially with large language models, by offloading less frequently used KV cache data from the GPU.
  • GPU-CPU KV Cache Offloading with Triton: I've implemented Triton kernels to enable high-performance, asynchronous transfers of KV cache data directly between GPU and CPU memory. This ensures that offloading and loading operations are as fast as possible, minimizing performance overhead.
  • Configurable Caching Options: I've added new command-line arguments (--enable_cpu_cache, --cpu_cache_storage_size, etc.) to provide flexible configuration for the CPU and disk caching features, allowing users to control cache behavior and capacity.
  • Shared Memory Management for Cache Metadata: I've integrated a sophisticated shared memory management system using ctypes and multiprocessing.shared_memory. This includes shared linked lists, dictionaries, and atomic locks to efficiently manage KV cache pages and their metadata across different processes (a brief sketch of this pattern follows the list below).
  • Intelligent KV Cache Reuse: I've enhanced the request handling logic to incorporate token hashing and CPU cache lookup. Incoming requests can now leverage previously cached KV data from CPU, reducing redundant computation and improving inference speed for similar prompts.
  • Improved Radix Cache Reference Counting: I've updated the radix cache to include more robust reference counting for shared KV nodes. This ensures that shared memory is correctly managed and freed only when no longer referenced, preventing memory leaks and improving stability.
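
To make the shared-metadata idea above concrete, here is a minimal sketch of the pattern (the field names, capacity, and segment name are illustrative, not this PR's actual classes): per-page metadata is laid out as a ctypes Structure inside a multiprocessing.shared_memory segment so that every process maps the same bytes and sees the same state.

    import ctypes
    from multiprocessing import shared_memory

    PAGE_NUM = 4  # hypothetical number of CPU cache pages

    class PageMeta(ctypes.Structure):
        _fields_ = [
            ("hash_value", ctypes.c_uint64),  # token-chunk hash stored in this page
            ("ref_count", ctypes.c_int32),    # readers currently pinning the page
            ("is_ready", ctypes.c_bool),      # offload finished, data is valid
        ]

    class CacheHeader(ctypes.Structure):
        _fields_ = [
            ("lock", ctypes.c_int32),         # guarded via an atomic helper in real code
            ("pages", PageMeta * PAGE_NUM),
        ]

    shm = shared_memory.SharedMemory(
        name="demo_cpu_cache_meta", create=True, size=ctypes.sizeof(CacheHeader)
    )
    meta = CacheHeader.from_buffer(shm.buf)

    # Another process attaches with create=False, maps CacheHeader over the same
    # buffer, and observes every update made here.
    meta.pages[0].hash_value = 0x1234
    meta.pages[0].ref_count = 1
    meta.pages[0].is_ready = True
    # (A real implementation would close and unlink the segment on shutdown.)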
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.
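
For example, commenting /gemini summary on this pull request asks the bot for a fresh summary of the current revision, while /gemini review requests a new code review.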

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant new feature for CPU and disk-based KV caching. The implementation is extensive, touching many parts of the system from low-level Triton kernels to high-level server logic. While the overall direction is good, there are several critical issues that need to be addressed, including incorrect kernel calls, logical errors in loops, and improper shared memory allocation. There are also numerous typos (e.g., 'chuncked' instead of 'chunked') that should be corrected for code clarity and consistency.


# Create the shared memory segment
key = args.cpu_kv_cache_shm_id  # key of the shared memory segment
size = 1024  # size of the shared memory

critical

The shared memory size is hardcoded to 1024 bytes, which is far too small for the KV cache. The actual required size should be calculated using calcu_cpu_cache_meta().calcu_size() and used here.

Suggested change
size = 1024  # size of the shared memory
size = calcu_cpu_cache_meta().calcu_size()  # size of the shared memory
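
For context, a minimal sketch of how the corrected allocation could look, assuming calcu_cpu_cache_meta().calcu_size() is the sizing helper referenced above and using multiprocessing.shared_memory purely for illustration (the PR may create the segment through a different API):

    from multiprocessing import shared_memory

    key = args.cpu_kv_cache_shm_id              # identifier of the shared memory segment
    size = calcu_cpu_cache_meta().calcu_size()  # size derived from the actual cache layout
    shm = shared_memory.SharedMemory(name=str(key), create=True, size=size)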

                self.multi_level_cache_manager.cpu_cache_handle_queue.append(trans_task)
            else:
                true_finished_reqs.append(req)
            return true_finished_reqs

critical

The return true_finished_reqs statement is inside the for loop, which will cause the function to exit after processing only the first finished request. This is incorrect and will lead to other finished requests not being processed for CPU cache offloading. The return statement should be moved outside the loop.

Suggested change
            return true_finished_reqs
        return true_finished_reqs
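
For clarity, an illustrative reconstruction of the intended control flow; the method name, the offload predicate, and the task-building helper below are hypothetical, and only the queue append, the else branch, and the final return come from the snippet above:

    def handle_finished_reqs(self, finished_reqs):            # hypothetical name
        true_finished_reqs = []
        for req in finished_reqs:
            if self._should_offload_to_cpu(req):              # hypothetical predicate
                trans_task = self._build_trans_task(req)      # hypothetical helper
                self.multi_level_cache_manager.cpu_cache_handle_queue.append(trans_task)
            else:
                true_finished_reqs.append(req)
        return true_finished_reqs                             # now outside the for loop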

Comment on lines 186 to 193
        while cur_link_item is not None:
            if cur_link_item.key == key:
                # remove item
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if cur_link_item.next_index != -1:
                    next_item = cur_link_item.get_next_item()
                    next_item.pre_index = pre_item.self_index

                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            else:
                cur_link_item = cur_link_item.get_next_item()

critical

The remove method in ShmDict has a bug. After finding and removing an item, the while loop continues without updating cur_link_item, leading to an infinite loop. Assuming keys are unique, the loop should terminate after removal.

Suggested change
        while cur_link_item is not None:
            if cur_link_item.key == key:
                # remove item
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if cur_link_item.next_index != -1:
                    next_item = cur_link_item.get_next_item()
                    next_item.pre_index = pre_item.self_index
                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            else:
                cur_link_item = cur_link_item.get_next_item()
        while cur_link_item is not None:
            if cur_link_item.key == key:
                # remove item
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if cur_link_item.next_index != -1:
                    next_item = cur_link_item.get_next_item()
                    next_item.pre_index = pre_item.self_index
                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
                return
            else:
                cur_link_item = cur_link_item.get_next_item()
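
As a side note, if duplicate keys ever had to be tolerated, the scan would instead need to snapshot the next node before recycling the current one and keep going; a sketch against the same ShmDict internals shown above (not code from this PR):

        while cur_link_item is not None:
            next_item = cur_link_item.get_next_item() if cur_link_item.next_index != -1 else None
            if cur_link_item.key == key:
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if next_item is not None:
                    next_item.pre_index = pre_item.self_index
                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            cur_link_item = next_item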

Comment on lines 206 to 246
    _offload_gpu_kv_to_cpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )

critical

This function incorrectly calls the _offload_gpu_kv_to_cpu kernel instead of _load_cpu_cache_to_gpu. This is a critical error that will cause data to be moved in the wrong direction. The arguments passed are also incorrect for the intended operation, and the chuncked_size parameter is missing.

    _load_cpu_cache_to_gpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        chuncked_size=token_num,
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )
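
To make the intended direction of movement concrete, here is a PyTorch-only illustration with hypothetical shapes and an assumed (page, layer, token, head, dim) CPU layout; it is not the Triton kernel and only mirrors the gather that the corrected call above is expected to perform:

    import torch

    layer_num, total_tokens, head_num, head_dim = 2, 64, 4, 8
    page_num, page_size = 4, 16

    cpu_kv = torch.randn(page_num, layer_num, page_size, head_num, head_dim, pin_memory=True)
    gpu_kv = torch.zeros(layer_num, total_tokens, head_num, head_dim, device="cuda")

    page_indexes = torch.tensor([2, 0])                       # CPU pages matched for the request
    mem_indexes = torch.arange(2 * page_size, device="cuda")  # destination GPU token slots

    # Gather the matched pages and scatter them into the GPU token slots.
    loaded = cpu_kv[page_indexes].to("cuda")                  # (2, layer, page_size, head, dim)
    loaded = loaded.permute(1, 0, 2, 3, 4).reshape(layer_num, -1, head_num, head_dim)
    gpu_kv[:, mem_indexes] = loaded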

        req: Req = req
        finded_page_indexes = []
        for token_chuncked_hash_value in req.token_hash_list.get_all():
            page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)

high

The method query_one_page returns a tuple (page_index, is_ready), but the result is assigned to a single variable page_index, so page_index ends up holding the tuple rather than the index. The tuple should be unpacked. The variable token_chuncked_hash_value also has a typo; if it is renamed in this call, the loop variable in the for statement above must be renamed as well.

Suggested change
page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)
page_index, _ = self.cpu_cache_client.query_one_page(token_chunked_hash_value)

    ) -> Optional["TransTask"]:
        with torch.cuda.stream(cpu_kv_cache_stream):
            all_token_hash_list = req.shm_req.token_hash_list.get_all()
            block_size = req.cur_kv_len // self.args.cpu_cache_token_chuncked_size

medium

There is a typo in an attribute name. cpu_cache_token_chuncked_size should be cpu_cache_token_chunked_size.

Suggested change
block_size = req.cur_kv_len // self.args.cpu_cache_token_chuncked_size
block_size = req.cur_kv_len // self.args.cpu_cache_token_chunked_size


    def fill_cpu_cache_to_reqs(self, reqs: List[InferReq]):
        idle_token_num = g_infer_context.get_can_alloc_token_num()
        token_chuncked_size = self.args.cpu_cache_token_chuncked_size

medium

There is a typo in a variable name. token_chuncked_size should be token_chunked_size.

Suggested change
token_chuncked_size = self.args.cpu_cache_token_chuncked_size
token_chunked_size = self.args.cpu_cache_token_chunked_size

        for req in reqs:
            if req.shm_req.group_req_id == req.shm_req.request_id:
                page_list = req.shm_req.cpu_cache_match_page_indexes.get_all()
                match_tokens = len(page_list) * token_chuncked_size

medium

There is a typo in a variable name. token_chuncked_size should be token_chunked_size.

Suggested change
match_tokens = len(page_list) * token_chuncked_size
match_tokens = len(page_list) * token_chunked_size

    if cpu_page_index == -1:
        return

    first_block_start_index = chuncked_size * tl.num_programs(0) - all_move_token_num

medium

There is a typo in the variable name chuncked_size. It should be chunked_size.

Suggested change
first_block_start_index = chuncked_size * tl.num_programs(0) - all_move_token_num
first_block_start_index = chunked_size * tl.num_programs(0) - all_move_token_num

layer_num,
head_all_dim,
all_move_token_num,
chuncked_size,

medium

There is a typo in the parameter name chuncked_size. It should be chunked_size for consistency and correctness.

Suggested change
chuncked_size,
chunked_size,

@blueswhen force-pushed the disk_cache_feature branch 4 times, most recently from 8a77535 to ac80a73 on September 19, 2025 at 08:46
@blueswhen force-pushed the disk_cache_feature branch 6 times, most recently from 87dc04b to 0beb3b3 on October 2, 2025 at 15:49