
Conversation

hiworldwzj (Collaborator)

No description provided.

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've implemented a comprehensive multi-level caching feature for KV cache, designed to optimize memory utilization and improve inference performance for large language models. This system enables the dynamic offloading of KV cache data from GPU to CPU memory, and potentially to disk, freeing up valuable GPU resources. It leverages high-performance Triton kernels for data transfers and a robust shared memory infrastructure for efficient metadata management. This allows for intelligent reuse of cached KV data across requests, significantly enhancing the overall efficiency of the inference pipeline.

Highlights

  • Multi-Level KV Cache System: I've introduced a multi-level caching system for KV cache, allowing it to be stored on CPU memory and potentially disk. This is crucial for managing memory efficiently, especially with large language models, by offloading less frequently used KV cache data from the GPU.
  • GPU-CPU KV Cache Offloading with Triton: I've implemented Triton kernels to enable high-performance, asynchronous transfers of KV cache data directly between GPU and CPU memory. This ensures that offloading and loading operations are as fast as possible, minimizing performance overhead.
  • Configurable Caching Options: I've added new command-line arguments (--enable_cpu_cache, --cpu_cache_storage_size, etc.) to provide flexible configuration for the CPU and disk caching features, allowing users to control cache behavior and capacity.
  • Shared Memory Management for Cache Metadata: I've integrated a sophisticated shared memory management system using ctypes and multiprocessing.shared_memory. This includes shared linked lists, dictionaries, and atomic locks to efficiently manage KV cache pages and their metadata across different processes (a brief sketch of this pattern follows the list below).
  • Intelligent KV Cache Reuse: I've enhanced the request handling logic to incorporate token hashing and CPU cache lookup. Incoming requests can now leverage previously cached KV data from CPU, reducing redundant computation and improving inference speed for similar prompts.
  • Improved Radix Cache Reference Counting: I've updated the radix cache to include more robust reference counting for shared KV nodes. This ensures that shared memory is correctly managed and freed only when no longer referenced, preventing memory leaks and improving stability.
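
To make the shared-metadata idea above concrete, here is a minimal sketch of the pattern (the field names, capacity, and segment name are illustrative, not this PR's actual classes): per-page metadata is laid out as a ctypes Structure inside a multiprocessing.shared_memory segment so that every process maps the same bytes and sees the same state.

    import ctypes
    from multiprocessing import shared_memory

    PAGE_NUM = 4  # hypothetical number of CPU cache pages

    class PageMeta(ctypes.Structure):
        _fields_ = [
            ("hash_value", ctypes.c_uint64),  # token-chunk hash stored in this page
            ("ref_count", ctypes.c_int32),    # readers currently pinning the page
            ("is_ready", ctypes.c_bool),      # offload finished, data is valid
        ]

    class CacheHeader(ctypes.Structure):
        _fields_ = [
            ("lock", ctypes.c_int32),         # guarded via an atomic helper in real code
            ("pages", PageMeta * PAGE_NUM),
        ]

    shm = shared_memory.SharedMemory(
        name="demo_cpu_cache_meta", create=True, size=ctypes.sizeof(CacheHeader)
    )
    meta = CacheHeader.from_buffer(shm.buf)

    # Another process attaches with create=False, maps CacheHeader over the same
    # buffer, and observes every update made here.
    meta.pages[0].hash_value = 0x1234
    meta.pages[0].ref_count = 1
    meta.pages[0].is_ready = True
    # (A real implementation would close and unlink the segment on shutdown.)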
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.
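
For example, commenting /gemini summary on this pull request asks the bot for a fresh summary of the current revision, while /gemini review requests a new code review.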

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant new feature for CPU and disk-based KV caching. The implementation is extensive, touching many parts of the system from low-level Triton kernels to high-level server logic. While the overall direction is good, there are several critical issues that need to be addressed, including incorrect kernel calls, logical errors in loops, and improper shared memory allocation. There are also numerous typos (e.g., 'chuncked' instead of 'chunked') that should be corrected for code clarity and consistency.


# Create the shared memory segment
key = args.cpu_kv_cache_shm_id  # key of the shared memory segment
size = 1024  # size of the shared memory

critical

The shared memory size is hardcoded to 1024 bytes, which is far too small for the KV cache. The actual required size should be calculated using calcu_cpu_cache_meta().calcu_size() and used here.

Suggested change
size = 1024  # size of the shared memory
size = calcu_cpu_cache_meta().calcu_size()  # size of the shared memory
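
For context, a minimal sketch of how the corrected allocation could look, assuming calcu_cpu_cache_meta().calcu_size() is the sizing helper referenced above and using multiprocessing.shared_memory purely for illustration (the PR may create the segment through a different API):

    from multiprocessing import shared_memory

    key = args.cpu_kv_cache_shm_id              # identifier of the shared memory segment
    size = calcu_cpu_cache_meta().calcu_size()  # size derived from the actual cache layout
    shm = shared_memory.SharedMemory(name=str(key), create=True, size=size)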

                self.multi_level_cache_manager.cpu_cache_handle_queue.append(trans_task)
            else:
                true_finished_reqs.append(req)
            return true_finished_reqs

critical

The return true_finished_reqs statement is inside the for loop, which will cause the function to exit after processing only the first finished request. This is incorrect and will lead to other finished requests not being processed for CPU cache offloading. The return statement should be moved outside the loop.

Suggested change
            return true_finished_reqs
        return true_finished_reqs
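
For clarity, an illustrative reconstruction of the intended control flow; the method name, the offload predicate, and the task-building helper below are hypothetical, and only the queue append, the else branch, and the final return come from the snippet above:

    def handle_finished_reqs(self, finished_reqs):            # hypothetical name
        true_finished_reqs = []
        for req in finished_reqs:
            if self._should_offload_to_cpu(req):              # hypothetical predicate
                trans_task = self._build_trans_task(req)      # hypothetical helper
                self.multi_level_cache_manager.cpu_cache_handle_queue.append(trans_task)
            else:
                true_finished_reqs.append(req)
        return true_finished_reqs                             # now outside the for loop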

Comment on lines 186 to 193
        while cur_link_item is not None:
            if cur_link_item.key == key:
                # remove item
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if cur_link_item.next_index != -1:
                    next_item = cur_link_item.get_next_item()
                    next_item.pre_index = pre_item.self_index

                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            else:
                cur_link_item = cur_link_item.get_next_item()

critical

The remove method in ShmDict has a bug. After finding and removing an item, the while loop continues without updating cur_link_item, leading to an infinite loop. Assuming keys are unique, the loop should terminate after removal.

Suggested change
        while cur_link_item is not None:
            if cur_link_item.key == key:
                # remove item
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if cur_link_item.next_index != -1:
                    next_item = cur_link_item.get_next_item()
                    next_item.pre_index = pre_item.self_index
                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            else:
                cur_link_item = cur_link_item.get_next_item()
        while cur_link_item is not None:
            if cur_link_item.key == key:
                # remove item
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if cur_link_item.next_index != -1:
                    next_item = cur_link_item.get_next_item()
                    next_item.pre_index = pre_item.self_index
                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
                return
            else:
                cur_link_item = cur_link_item.get_next_item()
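
As a side note, if duplicate keys ever had to be tolerated, the scan would instead need to snapshot the next node before recycling the current one and keep going; a sketch against the same ShmDict internals shown above (not code from this PR):

        while cur_link_item is not None:
            next_item = cur_link_item.get_next_item() if cur_link_item.next_index != -1 else None
            if cur_link_item.key == key:
                pre_item = cur_link_item.get_pre_item()
                pre_item.next_index = cur_link_item.next_index
                if next_item is not None:
                    next_item.pre_index = pre_item.self_index
                self.link_items.add_item_to_tail(index=cur_link_item.self_index)
            cur_link_item = next_item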

Comment on lines 206 to 246
    _offload_gpu_kv_to_cpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )

critical

This function incorrectly calls the _offload_gpu_kv_to_cpu kernel instead of _load_cpu_cache_to_gpu. This is a critical error that will cause data to be moved in the wrong direction. The arguments passed are also incorrect for the intended operation, and the chuncked_size parameter is missing.

    _load_cpu_cache_to_gpu[grid](
        token_indexes_ptr=mem_indexes,
        gpu_kv_cache_ptr=gpu_kv_cache,
        gpu_stride0=gpu_kv_cache.stride(0),
        gpu_stride1=gpu_kv_cache.stride(1),
        gpu_stride2=gpu_kv_cache.stride(2),
        gpu_stride3=gpu_kv_cache.stride(3),
        cpu_kv_cache_ptr=cpu_kv_cache,
        cpu_stride0=cpu_kv_cache.stride(0),
        cpu_stride1=cpu_kv_cache.stride(1),
        cpu_stride2=cpu_kv_cache.stride(2),
        cpu_stride3=cpu_kv_cache.stride(3),
        cpu_stride4=cpu_kv_cache.stride(4),
        page_indexes_ptr=page_indexes,
        layer_num=gpu_kv_cache.shape[0],
        head_all_dim=gpu_kv_cache.shape[-1] * gpu_kv_cache.shape[-2],
        all_move_token_num=len(mem_indexes),
        chuncked_size=token_num,
        BLOCK_HEAD_ALL_DIM=BLOCK_HEAD_ALL_DIM,
        TOKEN_BLOCK=token_block_size,
        num_warps=num_warps,
        num_stages=1,
    )
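
To make the intended direction of movement concrete, here is a PyTorch-only illustration with hypothetical shapes and an assumed (page, layer, token, head, dim) CPU layout; it is not the Triton kernel and only mirrors the gather that the corrected call above is expected to perform:

    import torch

    layer_num, total_tokens, head_num, head_dim = 2, 64, 4, 8
    page_num, page_size = 4, 16

    cpu_kv = torch.randn(page_num, layer_num, page_size, head_num, head_dim, pin_memory=True)
    gpu_kv = torch.zeros(layer_num, total_tokens, head_num, head_dim, device="cuda")

    page_indexes = torch.tensor([2, 0])                       # CPU pages matched for the request
    mem_indexes = torch.arange(2 * page_size, device="cuda")  # destination GPU token slots

    # Gather the matched pages and scatter them into the GPU token slots.
    loaded = cpu_kv[page_indexes].to("cuda")                  # (2, layer, page_size, head, dim)
    loaded = loaded.permute(1, 0, 2, 3, 4).reshape(layer_num, -1, head_num, head_dim)
    gpu_kv[:, mem_indexes] = loaded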

        req: Req = req
        finded_page_indexes = []
        for token_chuncked_hash_value in req.token_hash_list.get_all():
            page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)

high

The method query_one_page returns a tuple (page_index, is_ready), but the result is assigned to a single variable page_index, so page_index ends up holding the tuple rather than the index. The tuple should be unpacked. The variable token_chuncked_hash_value also has a typo; if it is renamed in this call, the loop variable in the for statement above must be renamed as well.

Suggested change
page_index = self.cpu_cache_client.query_one_page(token_chuncked_hash_value)
page_index, _ = self.cpu_cache_client.query_one_page(token_chunked_hash_value)

    ) -> Optional["TransTask"]:
        with torch.cuda.stream(cpu_kv_cache_stream):
            all_token_hash_list = req.shm_req.token_hash_list.get_all()
            block_size = req.cur_kv_len // self.args.cpu_cache_token_chuncked_size

medium

There is a typo in an attribute name. cpu_cache_token_chuncked_size should be cpu_cache_token_chunked_size.

Suggested change
block_size = req.cur_kv_len // self.args.cpu_cache_token_chuncked_size
block_size = req.cur_kv_len // self.args.cpu_cache_token_chunked_size


    def fill_cpu_cache_to_reqs(self, reqs: List[InferReq]):
        idle_token_num = g_infer_context.get_can_alloc_token_num()
        token_chuncked_size = self.args.cpu_cache_token_chuncked_size

medium

There is a typo in a variable name. token_chuncked_size should be token_chunked_size.

Suggested change
token_chuncked_size = self.args.cpu_cache_token_chuncked_size
token_chunked_size = self.args.cpu_cache_token_chunked_size

        for req in reqs:
            if req.shm_req.group_req_id == req.shm_req.request_id:
                page_list = req.shm_req.cpu_cache_match_page_indexes.get_all()
                match_tokens = len(page_list) * token_chuncked_size

medium

There is a typo in a variable name. token_chuncked_size should be token_chunked_size.

Suggested change
match_tokens = len(page_list) * token_chuncked_size
match_tokens = len(page_list) * token_chunked_size

    if cpu_page_index == -1:
        return

    first_block_start_index = chuncked_size * tl.num_programs(0) - all_move_token_num

medium

There is a typo in the variable name chuncked_size. It should be chunked_size.

Suggested change
first_block_start_index = chuncked_size * tl.num_programs(0) - all_move_token_num
first_block_start_index = chunked_size * tl.num_programs(0) - all_move_token_num

layer_num,
head_all_dim,
all_move_token_num,
chuncked_size,

medium

There is a typo in the parameter name chuncked_size. It should be chunked_size for consistency and correctness.

Suggested change
chuncked_size,
chunked_size,

@blueswhen force-pushed the disk_cache_feature branch 4 times, most recently from 8a77535 to ac80a73 on September 19, 2025 at 08:46
@blueswhen force-pushed the disk_cache_feature branch 6 times, most recently from 87dc04b to 0beb3b3 on October 2, 2025 at 15:49