
[Feature] Simple yet General CPU KV Cache Offloading #8743

Open
HF-001 wants to merge 10 commits into vllm-project:main from HF-001:simple_kv_offload

Conversation

@HF-001 (Contributor) commented Apr 27, 2026

What this PR does / why we need it?

Refer to vllm-project/vllm#37160: SimpleCPUOffloadConnector is an alternative design for vLLM's CPU KV cache offloading path. Instead of maintaining a parallel block-management stack, it reuses vLLM's existing BlockPool and KVCacheCoordinator infrastructure directly, which gives us HMA support, prefix caching, and LRU eviction for free.
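
To make the intended usage concrete, here is a minimal sketch of how such a connector is typically enabled through vLLM's KV-transfer configuration. The model name and the `kv_role` value are placeholders, and the exact connector registration may differ in the merged code:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Hypothetical invocation: the connector name comes from the PR description;
# the model and kv_role value are assumptions for illustration.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="SimpleCPUOffloadConnector",
        kv_role="kv_both",
    ),
)
out = llm.generate("The capital of France is")
```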

HF-001 added 2 commits April 27, 2026 14:23
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a native CPU KV cache offloading solution for Ascend NPUs. By integrating with vLLM's existing connector infrastructure, it enables efficient memory management and offloading capabilities, such as prefix caching and LRU eviction, while utilizing NPU-specific stream and event handling for high-performance data movement.
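
To illustrate what "NPU-specific stream and event handling" means here, the sketch below uses torch_npu's CUDA-like stream/event API for an asynchronous block copy with completion polling. It is only an approximation of the flow: the PR's actual backend batches the copies through `aclrtMemcpyBatchAsync`, and all names below are illustrative.

```python
import torch
import torch_npu  # registers the torch.npu device backend

def launch_copy(src_blocks, dst_blocks, event_idx, events_list, stream):
    """Asynchronously copy blocks on a side stream and record a completion
    event (illustrative stand-in for the PR's aclrtMemcpyBatchAsync path)."""
    with torch.npu.stream(stream):
        for src, dst in zip(src_blocks, dst_blocks):
            dst.copy_(src, non_blocking=True)  # overlaps with compute
        event = torch.npu.Event()
        event.record(stream)
    events_list.append((event_idx, event))

def poll_finished(events_list):
    """Return the event indices whose transfers have completed."""
    done, pending = [], []
    for idx, event in events_list:
        (done if event.query() else pending).append((idx, event))
    events_list[:] = pending
    return [idx for idx, _ in done]
```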

Highlights

  • NPU-Native CPU Offloading: Introduced AscendSimpleCPUOffloadConnector to enable CPU KV cache offloading on NPU, leveraging aclrtMemcpyBatchAsync for efficient data transfers.
  • Architecture Reuse: Reused the existing vLLM SimpleCPUOffloadConnector scheduler logic while swapping the worker-side implementation with an NPU-specific worker.
  • End-to-End Testing: Added comprehensive e2e tests to verify accuracy and stability of the CPU offloading path on Ascend hardware.


Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/scripts/config.yaml

@github-actions (Contributor)

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfills the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@gemini-code-assist (Bot) left a comment

Code Review

Suggested PR Title:

```markdown
[vllm-ascend][Ops][Feature] Implement Ascend NPU adaptation for SimpleCPUOffloadConnector
```

Suggested PR Summary:

```markdown
### What this PR does / why we need it?
This PR implements the Ascend NPU adaptation of vLLM's `SimpleCPUOffloadConnector`. It introduces the `SimpleCPUOffloadNPUWorker` and `NPUDmaCopyBackend` to handle KV-cache transfers between NPU and CPU using `aclrtMemcpyBatchAsync`. This enables efficient memory management on Ascend hardware by offloading inactive KV blocks to pinned CPU memory.

Feedback identifies three critical issues:
1. Redundant transfer submissions in `get_finished` that could cause performance degradation and queue overflows.
2. A potential hang in request processing because the mapping of events to requests is not persisted across scheduling steps.
3. A thread-safety race condition on the `events_list` shared between the worker thread and the main thread.

### Does this PR introduce _any_ user-facing change?
Yes, it enables the `SimpleCPUOffloadConnector` for Ascend NPU users, allowing them to utilize CPU offloading for KV caches.

### How was this patch tested?
The patch was tested with new end-to-end tests in `tests/e2e/singlecard/test_simple_cpu_offload.py`, covering accuracy and stability (no-crash) scenarios.
```

Comment on lines +244 to +259
```python
if metadata.load_cpu_blocks:
    self._backend.launch_copy(
        metadata.load_cpu_blocks,
        metadata.load_gpu_blocks,
        is_store=False,
        event_idx=metadata.load_event,
        events_list=self._load_events,
    )
if metadata.store_gpu_blocks:
    self._backend.launch_copy(
        metadata.store_gpu_blocks,
        metadata.store_cpu_blocks,
        is_store=True,
        event_idx=metadata.store_event,
        events_list=self._store_events,
    )
```

critical

Redundant transfer submission. get_finished can be called multiple times within a single scheduling step (e.g., while waiting for other requests). Since metadata.load_cpu_blocks and metadata.store_gpu_blocks are not cleared after the first call, the same transfers will be launched repeatedly on every call to get_finished. This can lead to significant performance degradation, race conditions on KV blocks, and potential NPU command queue overflow.

Suggested change

```diff
 if metadata.load_cpu_blocks:
     self._backend.launch_copy(
         metadata.load_cpu_blocks,
         metadata.load_gpu_blocks,
         is_store=False,
         event_idx=metadata.load_event,
         events_list=self._load_events,
     )
+    # Clear the list to prevent redundant submission in subsequent calls
+    metadata.load_cpu_blocks = []
 if metadata.store_gpu_blocks:
     self._backend.launch_copy(
         metadata.store_gpu_blocks,
         metadata.store_cpu_blocks,
         is_store=True,
         event_idx=metadata.store_event,
         events_list=self._store_events,
     )
+    # Clear the list to prevent redundant submission in subsequent calls
+    metadata.store_gpu_blocks = []
```

Comment thread on vllm_ascend/simple_kv_offload/worker.py (outdated)
Comment on lines +269 to +274
```python
req_ids = (
    metadata.load_event_to_reqs.get(j)
    if metadata is not None
    else None
)
if req_ids:
```

critical

Potential hang for requests loading from CPU. metadata.load_event_to_reqs only contains the mapping for transfers initiated in the current step. If a load transfer from a previous step completes during the current step, metadata.load_event_to_reqs.get(j) will return None (or the wrong mapping), and the scheduler will never be notified that those requests are ready. The worker must maintain its own persistent mapping of event_idx -> req_ids (e.g., in a dictionary initialized in __init__) and populate it during bind_connector_metadata.
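
A minimal sketch of that persistent mapping, assuming the worker and metadata names from the summary above; the method bodies are illustrative, not the PR's actual code:

```python
class SimpleCPUOffloadNPUWorker:
    """Sketch: only the event-to-request bookkeeping is shown."""

    def __init__(self) -> None:
        # Persists across scheduling steps: event index -> request ids.
        self._load_event_to_reqs: dict[int, list[str]] = {}

    def bind_connector_metadata(self, metadata) -> None:
        # Merge this step's mapping into the persistent one so that
        # transfers finishing in a later step can still be attributed.
        self._load_event_to_reqs.update(metadata.load_event_to_reqs)

    def _reqs_for_finished_event(self, event_idx: int):
        # Pop so each completion is reported to the scheduler exactly once.
        return self._load_event_to_reqs.pop(event_idx, None)
```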

```python
copy_blocks(src_blocks, dst_blocks, params)
event = torch.npu.Event()
event.record(stream)
events_list.append((event_idx, event))
```

high

Race condition on events_list. The worker thread appends to events_list while the main thread iterates over or pops from it in _poll_stream_events and _flush_and_sync_all. While CPython lists are generally thread-safe for single operations like append and pop, the multi-step iteration and modification across threads without synchronization is risky and can lead to inconsistent state or RuntimeError. It is recommended to use a thread-safe queue (like queue.SimpleQueue) for completion events, which the main thread can drain into its local list.
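
A sketch of the recommended handoff, assuming only `queue.SimpleQueue` and an event object with a `query()` method (as `torch.npu.Event` provides); the class and method names are made up for illustration:

```python
import queue

class CompletionSink:
    """Worker thread publishes recorded events; the main thread drains
    them into a private list before querying, avoiding shared mutation."""

    def __init__(self) -> None:
        self._queue: queue.SimpleQueue = queue.SimpleQueue()
        self._pending: list = []  # touched by the main thread only

    def publish(self, event_idx: int, event) -> None:
        # Worker thread: called right after event.record(stream).
        self._queue.put((event_idx, event))

    def poll(self) -> list[int]:
        # Main thread: drain the queue, then query events locally.
        while True:
            try:
                self._pending.append(self._queue.get_nowait())
            except queue.Empty:
                break
        done, pending = [], []
        for idx, event in self._pending:
            (done if event.query() else pending).append((idx, event))
        self._pending = pending
        return [idx for idx, _ in done]
```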

Signed-off-by: HF-001 <1670186653@qq.com>
@HF-001 force-pushed the simple_kv_offload branch from 64e4791 to a215201 on April 27, 2026 at 06:57
HF-001 added 6 commits April 27, 2026 14:57
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
@HF-001 force-pushed the simple_kv_offload branch from d112d57 to 6fbecad on April 27, 2026 at 09:29