[Feature] Simple yet General CPU KV Cache Offloading #8743
HF-001 wants to merge 10 commits into vllm-project:main
Conversation
Signed-off-by: HF-001 <1670186653@qq.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements a native CPU KV cache offloading solution for Ascend NPUs. By integrating with vLLM's existing connector infrastructure, it enables efficient memory management and offloading capabilities, such as prefix caching and LRU eviction, while utilizing NPU-specific stream and event handling for high-performance data movement.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
Suggested PR Title:

[vllm-ascend][Ops][Feature] Implement Ascend NPU adaptation for SimpleCPUOffloadConnector

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements the Ascend NPU adaptation of vLLM's `SimpleCPUOffloadConnector`. It introduces the `SimpleCPUOffloadNPUWorker` and `NPUDmaCopyBackend` to handle KV-cache transfers between NPU and CPU using `aclrtMemcpyBatchAsync`. This enables efficient memory management on Ascend hardware by offloading inactive KV blocks to pinned CPU memory.

Feedback identifies three critical issues:
1. Redundant transfer submissions in `get_finished` that could cause performance degradation and queue overflows.
2. A potential hang in request processing because the mapping of events to requests is not persisted across scheduling steps.
3. A thread-safety race condition on the `events_list` shared between the worker thread and the main thread.

### Does this PR introduce _any_ user-facing change?
Yes, it enables the `SimpleCPUOffloadConnector` for Ascend NPU users, allowing them to utilize CPU offloading for KV caches.

### How was this patch tested?
The patch was tested with new end-to-end tests in `tests/e2e/singlecard/test_simple_cpu_offload.py`, covering accuracy and stability (no-crash) scenarios.
```python
if metadata.load_cpu_blocks:
    self._backend.launch_copy(
        metadata.load_cpu_blocks,
        metadata.load_gpu_blocks,
        is_store=False,
        event_idx=metadata.load_event,
        events_list=self._load_events,
    )
if metadata.store_gpu_blocks:
    self._backend.launch_copy(
        metadata.store_gpu_blocks,
        metadata.store_cpu_blocks,
        is_store=True,
        event_idx=metadata.store_event,
        events_list=self._store_events,
    )
```
Redundant transfer submission. get_finished can be called multiple times within a single scheduling step (e.g., while waiting for other requests). Since metadata.load_cpu_blocks and metadata.store_gpu_blocks are not cleared after the first call, the same transfers will be launched repeatedly on every call to get_finished. This can lead to significant performance degradation, race conditions on KV blocks, and potential NPU command queue overflow.
Suggested change:

```diff
 if metadata.load_cpu_blocks:
     self._backend.launch_copy(
         metadata.load_cpu_blocks,
         metadata.load_gpu_blocks,
         is_store=False,
         event_idx=metadata.load_event,
         events_list=self._load_events,
     )
+    # Clear the lists to prevent redundant submission in subsequent calls
+    metadata.load_cpu_blocks = []
 if metadata.store_gpu_blocks:
     self._backend.launch_copy(
         metadata.store_gpu_blocks,
         metadata.store_cpu_blocks,
         is_store=True,
         event_idx=metadata.store_event,
         events_list=self._store_events,
     )
+    # Clear the lists to prevent redundant submission in subsequent calls
+    metadata.store_gpu_blocks = []
```
```python
req_ids = (
    metadata.load_event_to_reqs.get(j)
    if metadata is not None
    else None
)
if req_ids:
```
Potential hang for requests loading from CPU. metadata.load_event_to_reqs only contains the mapping for transfers initiated in the current step. If a load transfer from a previous step completes during the current step, metadata.load_event_to_reqs.get(j) will return None (or the wrong mapping), and the scheduler will never be notified that those requests are ready. The worker must maintain its own persistent mapping of event_idx -> req_ids (e.g., in a dictionary initialized in __init__) and populate it during bind_connector_metadata.
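A rough sketch of the fix this comment suggests. The class and method names mirror the PR (`SimpleCPUOffloadNPUWorker`, `bind_connector_metadata`), but the attribute `_load_event_to_reqs` and the helper `finished_load_reqs` are illustrative assumptions, not the PR's actual code:

```python
class SimpleCPUOffloadNPUWorker:
    """Sketch: only the pieces relevant to tracking load events."""

    def __init__(self) -> None:
        # Persistent across scheduling steps: event index -> request ids.
        self._load_event_to_reqs: dict[int, list[str]] = {}

    def bind_connector_metadata(self, metadata) -> None:
        # Record the mapping when transfers are submitted, so completions
        # that arrive in a later scheduling step can still be attributed.
        self._load_event_to_reqs.update(metadata.load_event_to_reqs)

    def finished_load_reqs(self, finished_event_indices) -> set[str]:
        done: set[str] = set()
        for j in finished_event_indices:
            # pop() so each event is consumed and reported exactly once.
            req_ids = self._load_event_to_reqs.pop(j, None)
            if req_ids:
                done.update(req_ids)
        return done
```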
```python
copy_blocks(src_blocks, dst_blocks, params)
event = torch.npu.Event()
event.record(stream)
events_list.append((event_idx, event))
```
Race condition on events_list. The worker thread appends to events_list while the main thread iterates over or pops from it in _poll_stream_events and _flush_and_sync_all. While CPython lists are generally thread-safe for single operations like append and pop, the multi-step iteration and modification across threads without synchronization is risky and can lead to inconsistent state or RuntimeError. It is recommended to use a thread-safe queue (like queue.SimpleQueue) for completion events, which the main thread can drain into its local list.
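A minimal sketch of the queue-based pattern the comment recommends, assuming the torch_npu stack so that `torch.npu.Event` is available; the function names `on_copy_submitted` and `drain_pending` are illustrative:

```python
import queue

import torch  # assumes torch_npu is installed so torch.npu is available

# Thread-safe channel from the worker thread to the main thread.
done_events: queue.SimpleQueue = queue.SimpleQueue()

def on_copy_submitted(event_idx: int, stream) -> None:
    """Worker-thread side: record an event on the copy stream and enqueue it."""
    event = torch.npu.Event()
    event.record(stream)
    done_events.put((event_idx, event))  # SimpleQueue.put is thread-safe

def drain_pending() -> list:
    """Main-thread side: drain completions into a local list before polling,
    so iteration never races with concurrent puts from the worker."""
    pending = []
    while True:
        try:
            pending.append(done_events.get_nowait())
        except queue.Empty:
            break
    return pending
```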
What this PR does / why we need it?
Refer to vllm-project/vllm#37160. SimpleCPUOffloadConnector is an alternative design for vLLM's CPU KV cache offloading path. Instead of maintaining a parallel block-management stack, it reuses vLLM's existing BlockPool and KVCacheCoordinator infrastructure directly, which gives us HMA support, prefix caching, and LRU eviction for free.
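For context, vLLM connectors are typically enabled through `KVTransferConfig`. A hypothetical sketch: the connector name comes from this PR's description, while the model and `kv_role` values are illustrative assumptions:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Hypothetical usage; field values are assumptions, not this PR's tested config.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # any model; illustrative choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="SimpleCPUOffloadConnector",
        kv_role="kv_both",
    ),
)
```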