[Feature] Simple yet General CPU KV Cache Offloading #8743
HF-001 wants to merge 10 commits into vllm-project:main
Conversation
Signed-off-by: HF-001 <1670186653@qq.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request implements a native CPU KV cache offloading solution for Ascend NPUs. By integrating with vLLM's existing connector infrastructure, it enables efficient memory management and offloading capabilities, such as prefix caching and LRU eviction, while utilizing NPU-specific stream and event handling for high-performance data movement.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
Suggested PR Title:

[vllm-ascend][Ops][Feature] Implement Ascend NPU adaptation for SimpleCPUOffloadConnector

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements the Ascend NPU adaptation of vLLM's `SimpleCPUOffloadConnector`. It introduces the `SimpleCPUOffloadNPUWorker` and `NPUDmaCopyBackend` to handle KV-cache transfers between NPU and CPU using `aclrtMemcpyBatchAsync`. This enables efficient memory management on Ascend hardware by offloading inactive KV blocks to pinned CPU memory.

Feedback identifies three critical issues:
1. Redundant transfer submissions in `get_finished` that could cause performance degradation and queue overflows.
2. A potential hang in request processing because the mapping of events to requests is not persisted across scheduling steps.
3. A thread-safety race condition on the `events_list` shared between the worker thread and the main thread.

### Does this PR introduce _any_ user-facing change?
Yes, it enables the `SimpleCPUOffloadConnector` for Ascend NPU users, allowing them to utilize CPU offloading for KV caches.

### How was this patch tested?
The patch was tested with new end-to-end tests in `tests/e2e/singlecard/test_simple_cpu_offload.py`, covering accuracy and stability (no-crash) scenarios.
```python
if metadata.load_cpu_blocks:
    self._backend.launch_copy(
        metadata.load_cpu_blocks,
        metadata.load_gpu_blocks,
        is_store=False,
        event_idx=metadata.load_event,
        events_list=self._load_events,
    )
if metadata.store_gpu_blocks:
    self._backend.launch_copy(
        metadata.store_gpu_blocks,
        metadata.store_cpu_blocks,
        is_store=True,
        event_idx=metadata.store_event,
        events_list=self._store_events,
    )
```
Redundant transfer submission. get_finished can be called multiple times within a single scheduling step (e.g., while waiting for other requests). Since metadata.load_cpu_blocks and metadata.store_gpu_blocks are not cleared after the first call, the same transfers will be launched repeatedly on every call to get_finished. This can lead to significant performance degradation, race conditions on KV blocks, and potential NPU command queue overflow.
Suggested change:

```diff
 if metadata.load_cpu_blocks:
     self._backend.launch_copy(
         metadata.load_cpu_blocks,
         metadata.load_gpu_blocks,
         is_store=False,
         event_idx=metadata.load_event,
         events_list=self._load_events,
     )
+    # Clear the lists to prevent redundant submission in subsequent calls
+    metadata.load_cpu_blocks = []
 if metadata.store_gpu_blocks:
     self._backend.launch_copy(
         metadata.store_gpu_blocks,
         metadata.store_cpu_blocks,
         is_store=True,
         event_idx=metadata.store_event,
         events_list=self._store_events,
     )
+    # Clear the lists to prevent redundant submission in subsequent calls
+    metadata.store_gpu_blocks = []
```
```python
req_ids = (
    metadata.load_event_to_reqs.get(j)
    if metadata is not None
    else None
)
if req_ids:
```
Potential hang for requests loading from CPU. metadata.load_event_to_reqs only contains the mapping for transfers initiated in the current step. If a load transfer from a previous step completes during the current step, metadata.load_event_to_reqs.get(j) will return None (or the wrong mapping), and the scheduler will never be notified that those requests are ready. The worker must maintain its own persistent mapping of event_idx -> req_ids (e.g., in a dictionary initialized in __init__) and populate it during bind_connector_metadata.
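A rough sketch of the fix this comment suggests. The class and method names mirror the PR (`SimpleCPUOffloadNPUWorker`, `bind_connector_metadata`), but the attribute `_load_event_to_reqs` and the helper `finished_load_reqs` are illustrative assumptions, not the PR's actual code:

```python
class SimpleCPUOffloadNPUWorker:
    """Sketch: only the pieces relevant to tracking load events."""

    def __init__(self) -> None:
        # Persistent across scheduling steps: event index -> request ids.
        self._load_event_to_reqs: dict[int, list[str]] = {}

    def bind_connector_metadata(self, metadata) -> None:
        # Record the mapping when transfers are submitted, so completions
        # that arrive in a later scheduling step can still be attributed.
        self._load_event_to_reqs.update(metadata.load_event_to_reqs)

    def finished_load_reqs(self, finished_event_indices) -> set[str]:
        done: set[str] = set()
        for j in finished_event_indices:
            # pop() so each event is consumed and reported exactly once.
            req_ids = self._load_event_to_reqs.pop(j, None)
            if req_ids:
                done.update(req_ids)
        return done
```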
```python
copy_blocks(src_blocks, dst_blocks, params)
event = torch.npu.Event()
event.record(stream)
events_list.append((event_idx, event))
```
Race condition on events_list. The worker thread appends to events_list while the main thread iterates over or pops from it in _poll_stream_events and _flush_and_sync_all. While CPython lists are generally thread-safe for single operations like append and pop, the multi-step iteration and modification across threads without synchronization is risky and can lead to inconsistent state or RuntimeError. It is recommended to use a thread-safe queue (like queue.SimpleQueue) for completion events, which the main thread can drain into its local list.
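A minimal sketch of the queue-based pattern the comment recommends, assuming the torch_npu stack so that `torch.npu.Event` is available; the function names `on_copy_submitted` and `drain_pending` are illustrative:

```python
import queue

import torch  # assumes torch_npu is installed so torch.npu is available

# Thread-safe channel from the worker thread to the main thread.
done_events: queue.SimpleQueue = queue.SimpleQueue()

def on_copy_submitted(event_idx: int, stream) -> None:
    """Worker-thread side: record an event on the copy stream and enqueue it."""
    event = torch.npu.Event()
    event.record(stream)
    done_events.put((event_idx, event))  # SimpleQueue.put is thread-safe

def drain_pending() -> list:
    """Main-thread side: drain completions into a local list before polling,
    so iteration never races with concurrent puts from the worker."""
    pending = []
    while True:
        try:
            pending.append(done_events.get_nowait())
        except queue.Empty:
            break
    return pending
```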
What this PR does / why we need it?
Refer to vllm-project/vllm#37160. SimpleCPUOffloadConnector is an alternative design for vLLM's CPU KV cache offloading path. Instead of maintaining a parallel block-management stack, it reuses vLLM's existing BlockPool and KVCacheCoordinator infrastructure directly, which gives us HMA support, prefix caching, and LRU eviction for free.
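For context, vLLM connectors are typically enabled through `KVTransferConfig`. A hypothetical sketch: the connector name comes from this PR's description, while the model and `kv_role` values are illustrative assumptions:

```python
from vllm import LLM
from vllm.config import KVTransferConfig

# Hypothetical usage; field values are assumptions, not this PR's tested config.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # any model; illustrative choice
    kv_transfer_config=KVTransferConfig(
        kv_connector="SimpleCPUOffloadConnector",
        kv_role="kv_both",
    ),
)
```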