[Feature] enable indexCache on npu #8398
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces IndexCache support for Ascend NPUs within the vLLM-Ascend framework. By allowing the reuse of top-k indices across layers, the changes optimize sparse attention computations, resulting in notable performance gains in both token throughput and latency, as demonstrated by the provided benchmarks.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
Suggested PR Title:
[Attention][Feature] Implement index caching for Sparse Flash Attention

Suggested PR Summary:
### What this PR does / why we need it?
This PR implements an index caching mechanism for Sparse Flash Attention (SFA) on Ascend. It introduces `skip_topk` and `use_index_cache` logic to allow the model to reuse previously computed top-k indices stored in a buffer, reducing computation overhead.
Feedback:
- A redundant assignment of `self.use_index_cache` in the `__init__` method should be removed for clarity.
- An `info` level log statement in the `forward` method should be removed or downgraded to `debug` to avoid performance degradation in the hot path.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with existing tests.

```python
self.skip_topk = kwargs.get("skip_topk", False)
self.use_index_cache = self.skip_topk
self.topk_indices_buffer = kwargs.get("topk_indices_buffer")
```
The assignment self.use_index_cache = self.skip_topk at line 421 is redundant and potentially misleading because self.use_index_cache is re-initialized a few lines later (line 435) with the correct logic that considers both the layer-specific skip_topk and the global use_index_cache configuration. Removing this line improves code clarity and maintainability.
Suggested change:

```diff
 self.skip_topk = kwargs.get("skip_topk", False)
-self.use_index_cache = self.skip_topk
 self.topk_indices_buffer = kwargs.get("topk_indices_buffer")
```
I have removed the redundant assignment: `self.use_index_cache = self.skip_topk`.
```python
if self.skip_topk:
    topk_indices = self._get_indexcache_topk_indices(topk_num_tokens)
    logger.info("--- skip topk_indices --- ")
```
Logging at info level within the model's forward method (hot path) will cause significant performance degradation and log spam, as it executes for every layer in every forward pass. This log statement should be removed or changed to debug level to avoid impacting throughput and latency in production environments.
Suggested change:

```diff
 if self.skip_topk:
     topk_indices = self._get_indexcache_topk_indices(topk_num_tokens)
-    logger.info("--- skip topk_indices --- ")
```
The logging has been removed.
Signed-off-by: wuyuefeng <565948592@qq.com>
Force-pushed from dcb32cf to 3fadb79.
Motivation
This PR adds the Ascend NPU adaptation for IndexCache to vLLM-Ascend, based on the upstream IndexCache work. It is an optimization for DSA models on NPU that can significantly improve throughput and reduce E2E latency.
Modifications
This PR includes NPU-oriented integration and adaptation for IndexCache in vLLM-Ascend.
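At a high level, layers that skip top-k selection read previously computed indices from a shared buffer instead of recomputing them. Below is a minimal, hypothetical sketch of that pattern; names such as `skip_topk` and `topk_indices_buffer` follow the snippets quoted in the review above, this is not the literal vllm-ascend implementation, and the per-layer scheduling is only one plausible reading of `index_topk_freq`:

```python
import torch


class SparseAttentionIndexer:
    """Illustrative only: reuse cached top-k indices on most layers."""

    def __init__(self, layer_idx: int, index_topk_freq: int,
                 topk_indices_buffer: torch.Tensor) -> None:
        # Assumed reading: only every `index_topk_freq`-th layer recomputes
        # top-k; the other layers reuse what the last recompute layer wrote.
        self.skip_topk = (layer_idx % index_topk_freq) != 0
        self.topk_indices_buffer = topk_indices_buffer

    def get_topk_indices(self, scores: torch.Tensor, topk: int) -> torch.Tensor:
        if self.skip_topk:
            # Reuse cached indices instead of re-running top-k selection.
            return self.topk_indices_buffer[:, :topk]
        # Recompute and refresh the shared buffer for the following layers.
        indices = scores.topk(topk, dim=-1).indices
        self.topk_indices_buffer[:, :topk] = indices
        return indices
```

The shared buffer is the key design point: the expensive top-k over attention scores runs on only a fraction of the layers, while the remaining layers pay just a buffer read.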
IndexCache can be enabled through HF overrides, for example:
--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}'Accuracy Tests
Accuracy Tests

I evaluated accuracy on the C-Eval dataset. The baseline score is 0.9008, while `indexCache=4` (reusing 3/4 of the indices) achieves 0.9063.
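The "3/4" figure follows directly from the frequency setting, assuming `index_topk_freq=N` means top-k is recomputed on one of every N layers and reused on the rest:

```python
# Hypothetical check of the reuse fraction implied by index_topk_freq=4.
freq = 4
reuse_fraction = (freq - 1) / freq
print(reuse_fraction)  # 0.75 -> "reusing 3/4 of the indices"
```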
Benchmarking and Profiling
Benchmark model:
IndexCache configuration:
--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}'Concurrency 1
Command summary:
Concurrency 3
Command summary:
Benchmark Summary
With `use_index_cache=true` and `index_topk_freq=4`, the GLM-5 W8A8 workload shows consistent performance improvement:

Concurrency 1:
Concurrency 3:
No failed requests were observed in either baseline or IndexCache runs.
Checklist
Format your code according to Format code with pre-commit.
Update documentation according to Write documentations (if required).
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
vLLM version: v0.19.0
vLLM main: vllm-project/vllm@6f786f2