[Feature] enable index Cache for npu #8324
ChefWu551 wants to merge 1 commit into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates IndexCache functionality into the vLLM-Ascend stack. By allowing attention layers to share and reuse top-k indices, the changes reduce redundant computations during model inference, aligning the NPU implementation with upstream vLLM improvements.
Code Review
This pull request introduces a mechanism to share topk_indices across layers in AscendSFAMetadata using a skip_topk flag to optimize performance. A critical issue was identified where the shared indices are incorrectly reused despite being dependent on layer-specific hidden_states.
Suggested PR Title:
[Attention][Ops][Feature] Support shared top-k indices in SFA
Suggested PR Summary:
### What this PR does / why we need it?
This PR adds a `shared_topk_indices` field to `AscendSFAMetadata` and a `skip_topk` flag to the attention and MLA modules. This allows layers to reuse previously computed top-k indices to reduce redundant computations.
Feedback: A critical flaw was identified where `topk_indices` are computed from layer-specific `hidden_states`, making the reuse of these indices across layers mathematically incorrect.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.

```python
if self.skip_topk and attn_metadata.shared_topk_indices is not None:
    topk_indices = attn_metadata.shared_topk_indices
else:
    topk_indices = self.indexer_select_post_process(
        x=hidden_states,
        q_c=q_c,
        kv_cache=kv_cache,
        attn_metadata=attn_metadata,
        cos=cos,
        sin=sin,
        actual_seq_lengths_query=actual_seq_lengths_query,
        actual_seq_lengths_key=actual_seq_lengths_key,
    )
    attn_metadata.shared_topk_indices = topk_indices
```
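For context, the following is a minimal, self-contained sketch of the reuse pattern in the diff above. The metadata class, tensor shapes, and the stand-in top-k selection are simplified assumptions for illustration, not the actual AscendSFAMetadata or indexer_select_post_process implementation.

```python
# Minimal standalone sketch of the caching pattern above; the metadata class,
# tensor shapes, and the stand-in top-k computation are simplified assumptions,
# not the actual vLLM-Ascend implementation.
from dataclasses import dataclass
from typing import Optional
import torch


@dataclass
class FakeSFAMetadata:
    shared_topk_indices: Optional[torch.Tensor] = None  # cache shared across layers


def select_topk(hidden_states: torch.Tensor, k: int) -> torch.Tensor:
    # Stand-in for indexer_select_post_process: score tokens and keep top-k.
    scores = hidden_states.norm(dim=-1)
    return scores.topk(k, dim=-1).indices


def layer_forward(hidden_states, metadata, skip_topk: bool, k: int = 4):
    if skip_topk and metadata.shared_topk_indices is not None:
        return metadata.shared_topk_indices      # "shared" layer: reuse cached indices
    indices = select_topk(hidden_states, k)      # "full" layer: recompute
    metadata.shared_topk_indices = indices       # publish for later layers
    return indices


meta = FakeSFAMetadata()
x = torch.randn(2, 16, 8)                        # (batch, tokens, hidden)
full = layer_forward(x, meta, skip_topk=False)   # computes and caches
shared = layer_forward(x, meta, skip_topk=True)  # reuses cached indices
assert torch.equal(full, shared)
```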
The current implementation of IndexCache appears to have a critical flaw. The topk_indices are cached and reused across layers, but their computation in indexer_select_post_process depends on hidden_states, which is unique to each layer.
Specifically, indexer_select_post_process uses x (which is hidden_states) to compute weights via `weights, _ = self.weights_proj(x)`. These weights are then used to determine topk_indices. Since hidden_states differ from one layer to the next, the topk_indices will also be different. Reusing them will lead to incorrect attention calculations.
For IndexCache to work correctly, the computation of topk_indices must be based on tensors that are shared across the layers intended to use the cache. This might require passing a shared tensor to indexer_select_post_process instead of the per-layer hidden_states.
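To make the concern concrete, here is a toy, hedged illustration (hypothetical shapes and a stand-in weights_proj, not the real SFA code): top-k indices derived from layer-specific activations generally differ between layers, so reusing layer 0's indices in layer 1 selects different tokens than layer 1 would have chosen on its own.

```python
# Toy illustration only; hypothetical shapes and a stand-in projection,
# not the actual SFA indexer code.
import torch

torch.manual_seed(0)
tokens, hidden, k = 16, 8, 4

hidden_layer0 = torch.randn(tokens, hidden)   # per-layer hidden_states
hidden_layer1 = torch.randn(tokens, hidden)
weights_proj = torch.nn.Linear(hidden, 1)     # stand-in for self.weights_proj

idx0 = weights_proj(hidden_layer0).squeeze(-1).topk(k).indices
idx1 = weights_proj(hidden_layer1).squeeze(-1).topk(k).indices

print(sorted(idx0.tolist()), sorted(idx1.tolist()))  # typically not the same sets
```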
Thanks for the review. This behavior is intentional and matches upstream IndexCache semantics (vLLM PR #37735, issue #37684).
IndexCache is an approximate optimization: “full” layers compute top-k indices, while “shared” layers reuse cached indices to reduce redundant computation.
In our implementation, reuse is only enabled when skip_topk=True; otherwise indices are computed per layer as usual.
We’ll also attach accuracy/performance results to quantify the tradeoff.
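To illustrate the intended split, a hedged sketch of a layer-role policy follows; the recompute-every-N heuristic below is a hypothetical example for illustration, not the upstream IndexCache policy.

```python
# Hedged sketch of the "full" vs. "shared" layer split described above; the
# recompute-every-N policy is a hypothetical example, not the upstream
# IndexCache heuristic.
def skip_topk_for_layer(layer_idx: int, recompute_every: int = 4) -> bool:
    # "Full" layers (layer_idx % recompute_every == 0) recompute top-k indices
    # and refresh the cache; all other ("shared") layers reuse the cached ones.
    return layer_idx % recompute_every != 0


roles = ["shared" if skip_topk_for_layer(i) else "full" for i in range(8)]
print(roles)  # ['full', 'shared', 'shared', 'shared', 'full', ...]
```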
This PR has been closed due to implementation issues. PR #8398 has fixed the corresponding functionality and provided relevant benchmark data, showing an improvement of 16%–18%.
Motivation
This PR adds the Ascend NPU adaptation for IndexCache in vLLM-Ascend, based on the upstream vLLM IndexCache work (vLLM PR #37735, issue #37684).
Modifications
This PR includes NPU-oriented integration and adaptation for IndexCache in vLLM-Ascend.
Accuracy Tests
Benchmarking and Profiling
Checklist
Format your code according to Format code with pre-commit.
Update documentation according to Write documentations (if required).
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed (pending final data upload).
Follow the SGLang code style guidance.
vLLM version:
vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0