
[Feature] enable indexCache on npu #8398

Merged
wangxiyuan merged 3 commits into vllm-project:main from ChefWu551:enable-indexCache on May 7, 2026

Conversation

@ChefWu551 (Contributor) commented on Apr 17, 2026

Motivation

This PR implements the corresponding NPU adaptation based on the upstream IndexCache work. It is an optimization for DSA models on NPU that can significantly improve throughput and reduce end-to-end (E2E) latency.

Modifications

This PR adds NPU-oriented integration and adaptation for IndexCache in vLLM-Ascend.

IndexCache can be enabled through HF overrides, for example:

--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}'
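For offline runs, the same overrides can in principle be passed programmatically. A minimal sketch, assuming vllm.LLM forwards an hf_overrides mapping to the model config the same way the --hf-overrides CLI flag does (the model path is the one benchmarked below):

```python
# Hedged sketch: enabling IndexCache for offline inference.
# Assumes the vllm.LLM constructor accepts an hf_overrides dict
# mirroring the --hf-overrides CLI flag.
from vllm import LLM

llm = LLM(
    model="/nas/disk1/GLM-5-w8a8",
    hf_overrides={"use_index_cache": True, "index_topk_freq": 4},
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```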

Accuracy Tests

I evaluated accuracy on the C-Eval dataset. The baseline score is 0.9008, while IndexCache with index_topk_freq=4 (reusing 3/4 of the indices) achieves 0.9063.

| Setting | Description | C-Eval Score |
|---|---|---|
| Baseline | IndexCache disabled | 0.9008 |
| IndexCache = 4 | Reuse 3/4 of indices | 0.9063 |
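To make the 3/4 figure concrete, an illustrative sketch (not the PR's code) of the reuse cadence, assuming index_topk_freq = 4 means top-k indices are recomputed on one layer in every group of four and reused on the other three; the helper name is hypothetical:

```python
# Illustrative only: which layers recompute vs. reuse top-k indices
# when index_topk_freq = 4.
def recomputes_topk(layer_idx: int, index_topk_freq: int = 4) -> bool:
    return layer_idx % index_topk_freq == 0

print(["recompute" if recomputes_topk(i) else "reuse" for i in range(8)])
# ['recompute', 'reuse', 'reuse', 'reuse', 'recompute', 'reuse', 'reuse', 'reuse']
```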

Benchmarking and Profiling

Benchmark model:

/nas/disk1/GLM-5-w8a8

IndexCache configuration:

--hf-overrides '{"use_index_cache": true, "index_topk_freq": 4}'

Concurrency 1

Command summary:

```bash
vllm bench serve \
  --backend openai-chat \
  --model glm-5 \
  --served-model-name glm-5 \
  --tokenizer /nas/disk1/GLM-5-w8a8 \
  --dataset-name random \
  --max-concurrency 1 \
  --num-prompts 10 \
  --random-input-len 20480 \
  --random-output-len 100
```
| Metric | Baseline | IndexCache freq=4 | Change |
|---|---|---|---|
| Successful requests | 10 | 10 | - |
| Failed requests | 0 | 0 | - |
| Benchmark duration | 158.08 s | 135.64 s | -14.20% |
| Request throughput | 0.06 req/s | 0.07 req/s | +16.67% |
| Output token throughput | 5.62 tok/s | 6.55 tok/s | +16.55% |
| Total token throughput | 1289.57 tok/s | 1502.87 tok/s | +16.54% |
| Mean TTFT | 10757.46 ms | 8922.62 ms | -17.06% |
| Median TTFT | 10733.79 ms | 9207.87 ms | -14.22% |
| P99 TTFT | 14907.90 ms | 12111.08 ms | -18.76% |
| Mean TPOT | 57.53 ms | 52.86 ms | -8.12% |
| Median TPOT | 57.31 ms | 52.70 ms | -8.04% |
| P99 TPOT | 59.28 ms | 54.47 ms | -8.11% |
| Mean ITL | 56.87 ms | 52.27 ms | -8.09% |
| Median ITL | 57.49 ms | 52.79 ms | -8.18% |
| P99 ITL | 59.28 ms | 54.34 ms | -8.33% |
| Mean E2EL | 15807.51 ms | 13563.86 ms | -14.19% |
| Median E2EL | 15465.32 ms | 13375.34 ms | -13.51% |
| P99 E2EL | 20480.54 ms | 17271.27 ms | -15.67% |

Concurrency 3

Command summary:

```bash
vllm bench serve \
  --backend openai-chat \
  --model glm-5 \
  --served-model-name glm-5 \
  --tokenizer /nas/disk1/GLM-5-w8a8 \
  --dataset-name random \
  --max-concurrency 3 \
  --num-prompts 30 \
  --random-input-len 20480 \
  --random-output-len 100
```
| Metric | Baseline | IndexCache freq=4 | Change |
|---|---|---|---|
| Successful requests | 30 | 30 | - |
| Failed requests | 0 | 0 | - |
| Benchmark duration | 373.53 s | 319.61 s | -14.44% |
| Request throughput | 0.08 req/s | 0.09 req/s | +12.50% |
| Output token throughput | 8.34 tok/s | 9.74 tok/s | +16.79% |
| Total token throughput | 1645.72 tok/s | 1923.37 tok/s | +16.87% |
| Mean TTFT | 13091.83 ms | 11148.76 ms | -14.84% |
| Median TTFT | 11481.13 ms | 9499.50 ms | -17.26% |
| P99 TTFT | 31325.00 ms | 26006.29 ms | -16.98% |
| Mean TPOT | 247.58 ms | 210.51 ms | -14.97% |
| Median TPOT | 228.46 ms | 195.56 ms | -14.40% |
| P99 TPOT | 578.27 ms | 476.18 ms | -17.65% |
| Mean ITL | 231.70 ms | 198.65 ms | -14.26% |
| Median ITL | 60.78 ms | 55.68 ms | -8.39% |
| P99 ITL | 2300.33 ms | 1783.22 ms | -22.48% |
| Mean E2EL | 37142.37 ms | 31768.29 ms | -14.47% |
| Median E2EL | 35100.75 ms | 30648.91 ms | -12.68% |
| P99 E2EL | 61257.49 ms | 51345.84 ms | -16.18% |

Benchmark Summary

With use_index_cache=true and index_topk_freq=4, the GLM-5 W8A8 workload shows consistent performance improvement:

  • Concurrency 1:

    • total token throughput improves by 16.54%
    • mean TTFT improves by 17.06%
    • mean E2E latency improves by 14.19%
    • mean TPOT improves by 8.12%
  • Concurrency 3:

    • total token throughput improves by 16.87%
    • mean TTFT improves by 14.84%
    • mean E2E latency improves by 14.47%
    • mean TPOT improves by 14.97%
    • P99 ITL improves by 22.48%

No failed requests were observed in either baseline or IndexCache runs.
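As a quick sanity check, the reported deltas can be reproduced from the raw numbers in the tables above:

```python
# Reproduce two of the reported deltas from the concurrency-1 table.
def pct_change(baseline: float, optimized: float) -> float:
    return (optimized - baseline) / baseline * 100.0

print(f"{pct_change(1289.57, 1502.87):+.2f}%")   # +16.54% (total token throughput)
print(f"{pct_change(10757.46, 8922.62):+.2f}%")  # -17.06% (mean TTFT)
```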


@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces IndexCache support for Ascend NPUs within the vLLM-Ascend framework. By allowing the reuse of top-k indices across layers, the changes optimize sparse attention computations, resulting in notable performance gains in both token throughput and latency as demonstrated by the provided benchmarks.

Highlights

  • IndexCache Integration: Implemented Ascend NPU adaptation for IndexCache, enabling significant throughput improvements and reduced E2E latency for DSA models.
  • MLA and SFA Compatibility: Added skip_topk support to the Ascend MLA wrapper and integrated the shared topk_indices_buffer into the Ascend Sparse Flash Attention (SFA) implementation.
  • Shape Compatibility: Implemented explicit shape conversion for cached top-k indices to bridge the gap between upstream buffer formats and Ascend sparse attention requirements.
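As a rough illustration of the buffer-sharing and shape-conversion points above, a hedged sketch follows; all names, shapes, and dtypes here are assumptions for illustration, not the PR's actual API:

```python
import torch

# A shared buffer sized for the maximum batch, written by the layer that
# recomputes top-k indices (sizes here are made up for demonstration).
max_tokens, topk = 16, 3
topk_indices_buffer = torch.full((max_tokens, topk), -1, dtype=torch.int32)

# The recompute layer fills the rows for the current step's tokens...
num_tokens = 4
topk_indices_buffer[:num_tokens] = torch.randint(
    0, 128, (num_tokens, topk), dtype=torch.int32
)

# ...and reuse layers slice out the valid rows and convert them to the
# layout the Ascend sparse attention kernel expects (assumed layout).
cached = topk_indices_buffer[:num_tokens].contiguous()
print(cached.shape)  # torch.Size([4, 3])
```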



@github-actions (Contributor)

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist (Bot) left a comment

Code Review

Suggested PR Title:

[Attention][Feature] Implement index caching for Sparse Flash Attention

Suggested PR Summary:

### What this PR does / why we need it?
This PR implements an index caching mechanism for Sparse Flash Attention (SFA) on Ascend. It introduces `skip_topk` and `use_index_cache` logic to allow the model to reuse previously computed top-k indices stored in a buffer, reducing computation overhead. 

Feedback:
- A redundant assignment of `self.use_index_cache` in the `__init__` method should be removed for clarity.
- An `info` level log statement in the `forward` method should be removed or downgraded to `debug` to avoid performance degradation in the hot path.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with existing tests.

Comment on lines +420 to +422:

```python
self.skip_topk = kwargs.get("skip_topk", False)
self.use_index_cache = self.skip_topk
self.topk_indices_buffer = kwargs.get("topk_indices_buffer")
```
Severity: high

The assignment self.use_index_cache = self.skip_topk at line 421 is redundant and potentially misleading because self.use_index_cache is re-initialized a few lines later (line 435) with the correct logic that considers both the layer-specific skip_topk and the global use_index_cache configuration. Removing this line improves code clarity and maintainability.

Suggested change:

```diff
 self.skip_topk = kwargs.get("skip_topk", False)
-self.use_index_cache = self.skip_topk
 self.topk_indices_buffer = kwargs.get("topk_indices_buffer")
```
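For context, a minimal sketch of the initialization pattern the review describes; only skip_topk, use_index_cache, and topk_indices_buffer come from the PR, while the class name and the global flag are assumptions:

```python
# Hypothetical reconstruction, not the PR's actual code.
class SparseAttnLayer:
    def __init__(self, global_use_index_cache: bool, **kwargs):
        self.skip_topk = kwargs.get("skip_topk", False)
        self.topk_indices_buffer = kwargs.get("topk_indices_buffer")
        # Single, authoritative initialization: the layer-level skip flag
        # combined with the global index-cache configuration (the logic
        # the reviewer notes already exists a few lines later).
        self.use_index_cache = self.skip_topk and global_use_index_cache
```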

@ChefWu551 (Author) replied on Apr 17, 2026:


I have removed the duplicate code: `self.use_index_cache = self.skip_topk`.

Comment thread: vllm_ascend/attention/sfa_v1.py (outdated)
Comment on lines +1261 to +1263:

```python
if self.skip_topk:
    topk_indices = self._get_indexcache_topk_indices(topk_num_tokens)
    logger.info("--- skip topk_indices --- ")
```
Severity: high

Logging at info level within the model's forward method (hot path) will cause significant performance degradation and log spam, as it executes for every layer in every forward pass. This log statement should be removed or changed to debug level to avoid impacting throughput and latency in production environments.

Suggested change:

```diff
 if self.skip_topk:
     topk_indices = self._get_indexcache_topk_indices(topk_num_tokens)
-    logger.info("--- skip topk_indices --- ")
```

@ChefWu551 (Author) replied:

The logging has been removed.

@ChefWu551 changed the title from "enable indexCache on npu" to "[Feature] enable indexCache on npu" on Apr 17, 2026
Signed-off-by: wuyuefeng <565948592@qq.com>
@wangxiyuan wangxiyuan merged commit ba074eb into vllm-project:main May 7, 2026
51 checks passed
