[Feature] ngram NPU implementation compatible with async scheduler #8337
HF-001 wants to merge 19 commits into vllm-project:main from
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a high-performance NPU implementation of the N-gram speculative decoding operator, designed to be compatible with the asynchronous scheduler in vLLM-Ascend. By moving the N-gram matching logic to the NPU, the implementation significantly reduces latency and improves throughput for speculative decoding workloads. The changes include the necessary kernel definitions, host-side tiling, and integration into the model runner to handle asynchronous state updates and data synchronization correctly.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review

Suggested PR Title: [Ops][Feature] Add N-gram speculative decoding support for Ascend NPU

Suggested PR Summary:

### What this PR does / why we need it?
This pull request introduces a new custom NPU operator, npu_ngram_spec_decode, and integrates it into the vLLM speculative decoding framework for Ascend devices. The implementation includes the Ascend C kernel, tiling logic, ACLNN wrappers, and PyTorch bindings. Review feedback indicates that the N-gram matching logic should be optimized to prioritize the longest match (iterating from max_n down to min_n) to improve acceptance rates. Additionally, a bug was identified in the suffix offset calculation for short sequences, and it is recommended to replace the scalar fallback path with a unified SIMD approach to maintain performance when processing multiple rows.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the ngram_gpu speculative decoding method for Ascend NPU users.

### How was this patch tested?
The PR includes new E2E accuracy tests comparing the NPU kernel against a CPU golden reference and correctness tests within the vLLM speculative decoding pipeline.

```cpp
for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
    if (ngram_len > nt) break;
```

The N-gram matching logic in ProcessChunkedRows iterates from min_n to max_n and breaks on the first match found. This results in selecting the shortest match, which is suboptimal for speculative decoding. To maximize the number of speculative tokens and improve the acceptance rate, the algorithm should find the longest match by iterating from max_n down to min_n and breaking on the first success. Additionally, the break condition should be changed to continue when ngram_len > nt so that shorter N-grams are still checked.

```cpp
for (int32_t ngram_len = this->max_n_val; ngram_len >= this->min_n_val; --ngram_len) {
    if (ngram_len > nt) continue;
```

```cpp
int32_t wc = nt - ngram_len;
if (wc <= 0) break;

int32_t suffix_offset = this->max_n_val - ngram_len;
```

The suffix_offset calculation is incorrect when the sequence length nt is less than max_n_val. In this scenario, suffix_gm_start is clamped to 0, and suffixLocal contains elements from the start of the sequence. The correct offset for a suffix of length ngram_len should be relative to the actual number of elements loaded into suffixLocal, which is min(nt, max_n_val). Using a fixed max_n_val leads to incorrect matching for short sequences whose length is less than the maximum N-gram size.

```cpp
int32_t suffix_len = (nt < this->max_n_val) ? nt : this->max_n_val;
int32_t suffix_offset = suffix_len - ngram_len;
```

```cpp
for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
    if (ngram_len > nt) break;
```

Similar to the large-row path, the SIMD path in ComputeOneRow should iterate from max_n down to min_n to find the longest match first. Additionally, a break should be added after the cmp_off loop to exit the ngram_len loop once a match is found, avoiding redundant computation on shorter N-grams after a longer match has already been identified.

```cpp
for (int32_t ngram_len = this->max_n_val; ngram_len >= this->min_n_val; --ngram_len) {
    if (ngram_len > nt) continue;
```

```cpp
// Scalar path: multiple rows reside in the UB; read element by element with GetValue
int32_t row_base = static_cast<int32_t>(idx) * static_cast<int32_t>(msa);

for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
    if (ngram_len > num_tokens_tmp) break;

    for (int32_t s = 0; s < ngram_len; ++s) {
        suffixLocal.SetValue(static_cast<uint32_t>(s),
                             tokenLocal.GetValue(static_cast<uint32_t>(
                                 row_base + num_tokens_tmp - ngram_len + s)));
    }

    int32_t max_pos = num_tokens_tmp - ngram_len - 1;
    for (int32_t pos = 0; pos <= max_pos; ++pos) {
        bool match = true;
        for (int32_t s = 0; s < ngram_len; ++s) {
            if (tokenLocal.GetValue(static_cast<uint32_t>(row_base + pos + s))
                    != suffixLocal.GetValue(static_cast<uint32_t>(s))) {
                match = false;
                break;
            }
        }
        if (match) {
            best_match_pos = pos;
            best_ngram_len = ngram_len;
            break;
        }
    }
}
```

The kernel falls back to a slow scalar path whenever block_rows > 1. AI Cores are not optimized for scalar loops, and this fallback will significantly degrade performance for batches where multiple rows fit into the Unified Buffer (e.g., shorter sequence lengths). The SIMD path using CompareScalar should be used in all cases by correctly calculating the row offset in tokenLocal as idx * msa. This would unify the logic and ensure high performance across all sequence lengths.
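The first two fixes above — iterating from the longest N-gram down, and clamping the suffix offset to the number of tokens actually loaded — can be sketched as plain host-side C++. This is a minimal illustration, not the kernel's code: the function and variable names (FindLongestMatch, suffix_local) are assumptions for the sketch.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the suggested matching order: try ngram_len from max_n down to
// min_n so the longest repeated suffix wins, and compute the suffix offset
// relative to the number of tokens actually loaded (min(nt, max_n)), not a
// fixed max_n. Returns the earliest match position, or -1 if none.
static int32_t FindLongestMatch(const std::vector<int32_t>& tokens,
                                int32_t min_n, int32_t max_n,
                                int32_t* matched_len) {
    const int32_t nt = static_cast<int32_t>(tokens.size());
    // suffixLocal analogue: the last min(nt, max_n) tokens of the row.
    const int32_t suffix_len = (nt < max_n) ? nt : max_n;
    const std::vector<int32_t> suffix_local(tokens.end() - suffix_len, tokens.end());

    for (int32_t ngram_len = max_n; ngram_len >= min_n; --ngram_len) {
        if (ngram_len > nt) continue;  // too long for this row; try a shorter n-gram
        // Clamped offset: where the length-ngram_len suffix starts in suffix_local.
        const int32_t suffix_offset = suffix_len - ngram_len;
        const int32_t max_pos = nt - ngram_len - 1;  // exclude the trailing self-match
        for (int32_t pos = 0; pos <= max_pos; ++pos) {
            bool match = true;
            for (int32_t s = 0; s < ngram_len; ++s) {
                if (tokens[pos + s] != suffix_local[suffix_offset + s]) {
                    match = false;
                    break;
                }
            }
            if (match) {  // first hit at this ngram_len is the longest match overall
                *matched_len = ngram_len;
                return pos;
            }
        }
    }
    *matched_len = 0;
    return -1;  // no n-gram of any allowed length repeats earlier in the row
}
```

With the original min-to-max order, a row such as [1, 2, 3, 1, 2, 3, 1, 2] would stop at a 2-gram match; the max-to-min order finds the 3-gram repeat instead, yielding more speculative tokens per step.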
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Force-pushed 8fe6b90 to 6ed8f5e
Signed-off-by: HF-001 <1670186653@qq.com>
@wangxiyuan @weijinqian0 Hi, this PR is ready. Would it be convenient for you to take a look?
Force-pushed 54b3389 to d1d5e0e
What this PR does / why we need it?
Refer to #7311. Building on that previous PR, performance was improved by optimizing the operators.
How was this patch tested?
ngram_gpu + async_scheduling script:

```shell
vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 \
    --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
    --max-model-len 32768 --trust-remote-code \
    --no-enable-prefix-caching \
    --async-scheduling \
    --speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'
```
ngram script:

```shell
vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 \
    --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
    --max-model-len 32768 --trust-remote-code \
    --no-enable-prefix-caching \
    --speculative_config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'
```
test script:

```shell
vllm bench serve \
    --port 8898 \
    --backend vllm \
    --model /model/Qwen3-14B \
    --endpoint /v1/completions \
    --dataset-name sonnet \
    --dataset-path /model/vllm/benchmarks/sonnet.txt \
    --request-rate 2.0 \
    --sonnet-input-len 128 \
    --sonnet-output-len 100 \
    --sonnet-prefix-len 10 \
    --num-prompts 40 \
    --ignore-eos \
    --percentile-metrics "ttft,tpot,itl,e2el"
```
test results:
- ngram + sync: TTFT 330 ms, TPOT 39 ms
- ngram + async (this PR): TTFT 136 ms, TPOT 33 ms
In cooperation with @CarterDuan
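For reviewers unfamiliar with the ngram/ngram_gpu method: it is plain prompt-lookup speculation — find where the current context suffix last repeated, and propose the tokens that followed it as the draft. A minimal CPU reference of that idea (the kind of golden model the E2E tests compare the kernel against; names here are hypothetical, not the PR's test code) might look like:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for n-gram (prompt-lookup) speculation:
// search for the longest suffix of `tokens` (between min_n and max_n long)
// that also occurs earlier, and return up to num_spec tokens that followed
// that earlier occurrence as the draft proposal.
static std::vector<int32_t> ProposeDraftTokens(const std::vector<int32_t>& tokens,
                                               int32_t min_n, int32_t max_n,
                                               int32_t num_spec) {
    const int32_t nt = static_cast<int32_t>(tokens.size());
    for (int32_t n = std::min(max_n, nt); n >= min_n; --n) {  // longest match first
        const int32_t* suffix = tokens.data() + nt - n;
        for (int32_t pos = 0; pos + n < nt; ++pos) {          // exclude the suffix itself
            bool match = true;
            for (int32_t s = 0; s < n; ++s) {
                if (tokens[pos + s] != suffix[s]) { match = false; break; }
            }
            if (!match) continue;
            // Draft tokens are whatever followed the matched n-gram last time.
            std::vector<int32_t> draft;
            for (int32_t t = pos + n; t < nt && t < pos + n + num_spec; ++t) {
                draft.push_back(tokens[t]);
            }
            return draft;
        }
    }
    return {};  // no match: propose nothing this step
}
```

In the benchmark above, prompt_lookup_min/prompt_lookup_max play the role of min_n/max_n, and num_speculative_tokens the role of num_spec.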