
[Feature] ngram NPU implementation compatible with async scheduler #8337

Open
HF-001 wants to merge 19 commits into vllm-project:main from HF-001:ngram_async_dev2

Conversation

@HF-001
Contributor

@HF-001 HF-001 commented Apr 16, 2026

What this PR does / why we need it?

Refer to #7311. Building on that previous PR, performance was improved by optimizing the operators.

How was this patch tested?

ngram_gpu + async_scheduling script:

vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 \
  --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --trust-remote-code \
  --no-enable-prefix-caching \
  --async-scheduling \
  --speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'

ngram script:

vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 \
  --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --trust-remote-code \
  --no-enable-prefix-caching \
  --speculative_config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'

test script:

vllm bench serve \
  --port 8898 \
  --backend vllm \
  --model /model/Qwen3-14B \
  --endpoint /v1/completions \
  --dataset-name sonnet \
  --dataset-path /model/vllm/benchmarks/sonnet.txt \
  --request-rate 2.0 \
  --sonnet-input-len 128 \
  --sonnet-output-len 100 \
  --sonnet-prefix-len 10 \
  --num-prompts 40 \
  --ignore-eos \
  --percentile-metrics "ttft,tpot,itl,e2el"

test results:
ngram + sync: ttft 330 ms, tpot 39 ms
ngram + async (this PR): ttft 136 ms, tpot 33 ms

In cooperation with @CarterDuan.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a high-performance NPU implementation of the N-gram speculative decoding operator, specifically designed to be compatible with the asynchronous scheduler in vLLM-Ascend. By moving the N-gram matching logic to the NPU, the implementation significantly reduces latency and improves throughput for speculative decoding workloads. The changes include the necessary kernel definitions, host-side tiling, and integration into the model runner to handle asynchronous state updates and data synchronization correctly.

Highlights

  • NPU Operator Implementation: Implemented the NgramSpecDecode operator for NPU, including host-side tiling logic and device-side Ascend C kernel to support speculative decoding.
  • Async Scheduler Compatibility: Updated the model runner to support asynchronous scheduling with the new NPU-based N-gram speculative decoding, ensuring correct state management and synchronization.
  • Performance Optimization: Optimized operator performance by enabling chunked processing for large sequences and integrating with the async scheduler to reduce latency.
  • Testing and Validation: Added comprehensive end-to-end accuracy tests for the new operator, comparing NPU results against a CPU-based golden reference implementation.


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

Suggested PR Title: [Ops][Feature] Add N-gram speculative decoding support for Ascend NPU

Suggested PR Summary:

### What this PR does / why we need it?

This pull request introduces a new custom NPU operator, npu_ngram_spec_decode, and integrates it into the vLLM speculative decoding framework for Ascend devices. The implementation includes the Ascend C kernel, tiling logic, ACLNN wrappers, and PyTorch bindings. Review feedback indicates that the N-gram matching logic should be optimized to prioritize the longest match (iterating from max_n down to min_n) to improve acceptance rates. Additionally, a bug was identified in the suffix offset calculation for short sequences, and it is recommended to replace the scalar fallback path with a unified SIMD approach to maintain performance when processing multiple rows.

### Does this PR introduce _any_ user-facing change?

Yes, it adds the ngram_gpu speculative decoding method for Ascend NPU users.

### How was this patch tested?

The PR includes new E2E accuracy tests comparing the NPU kernel against a CPU golden reference and correctness tests within the vLLM speculative decoding pipeline.

Comment on lines +222 to +223
for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
if (ngram_len > nt) break;
Contributor


high

The N-gram matching logic in ProcessChunkedRows iterates from min_n to max_n and breaks on the first match found. This results in selecting the shortest match, which is suboptimal for speculative decoding. To maximize the number of speculative tokens and improve the acceptance rate, the algorithm should find the longest match by iterating from max_n down to min_n and breaking on the first success. Additionally, the break condition should be changed to continue when ngram_len > nt to allow checking shorter N-grams.

                for (int32_t ngram_len = this->max_n_val; ngram_len >= this->min_n_val; --ngram_len) {
                    if (ngram_len > nt) continue;
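The longest-first search the reviewer suggests can be sketched as a plain-Python reference (the helper name `find_longest_ngram_match` is illustrative, not part of the PR):

```python
def find_longest_ngram_match(tokens, min_n, max_n):
    """Return (match_pos, ngram_len) for the longest n-gram suffix of
    tokens that reappears earlier in the sequence, or (-1, 0) if none."""
    nt = len(tokens)
    # Walk from the longest n-gram down: the first hit is the longest match.
    for ngram_len in range(max_n, min_n - 1, -1):
        if ngram_len > nt:
            continue  # sequence too short for this length; try a shorter one
        suffix = tokens[nt - ngram_len:]
        # Scan earlier windows; the range excludes the suffix's own position,
        # mirroring max_pos = num_tokens - ngram_len - 1 in the kernel.
        for pos in range(nt - ngram_len):
            if tokens[pos:pos + ngram_len] == suffix:
                return pos, ngram_len
    return -1, 0
```

In prompt-lookup speculation, the tokens following the matched position are then proposed as draft tokens, so a longer match generally yields more useful proposals.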

int32_t wc = nt - ngram_len;
if (wc <= 0) break;

int32_t suffix_offset = this->max_n_val - ngram_len;
Contributor


high

The suffix_offset calculation is incorrect when the sequence length nt is less than max_n_val. In this scenario, suffix_gm_start is clamped to 0, and suffixLocal contains elements from the start of the sequence. The correct offset for a suffix of length ngram_len should be relative to the actual number of elements loaded into suffixLocal (which is min(nt, max_n_val)). Using a fixed max_n_val leads to incorrect matching for short sequences where the sequence length is less than the maximum N-gram size.

                    int32_t suffix_len = (nt < this->max_n_val) ? nt : this->max_n_val;
                    int32_t suffix_offset = suffix_len - ngram_len;
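The corrected offset arithmetic can be checked with a small sketch (the function name is illustrative, not from the PR):

```python
def suffix_offset_in_buffer(nt: int, max_n: int, ngram_len: int) -> int:
    """Offset of an ngram_len-long suffix inside a buffer holding the
    last min(nt, max_n) tokens of the sequence.

    The fixed formula max_n - ngram_len over-shoots whenever nt < max_n,
    because the buffer then only contains nt elements.
    """
    suffix_len = min(nt, max_n)  # number of tokens actually loaded
    return suffix_len - ngram_len
```

For example, with nt = 2 and max_n = 4, a 2-gram suffix starts at offset 0 in the loaded buffer, while the fixed formula would give 2 and index past the loaded data.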

Comment on lines +493 to +494
for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
if (ngram_len > nt) break;
Contributor


high

Similar to the large row path, the SIMD path in ComputeOneRow should iterate from max_n down to min_n to find the longest match first. Additionally, a break condition should be added after the cmp_off loop to exit the ngram_len loop once a match is found, avoiding redundant computations for shorter N-grams after a longer match has already been identified.

                for (int32_t ngram_len = this->max_n_val; ngram_len >= this->min_n_val; --ngram_len) {
                    if (ngram_len > nt) continue;

Comment on lines +553 to +582
// Scalar path: multiple rows reside in UB; read element-wise with GetValue
int32_t row_base = static_cast<int32_t>(idx) * static_cast<int32_t>(msa);

for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
if (ngram_len > num_tokens_tmp) break;

for (int32_t s = 0; s < ngram_len; ++s) {
suffixLocal.SetValue(static_cast<uint32_t>(s),
tokenLocal.GetValue(static_cast<uint32_t>(
row_base + num_tokens_tmp - ngram_len + s)));
}

int32_t max_pos = num_tokens_tmp - ngram_len - 1;
for (int32_t pos = 0; pos <= max_pos; ++pos) {
bool match = true;
for (int32_t s = 0; s < ngram_len; ++s) {
if (tokenLocal.GetValue(static_cast<uint32_t>(row_base + pos + s))
!= suffixLocal.GetValue(static_cast<uint32_t>(s))) {
match = false;
break;
}
}
if (match) {
best_match_pos = pos;
best_ngram_len = ngram_len;
break;
}
}
}
}
Contributor


high

The kernel falls back to a slow scalar path whenever block_rows > 1. AI Cores are not optimized for scalar loops, and this fallback will significantly degrade performance for batches where multiple rows fit into the Unified Buffer (e.g., shorter sequence lengths). The SIMD path using CompareScalar should be used for all cases by correctly calculating the row offset in tokenLocal using idx * msa. This would unify the logic and ensure high performance across all sequence lengths.
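The row-offset arithmetic this comment refers to can be illustrated with a minimal sketch, where a flat Python list stands in for the Unified Buffer (names are illustrative):

```python
def row_view(flat_buffer, idx, msa, num_tokens):
    """Return row idx's first num_tokens elements from a flat buffer laid
    out with a per-row stride of msa (mirrors row_base = idx * msa).

    Computing the base this way lets one code path address any row, so a
    single SIMD-style loop can serve both the one-row and multi-row cases.
    """
    row_base = idx * msa
    return flat_buffer[row_base:row_base + num_tokens]
```

With this addressing in place, the per-row matching kernel never needs a separate scalar fallback for block_rows > 1; only the base offset changes per row.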

@github-actions
Contributor

This pull request has conflicts; please resolve them before we can evaluate the pull request.

Signed-off-by: kx <1670186653@qq.com>
@HF-001 HF-001 force-pushed the ngram_async_dev2 branch from 8fe6b90 to 6ed8f5e on April 16, 2026 09:15
@HF-001
Contributor Author

HF-001 commented Apr 17, 2026

@wangxiyuan @weijinqian0 Hi, this PR is ready. Would it be convenient for you to take a look?

HF-001 added 2 commits April 17, 2026 17:05
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
@HF-001 HF-001 force-pushed the ngram_async_dev2 branch from 54b3389 to d1d5e0e on April 17, 2026 09:18
