[Feature] ngram NPU implementation compatible with async scheduler #8337
HF-001 wants to merge 19 commits into vllm-project:main from
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Summary of Changes (Gemini Code Assist): This pull request introduces a high-performance NPU implementation of the N-gram speculative decoding operator, designed to be compatible with the asynchronous scheduler in vLLM-Ascend. By moving the N-gram matching logic to the NPU, the implementation significantly reduces latency and improves throughput for speculative decoding workloads. The changes include the necessary kernel definitions, host-side tiling, and integration into the model runner to handle asynchronous state updates and data synchronization correctly.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review

Suggested PR Title: [Ops][Feature] Add N-gram speculative decoding support for Ascend NPU

Suggested PR Summary:

### What this PR does / why we need it?
This pull request introduces a new custom NPU operator, npu_ngram_spec_decode, and integrates it into the vLLM speculative decoding framework for Ascend devices. The implementation includes the Ascend C kernel, tiling logic, ACLNN wrappers, and PyTorch bindings. Review feedback indicates that the N-gram matching logic should be optimized to prioritize the longest match (iterating from max_n down to min_n) to improve acceptance rates. Additionally, a bug was identified in the suffix offset calculation for short sequences, and it is recommended to replace the scalar fallback path with a unified SIMD approach to maintain performance when processing multiple rows.

### Does this PR introduce _any_ user-facing change?
Yes, it adds the ngram_gpu speculative decoding method for Ascend NPU users.

### How was this patch tested?
The PR includes new E2E accuracy tests comparing the NPU kernel against a CPU golden reference and correctness tests within the vLLM speculative decoding pipeline.

```cpp
for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
    if (ngram_len > nt) break;
```

The N-gram matching logic in ProcessChunkedRows iterates from min_n to max_n and breaks on the first match found. This results in selecting the shortest match, which is suboptimal for speculative decoding. To maximize the number of speculative tokens and improve the acceptance rate, the algorithm should find the longest match by iterating from max_n down to min_n and breaking on the first success. Additionally, the break condition should be changed to continue when ngram_len > nt so that shorter N-grams are still checked.

```cpp
for (int32_t ngram_len = this->max_n_val; ngram_len >= this->min_n_val; --ngram_len) {
    if (ngram_len > nt) continue;
```

```cpp
int32_t wc = nt - ngram_len;
if (wc <= 0) break;

int32_t suffix_offset = this->max_n_val - ngram_len;
```

The suffix_offset calculation is incorrect when the sequence length nt is less than max_n_val. In this scenario, suffix_gm_start is clamped to 0, and suffixLocal contains elements from the start of the sequence. The correct offset for a suffix of length ngram_len should be relative to the actual number of elements loaded into suffixLocal, which is min(nt, max_n_val). Using a fixed max_n_val leads to incorrect matching for short sequences whose length is less than the maximum N-gram size.

```cpp
int32_t suffix_len = (nt < this->max_n_val) ? nt : this->max_n_val;
int32_t suffix_offset = suffix_len - ngram_len;
```

```cpp
for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
    if (ngram_len > nt) break;
```

Similar to the large-row path, the SIMD path in ComputeOneRow should iterate from max_n down to min_n to find the longest match first. Additionally, a break should be added after the cmp_off loop to exit the ngram_len loop once a match is found, avoiding redundant computation on shorter N-grams after a longer match has already been identified.

```cpp
for (int32_t ngram_len = this->max_n_val; ngram_len >= this->min_n_val; --ngram_len) {
    if (ngram_len > nt) continue;
```

```cpp
// Scalar path: multiple rows reside in the UB; read element by element with GetValue
int32_t row_base = static_cast<int32_t>(idx) * static_cast<int32_t>(msa);

for (int32_t ngram_len = this->min_n_val; ngram_len <= this->max_n_val; ++ngram_len) {
    if (ngram_len > num_tokens_tmp) break;

    for (int32_t s = 0; s < ngram_len; ++s) {
        suffixLocal.SetValue(static_cast<uint32_t>(s),
                             tokenLocal.GetValue(static_cast<uint32_t>(
                                 row_base + num_tokens_tmp - ngram_len + s)));
    }

    int32_t max_pos = num_tokens_tmp - ngram_len - 1;
    for (int32_t pos = 0; pos <= max_pos; ++pos) {
        bool match = true;
        for (int32_t s = 0; s < ngram_len; ++s) {
            if (tokenLocal.GetValue(static_cast<uint32_t>(row_base + pos + s))
                    != suffixLocal.GetValue(static_cast<uint32_t>(s))) {
                match = false;
                break;
            }
        }
        if (match) {
            best_match_pos = pos;
            best_ngram_len = ngram_len;
            break;
        }
    }
}
```

The kernel falls back to a slow scalar path whenever block_rows > 1. AI Cores are not optimized for scalar loops, and this fallback will significantly degrade performance for batches where multiple rows fit into the Unified Buffer (e.g., shorter sequence lengths). The SIMD path using CompareScalar should be used in all cases by correctly calculating the row offset in tokenLocal as idx * msa. This would unify the logic and ensure high performance across all sequence lengths.
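The first two fixes above — iterating from the longest N-gram down, and clamping the suffix offset to the number of tokens actually loaded — can be sketched as plain host-side C++. This is a minimal illustration, not the kernel's code: the function and variable names (FindLongestMatch, suffix_local) are assumptions for the sketch.

```cpp
#include <cstdint>
#include <vector>

// Sketch of the suggested matching order: try ngram_len from max_n down to
// min_n so the longest repeated suffix wins, and compute the suffix offset
// relative to the number of tokens actually loaded (min(nt, max_n)), not a
// fixed max_n. Returns the earliest match position, or -1 if none.
static int32_t FindLongestMatch(const std::vector<int32_t>& tokens,
                                int32_t min_n, int32_t max_n,
                                int32_t* matched_len) {
    const int32_t nt = static_cast<int32_t>(tokens.size());
    // suffixLocal analogue: the last min(nt, max_n) tokens of the row.
    const int32_t suffix_len = (nt < max_n) ? nt : max_n;
    const std::vector<int32_t> suffix_local(tokens.end() - suffix_len, tokens.end());

    for (int32_t ngram_len = max_n; ngram_len >= min_n; --ngram_len) {
        if (ngram_len > nt) continue;  // too long for this row; try a shorter n-gram
        // Clamped offset: where the length-ngram_len suffix starts in suffix_local.
        const int32_t suffix_offset = suffix_len - ngram_len;
        const int32_t max_pos = nt - ngram_len - 1;  // exclude the trailing self-match
        for (int32_t pos = 0; pos <= max_pos; ++pos) {
            bool match = true;
            for (int32_t s = 0; s < ngram_len; ++s) {
                if (tokens[pos + s] != suffix_local[suffix_offset + s]) {
                    match = false;
                    break;
                }
            }
            if (match) {  // first hit at this ngram_len is the longest match overall
                *matched_len = ngram_len;
                return pos;
            }
        }
    }
    *matched_len = 0;
    return -1;  // no n-gram of any allowed length repeats earlier in the row
}
```

With the original min-to-max order, a row such as [1, 2, 3, 1, 2, 3, 1, 2] would stop at a 2-gram match; the max-to-min order finds the 3-gram repeat instead, yielding more speculative tokens per step.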
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: kx <1670186653@qq.com>
Signed-off-by: HF-001 <1670186653@qq.com>
Force-pushed 8fe6b90 to 6ed8f5e
Signed-off-by: HF-001 <1670186653@qq.com>
@wangxiyuan @weijinqian0 Hi, this PR is ready. Would it be convenient for you to take a look?
Force-pushed 54b3389 to d1d5e0e
What this PR does / why we need it?
Refer to #7311. Building on that previous PR, performance was improved by optimizing the operators.
How was this patch tested?
ngram_gpu + async_scheduling script:

```shell
vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 \
    --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
    --max-model-len 32768 --trust-remote-code \
    --no-enable-prefix-caching \
    --async-scheduling \
    --speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'
```
ngram script:

```shell
vllm serve /model/Qwen3-14B --port 8898 --dtype bfloat16 \
    --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
    --max-model-len 32768 --trust-remote-code \
    --no-enable-prefix-caching \
    --speculative_config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'
```
test script:

```shell
vllm bench serve \
    --port 8898 \
    --backend vllm \
    --model /model/Qwen3-14B \
    --endpoint /v1/completions \
    --dataset-name sonnet \
    --dataset-path /model/vllm/benchmarks/sonnet.txt \
    --request-rate 2.0 \
    --sonnet-input-len 128 \
    --sonnet-output-len 100 \
    --sonnet-prefix-len 10 \
    --num-prompts 40 \
    --ignore-eos \
    --percentile-metrics "ttft,tpot,itl,e2el"
```
test results:
- ngram + sync: TTFT 330 ms, TPOT 39 ms
- ngram + async (this PR): TTFT 136 ms, TPOT 33 ms
In cooperation with @CarterDuan
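For reviewers unfamiliar with the ngram/ngram_gpu method: it is plain prompt-lookup speculation — find where the current context suffix last repeated, and propose the tokens that followed it as the draft. A minimal CPU reference of that idea (the kind of golden model the E2E tests compare the kernel against; names here are hypothetical, not the PR's test code) might look like:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for n-gram (prompt-lookup) speculation:
// search for the longest suffix of `tokens` (between min_n and max_n long)
// that also occurs earlier, and return up to num_spec tokens that followed
// that earlier occurrence as the draft proposal.
static std::vector<int32_t> ProposeDraftTokens(const std::vector<int32_t>& tokens,
                                               int32_t min_n, int32_t max_n,
                                               int32_t num_spec) {
    const int32_t nt = static_cast<int32_t>(tokens.size());
    for (int32_t n = std::min(max_n, nt); n >= min_n; --n) {  // longest match first
        const int32_t* suffix = tokens.data() + nt - n;
        for (int32_t pos = 0; pos + n < nt; ++pos) {          // exclude the suffix itself
            bool match = true;
            for (int32_t s = 0; s < n; ++s) {
                if (tokens[pos + s] != suffix[s]) { match = false; break; }
            }
            if (!match) continue;
            // Draft tokens are whatever followed the matched n-gram last time.
            std::vector<int32_t> draft;
            for (int32_t t = pos + n; t < nt && t < pos + n + num_spec; ++t) {
                draft.push_back(tokens[t]);
            }
            return draft;
        }
    }
    return {};  // no match: propose nothing this step
}
```

In the benchmark above, prompt_lookup_min/prompt_lookup_max play the role of min_n/max_n, and num_speculative_tokens the role of num_spec.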