[feat] ngram NPU implementation compatible with async scheduler #7311
HF-001 wants to merge 9 commits into vllm-project:main from
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Summary of Changes (Gemini Code Assist): This pull request integrates an NPU-optimized version of ngram speculative decoding, significantly improving performance metrics such as TTFT and TPOT. The changes ensure seamless operation with the asynchronous scheduler by introducing a dedicated NPU proposer and refining data-handling mechanisms, improving the overall efficiency of speculative decoding on NPU hardware.
Code Review
This pull request introduces NPU compatibility for the ngram speculative decoding method, integrating AscendNgramProposerNPU into the system. The changes involve adding the new NPU proposer, updating type hints, and modifying the propose_draft_token_ids logic in model_runner_v1.py to handle the NPU-specific implementation. Additionally, a copy mechanism for scheduler_output is introduced when ngram_gpu is used with async scheduling, and the num_tokens_no_spec initialization in npu_input_batch.py is adjusted for better NPU compatibility. While the overall approach is clear, there are several critical areas that require attention to ensure full NPU compatibility and correct functionality.
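The scheduler_output copy mechanism mentioned above can be illustrated in isolation. This is a minimal sketch under stated assumptions: `SchedulerOutput` here is a toy stand-in, not vLLM's real class, and whether the real code performs a shallow or deep copy is an assumption.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SchedulerOutput:
    # Toy stand-in for vLLM's scheduler output; only one field for brevity.
    num_scheduled_tokens: dict = field(default_factory=dict)


def snapshot_for_drafter(sched_out: SchedulerOutput) -> SchedulerOutput:
    # Copy before the async step mutates the original, so the ngram_gpu
    # drafter reads a stable view of the schedule.
    return copy.deepcopy(sched_out)


original = SchedulerOutput(num_scheduled_tokens={"req-0": 4})
snap = snapshot_for_drafter(original)
original.num_scheduled_tokens["req-0"] = 8  # later async mutation
```

The point is that without the copy, the drafter and the async scheduler would race on the same mutable object.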
vllm_ascend/worker/model_runner_v1.py (992-1003)
The update_token_ids_ngram method is called on self.drafter (which is AscendNgramProposerNPU). However, AscendNgramProposerNPU does not define this method, meaning it will fall back to NgramProposerGPU.update_token_ids_ngram. If NgramProposerGPU.update_token_ids_ngram contains GPU-specific kernels or operations (e.g., CUDA-specific tensor operations, stream management) that are not NPU-compatible, this will lead to runtime failures. Please verify the compatibility of this method with NPU or provide an NPU-specific implementation in AscendNgramProposerNPU.
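The fallback behavior described here is plain Python method resolution, which a toy sketch makes concrete. The class bodies below are hypothetical stand-ins, not the real proposer code:

```python
class NgramProposerGPU:
    def update_token_ids_ngram(self, token_ids):
        # Stand-in for GPU-specific work (e.g. CUDA tensor ops, streams).
        return ("gpu-path", list(token_ids))


class ProposerWithoutOverride(NgramProposerGPU):
    pass  # lookups fall through to the GPU parent via the MRO


class ProposerWithNpuOverride(NgramProposerGPU):
    def update_token_ids_ngram(self, token_ids):
        # Hypothetical NPU-safe replacement path.
        return ("npu-path", list(token_ids))
```

A subclass that defines no override silently inherits the parent's device-specific path, which is exactly the risk flagged for `AscendNgramProposerNPU`.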
vllm_ascend/spec_decode/ngram_proposer_npu.py (14-27)
The dummy_run method in AscendNgramProposerNPU is currently a no-op. If the parent class NgramProposerGPU's dummy_run contains essential logic for setup, graph capturing, or profiling that is expected by the NPU implementation, overriding it with pass could lead to runtime errors or incorrect behavior. Please ensure that the NPU implementation does not require any specific initialization or dummy computation during this phase, or implement the necessary NPU-specific logic here.
vllm_ascend/worker/model_runner_v1.py (1010-1015)
The propose method in AscendNgramProposerNPU directly calls super().propose, which means NgramProposerGPU.propose will be executed. Similar to update_token_ids_ngram, if NgramProposerGPU.propose contains GPU-specific logic or kernel calls that are not NPU-compatible, this will cause issues. Please ensure that the underlying NgramProposerGPU.propose method is fully compatible with NPU operations or provide an NPU-specific override.
vllm_ascend/worker/model_runner_v1.py (1021-1027)
The copy_num_valid_draft_tokens function is imported from vllm.v1.spec_decode.ngram_proposer_gpu and performs an "Async D2H copy on a dedicated stream." While _torch_cuda_wrapper attempts to patch torch.cuda calls to torch.npu, it is crucial to verify that this specific function correctly uses NPU streams/events or is otherwise NPU-compatible. If it directly uses torch.cuda without going through the patched aliases, it will fail. Please confirm its NPU compatibility.
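The patching caveat can be illustrated without real hardware. Below is a sketch using `SimpleNamespace` stand-ins for `torch.cuda` / `torch.npu`; the names and behavior are illustrative assumptions, not the real `_torch_cuda_wrapper` code:

```python
import types

# Stand-ins for torch.cuda and torch.npu.
fake_cuda = types.SimpleNamespace(Stream=lambda: "cuda-stream")
fake_npu = types.SimpleNamespace(Stream=lambda: "npu-stream")
torch_like = types.SimpleNamespace(cuda=fake_cuda, npu=fake_npu)


# A caller that looks the attribute up at call time gets redirected...
def make_stream_via_attr(t):
    return t.cuda.Stream()


# ...but a reference bound before patching escapes the alias, which is
# the failure mode the review comment warns about.
bound_stream = torch_like.cuda.Stream

torch_like.cuda = torch_like.npu  # the patch: alias cuda -> npu
```

So the wrapper only helps code that resolves `torch.cuda.*` dynamically after the patch; anything that captured a direct reference earlier still runs the CUDA path.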
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@wangxiyuan hi, the vLLM version used by CI is too old, which causes a CI error:
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
@wangxiyuan @weijinqian0 hi, this PR is ready. Could you help review and merge it?
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
What this PR does / why we need it?
Referring to vllm-project/vllm#29184, this PR implements an NPU version of ngram speculative decoding and makes it compatible with the async scheduler.
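For context, the core of ngram (prompt-lookup) speculation fits in a few lines of pure Python. The parameter names mirror `prompt_lookup_min`, `prompt_lookup_max`, and `num_speculative_tokens` from the serve configs in this PR, but the function itself is an illustrative assumption, not vLLM's implementation:

```python
def ngram_propose(token_ids, prompt_lookup_min, prompt_lookup_max,
                  num_speculative_tokens):
    """Propose draft tokens by matching the trailing n-gram against
    earlier context and copying the tokens that followed it."""
    # Try the longest trailing n-gram first, down to the shortest.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(token_ids) <= n:
            continue
        tail = token_ids[-n:]
        # Scan right-to-left so the most recent earlier match wins.
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == tail:
                draft = token_ids[start + n:start + n + num_speculative_tokens]
                if draft:
                    return draft
    return []  # no match: fall back to normal decoding
```

With the config used below (`prompt_lookup_min/max` of 2, 3 speculative tokens), `ngram_propose([1, 2, 3, 4, 1, 2, 3], 2, 2, 3)` finds the earlier `[2, 3]` and proposes `[4, 1, 2]`.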
How was this patch tested?
ngram_gpu + async_scheduling script:
vllm serve /model/Qwen3-1.7B --port 8898 --dtype bfloat16
--tensor-parallel-size 1 --gpu-memory-utilization 0.8
--max-model-len 32768 --trust-remote-code
--no-enable-prefix-caching
--async-scheduling
--speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'
ngram script:
vllm serve /model/Qwen3-1.7B --port 8898 --dtype bfloat16
--tensor-parallel-size 1 --gpu-memory-utilization 0.8
--max-model-len 32768 --trust-remote-code
--no-enable-prefix-caching
--speculative_config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'
test script:
vllm bench serve
--port 8898
--backend vllm
--model /model/Qwen3-1.7B
--endpoint /v1/completions
--dataset-name sonnet
--dataset-path /model/vllm/benchmarks/sonnet.txt
--request-rate 2.0
--sonnet-input-len 128
--sonnet-output-len 100
--sonnet-prefix-len 10
--num-prompts 40
--ignore-eos
--percentile-metrics "ttft,tpot,itl,e2el"
test results:
On Qwen3-1.7B, TTFT -67.33% (269.53 ms -> 88.05 ms)
On Qwen3-14B, TTFT -48.24% (284.37 ms -> 147.25 ms)