
[feat] ngram npu implementation compatible with async scheduler #7311

Open
HF-001 wants to merge 9 commits into vllm-project:main from
HF-001:ngram_aync_dev

Conversation

@HF-001
Contributor

HF-001 commented Mar 16, 2026

What this PR does / why we need it?

Referring to vllm-project/vllm#29184, this PR implements an NPU version of ngram speculative decoding and makes it compatible with the Async Scheduler.

How was this patch tested?

ngram_gpu + async_scheduling script:

vllm serve /model/Qwen3-1.7B --port 8898 --dtype bfloat16 \
  --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --trust-remote-code \
  --no-enable-prefix-caching \
  --async-scheduling \
  --speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'

ngram script:

vllm serve /model/Qwen3-1.7B --port 8898 --dtype bfloat16 \
  --tensor-parallel-size 1 --gpu-memory-utilization 0.8 \
  --max-model-len 32768 --trust-remote-code \
  --no-enable-prefix-caching \
  --speculative_config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 2, "prompt_lookup_min": 2}'

test script:

vllm bench serve \
  --port 8898 \
  --backend vllm \
  --model /model/Qwen3-1.7B \
  --endpoint /v1/completions \
  --dataset-name sonnet \
  --dataset-path /model/vllm/benchmarks/sonnet.txt \
  --request-rate 2.0 \
  --sonnet-input-len 128 \
  --sonnet-output-len 100 \
  --sonnet-prefix-len 10 \
  --num-prompts 40 \
  --ignore-eos \
  --percentile-metrics "ttft,tpot,itl,e2el"

test results:
on the Qwen3-1.7B model, TTFT -67.33% (269.53 ms -> 88.05 ms)
on the Qwen3-14B model, TTFT -48.24% (284.37 ms -> 147.25 ms)

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates an NPU-optimized version of Ngram speculative decoding into the system, significantly enhancing performance metrics like TTFT and TPOT. The changes ensure seamless operation with the asynchronous scheduler by introducing a dedicated NPU proposer and refining data handling mechanisms, thereby improving the overall efficiency of speculative decoding on NPU hardware.

Highlights

  • NPU-compatible Ngram Speculative Decoding: Introduced a new AscendNgramProposerNPU class to enable Ngram speculative decoding specifically optimized for NPU (Neural Processing Unit) environments.
  • Async Scheduler Compatibility: Ensured the new NPU-based Ngram speculative decoding method is fully compatible with the asynchronous scheduler, allowing for efficient concurrent request processing.
  • Performance Improvements: Achieved significant performance gains, with a 67.33% reduction in Time To First Token (TTFT) and a 22.3% increase in Time Per Output Token (TPOT) during testing.
  • Optimized Data Handling for NPU: Refactored num_tokens_no_spec in NPUInputBatch to use a pinned CPU tensor, improving data transfer efficiency for NPU operations.
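The pinned-tensor change highlighted above relies on the fact that a NumPy array produced from a CPU torch tensor is a view over the same storage. A minimal sketch of the idea (the real field is `num_tokens_no_spec` in `NPUInputBatch`; sizes and dtype here are illustrative, and pinning is attempted only when an accelerator backend is present, since CPU-only builds cannot allocate pinned memory):

```python
import torch

# Back a NumPy array with a (optionally pinned) CPU tensor. Pinned host
# memory enables faster, async host-to-device copies on the accelerator.
pin = torch.cuda.is_available()  # fall back to pageable memory on CPU-only builds
backing = torch.zeros(8, dtype=torch.int32, pin_memory=pin)

# .numpy() returns a view sharing the tensor's storage: cheap scalar writes
# via NumPy, while the tensor itself stays usable for device copies.
num_tokens_no_spec = backing.numpy()
num_tokens_no_spec[3] = 42
print(backing[3].item())  # 42 — the write through NumPy is visible in the tensor
```

Because both objects alias one buffer, per-request bookkeeping can stay in fast NumPy code while the transfer path hands the tensor straight to the copy engine.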


Changelog
  • vllm_ascend/spec_decode/__init__.py
    • Imported AscendNgramProposerNPU.
    • Registered ngram_gpu as a new speculative decoding method, returning an instance of AscendNgramProposerNPU.
  • vllm_ascend/spec_decode/ngram_proposer_npu.py
    • Added a new file defining AscendNgramProposerNPU.
    • Implemented AscendNgramProposerNPU inheriting from NgramProposerGPU.
    • Overrode load_model and dummy_run methods with no-op implementations.
    • Overrode the propose method to call the superclass's propose method.
  • vllm_ascend/worker/model_runner_v1.py
    • Imported replace from dataclasses for efficient object copying.
    • Imported copy_num_valid_draft_tokens from vllm.v1.spec_decode.ngram_proposer_gpu.
    • Imported AscendNgramProposerNPU.
    • Updated the type hint for self.drafter to include AscendNgramProposerNPU.
    • Added a new conditional block in propose_draft_token_ids to handle AscendNgramProposerNPU, including updating token IDs, proposing drafts, caching valid draft counts, and performing asynchronous D2H copy.
    • Introduced logic in execute_model to create a copy of scheduler_output using replace when ngram_gpu is used, preventing modification side effects with the async scheduler.
    • Modified the use_padded_batch condition to include self.speculative_config.use_ngram_gpu() for determining when to use GPU-sampled tokens as inputs for speculative decoding.
  • vllm_ascend/worker/npu_input_batch.py
    • Modified self.num_tokens_no_spec to be a NumPy array backed by a torch.Tensor with pin_memory=True, optimizing data transfer for NPU operations.
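Per the changelog above, `ngram_proposer_npu.py` is essentially a thin subclass. A self-contained sketch with a stand-in parent (the real parent is `vllm.v1.spec_decode.ngram_proposer_gpu.NgramProposerGPU`, whose actual method signatures differ; this only illustrates the no-op-override pattern):

```python
# Stand-in for NgramProposerGPU; the real class lives in
# vllm.v1.spec_decode.ngram_proposer_gpu with richer signatures.
class NgramProposerGPU:
    def load_model(self, *args, **kwargs):
        raise NotImplementedError("GPU-specific model loading")

    def dummy_run(self, *args, **kwargs):
        raise NotImplementedError("GPU-specific warmup")

    def propose(self, token_ids):
        # Placeholder for the real ngram-lookup kernel.
        return token_ids[-2:]


class AscendNgramProposerNPU(NgramProposerGPU):
    """NPU variant: ngram proposing loads no model weights, so the
    load/warmup hooks become no-ops while propose defers to the parent."""

    def load_model(self, *args, **kwargs):
        pass  # no-op: nothing to load for ngram proposing

    def dummy_run(self, *args, **kwargs):
        pass  # no-op: no graph capture or warmup on this path

    def propose(self, token_ids):
        return super().propose(token_ids)


drafter = AscendNgramProposerNPU()
drafter.load_model()                  # no-op instead of GPU setup
drafter.dummy_run()                   # no-op instead of GPU warmup
print(drafter.propose([1, 2, 3, 4]))  # [3, 4]
```

This pattern keeps the shared proposing logic in one place while neutralizing the GPU-only lifecycle hooks, which is also why the review below focuses on whether the inherited methods are themselves NPU-safe.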
Activity
  • The author, HF-001, implemented the NPU version of ngram speculative decoding, ensuring compatibility with the Async Scheduler.
  • The changes were tested using both ngram_gpu + async_scheduling and ngram scripts with vllm serve and vllm bench serve.
  • Test results showed significant performance improvements: TTFT decreased by 67.33% (from 269.53 ms to 88.05 ms) and TPOT increased by 22.3% (from 18.86 ms to 23.07 ms).
  • The author plans to further optimize the NgramGPUKernel logic in the future to enhance NPU performance.
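The `scheduler_output` copy noted in the changelog can be sketched with `dataclasses.replace`, which builds a new instance sharing unchanged fields; swapping in a fresh container for the mutated field keeps the async scheduler's original object untouched. Field names below are hypothetical — the real vLLM `SchedulerOutput` has many more fields:

```python
from dataclasses import dataclass, field, replace


@dataclass
class SchedulerOutput:
    # Hypothetical fields for illustration only.
    num_scheduled_tokens: dict = field(default_factory=dict)
    total_num_scheduled_tokens: int = 0


original = SchedulerOutput(num_scheduled_tokens={"req-0": 4},
                           total_num_scheduled_tokens=4)

# Shallow copy with one field replaced by a fresh dict: mutating the copy
# no longer aliases the original, avoiding side effects under async scheduling.
copied = replace(original,
                 num_scheduled_tokens=dict(original.num_scheduled_tokens))
copied.num_scheduled_tokens["req-0"] = 7

print(original.num_scheduled_tokens["req-0"])  # 4 — original is unchanged
```

Note that `replace` alone is shallow; the explicit `dict(...)` is what actually decouples the mutable field.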

Contributor

gemini-code-assist bot left a comment

Code Review

This pull request introduces NPU compatibility for the ngram speculative decoding method, integrating AscendNgramProposerNPU into the system. The changes involve adding the new NPU proposer, updating type hints, and modifying the propose_draft_token_ids logic in model_runner_v1.py to handle the NPU-specific implementation. Additionally, a copy mechanism for scheduler_output is introduced when ngram_gpu is used with async scheduling, and the num_tokens_no_spec initialization in npu_input_batch.py is adjusted for better NPU compatibility. While the overall approach is clear, there are several critical areas that require attention to ensure full NPU compatibility and correct functionality.

I am having trouble creating individual review comments. Click here to see my feedback.

vllm_ascend/worker/model_runner_v1.py (992-1003)

critical

The update_token_ids_ngram method is called on self.drafter (which is AscendNgramProposerNPU). However, AscendNgramProposerNPU does not define this method, meaning it will fall back to NgramProposerGPU.update_token_ids_ngram. If NgramProposerGPU.update_token_ids_ngram contains GPU-specific kernels or operations (e.g., CUDA-specific tensor operations, stream management) that are not NPU-compatible, this will lead to runtime failures. Please verify the compatibility of this method with NPU or provide an NPU-specific implementation in AscendNgramProposerNPU.

vllm_ascend/spec_decode/ngram_proposer_npu.py (14-27)

high

The dummy_run method in AscendNgramProposerNPU is currently a no-op. If the parent class NgramProposerGPU's dummy_run contains essential logic for setup, graph capturing, or profiling that is expected by the NPU implementation, overriding it with pass could lead to runtime errors or incorrect behavior. Please ensure that the NPU implementation does not require any specific initialization or dummy computation during this phase, or implement the necessary NPU-specific logic here.

vllm_ascend/worker/model_runner_v1.py (1010-1015)

high

The propose method in AscendNgramProposerNPU directly calls super().propose, which means NgramProposerGPU.propose will be executed. Similar to update_token_ids_ngram, if NgramProposerGPU.propose contains GPU-specific logic or kernel calls that are not NPU-compatible, this will cause issues. Please ensure that the underlying NgramProposerGPU.propose method is fully compatible with NPU operations or provide an NPU-specific override.

vllm_ascend/worker/model_runner_v1.py (1021-1027)

high

The copy_num_valid_draft_tokens function is imported from vllm.v1.spec_decode.ngram_proposer_gpu and performs an "Async D2H copy on a dedicated stream." While _torch_cuda_wrapper attempts to patch torch.cuda calls to torch.npu, it is crucial to verify that this specific function correctly uses NPU streams/events or is otherwise NPU-compatible. If it directly uses torch.cuda without going through the patched aliases, it will fail. Please confirm its NPU compatibility.
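The `_torch_cuda_wrapper` concern above is about attribute-level patching. A toy sketch of the idea, using stub namespaces instead of real torch/torch_npu modules (whether the real wrapper covers every attribute the imported function touches is exactly what the review asks to verify):

```python
import types

# Stub "backends": a torch.cuda-style API and an npu equivalent.
npu = types.SimpleNamespace(
    Stream=lambda: "npu-stream",
    Event=lambda: "npu-event",
)
torch_stub = types.SimpleNamespace(cuda=types.SimpleNamespace())

# The wrapper idea: alias cuda attributes to their npu counterparts, so code
# written against torch.cuda transparently drives the npu backend...
for name in ("Stream", "Event"):
    setattr(torch_stub.cuda, name, getattr(npu, name))

print(torch_stub.cuda.Stream())  # npu-stream

# ...but only for attributes the wrapper actually patches: a call to an
# unpatched name still fails, which is the review's concern.
print(hasattr(torch_stub.cuda, "current_stream"))  # False
```

If `copy_num_valid_draft_tokens` reaches for a `torch.cuda` attribute outside the patched set, it would fail at runtime on NPU even though the patched calls work.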

Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@HF-001
Contributor Author

HF-001 commented Mar 16, 2026

@wangxiyuan hi, the vLLM version in CI is too old, which causes a CI error:
Error: vllm_ascend/spec_decode/ngram_proposer_npu.py:2: error: Cannot find implementation or library stub for module named "vllm.v1.spec_decode.ngram_proposer_gpu"

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@HF-001
Contributor Author

HF-001 commented Mar 24, 2026

@wangxiyuan @weijinqian0 hi, this PR is ready. Could you help review and merge it?

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

01267596 and others added 2 commits March 26, 2026 01:20
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
