[feat] ngram NPU implementation compatible with async scheduler #7311
HF-001 wants to merge 9 commits into vllm-project:main from
Conversation
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Summary of Changes (Gemini Code Assist): This pull request integrates an NPU-optimized version of ngram speculative decoding, significantly improving performance metrics such as TTFT and TPOT. The changes ensure seamless operation with the asynchronous scheduler by introducing a dedicated NPU proposer and refining data-handling mechanisms, improving the overall efficiency of speculative decoding on NPU hardware.
Code Review
This pull request introduces NPU compatibility for the ngram speculative decoding method, integrating AscendNgramProposerNPU into the system. The changes involve adding the new NPU proposer, updating type hints, and modifying the propose_draft_token_ids logic in model_runner_v1.py to handle the NPU-specific implementation. Additionally, a copy mechanism for scheduler_output is introduced when ngram_gpu is used with async scheduling, and the num_tokens_no_spec initialization in npu_input_batch.py is adjusted for better NPU compatibility. While the overall approach is clear, there are several critical areas that require attention to ensure full NPU compatibility and correct functionality.
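The scheduler_output copy mechanism mentioned above can be illustrated in isolation. This is a minimal sketch under stated assumptions: `SchedulerOutput` here is a toy stand-in, not vLLM's real class, and whether the real code performs a shallow or deep copy is an assumption.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SchedulerOutput:
    # Toy stand-in for vLLM's scheduler output; only one field for brevity.
    num_scheduled_tokens: dict = field(default_factory=dict)


def snapshot_for_drafter(sched_out: SchedulerOutput) -> SchedulerOutput:
    # Copy before the async step mutates the original, so the ngram_gpu
    # drafter reads a stable view of the schedule.
    return copy.deepcopy(sched_out)


original = SchedulerOutput(num_scheduled_tokens={"req-0": 4})
snap = snapshot_for_drafter(original)
original.num_scheduled_tokens["req-0"] = 8  # later async mutation
```

The point is that without the copy, the drafter and the async scheduler would race on the same mutable object.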
vllm_ascend/worker/model_runner_v1.py (992-1003)
The update_token_ids_ngram method is called on self.drafter (which is AscendNgramProposerNPU). However, AscendNgramProposerNPU does not define this method, meaning it will fall back to NgramProposerGPU.update_token_ids_ngram. If NgramProposerGPU.update_token_ids_ngram contains GPU-specific kernels or operations (e.g., CUDA-specific tensor operations, stream management) that are not NPU-compatible, this will lead to runtime failures. Please verify the compatibility of this method with NPU or provide an NPU-specific implementation in AscendNgramProposerNPU.
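The fallback behavior described here is plain Python method resolution, which a toy sketch makes concrete. The class bodies below are hypothetical stand-ins, not the real proposer code:

```python
class NgramProposerGPU:
    def update_token_ids_ngram(self, token_ids):
        # Stand-in for GPU-specific work (e.g. CUDA tensor ops, streams).
        return ("gpu-path", list(token_ids))


class ProposerWithoutOverride(NgramProposerGPU):
    pass  # lookups fall through to the GPU parent via the MRO


class ProposerWithNpuOverride(NgramProposerGPU):
    def update_token_ids_ngram(self, token_ids):
        # Hypothetical NPU-safe replacement path.
        return ("npu-path", list(token_ids))
```

A subclass that defines no override silently inherits the parent's device-specific path, which is exactly the risk flagged for `AscendNgramProposerNPU`.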
vllm_ascend/spec_decode/ngram_proposer_npu.py (14-27)
The dummy_run method in AscendNgramProposerNPU is currently a no-op. If the parent class NgramProposerGPU's dummy_run contains essential logic for setup, graph capturing, or profiling that is expected by the NPU implementation, overriding it with pass could lead to runtime errors or incorrect behavior. Please ensure that the NPU implementation does not require any specific initialization or dummy computation during this phase, or implement the necessary NPU-specific logic here.
vllm_ascend/worker/model_runner_v1.py (1010-1015)
The propose method in AscendNgramProposerNPU directly calls super().propose, which means NgramProposerGPU.propose will be executed. Similar to update_token_ids_ngram, if NgramProposerGPU.propose contains GPU-specific logic or kernel calls that are not NPU-compatible, this will cause issues. Please ensure that the underlying NgramProposerGPU.propose method is fully compatible with NPU operations or provide an NPU-specific override.
vllm_ascend/worker/model_runner_v1.py (1021-1027)
The copy_num_valid_draft_tokens function is imported from vllm.v1.spec_decode.ngram_proposer_gpu and performs an "Async D2H copy on a dedicated stream." While _torch_cuda_wrapper attempts to patch torch.cuda calls to torch.npu, it is crucial to verify that this specific function correctly uses NPU streams/events or is otherwise NPU-compatible. If it directly uses torch.cuda without going through the patched aliases, it will fail. Please confirm its NPU compatibility.
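The patching caveat can be illustrated without real hardware. Below is a sketch using `SimpleNamespace` stand-ins for `torch.cuda` / `torch.npu`; the names and behavior are illustrative assumptions, not the real `_torch_cuda_wrapper` code:

```python
import types

# Stand-ins for torch.cuda and torch.npu.
fake_cuda = types.SimpleNamespace(Stream=lambda: "cuda-stream")
fake_npu = types.SimpleNamespace(Stream=lambda: "npu-stream")
torch_like = types.SimpleNamespace(cuda=fake_cuda, npu=fake_npu)


# A caller that looks the attribute up at call time gets redirected...
def make_stream_via_attr(t):
    return t.cuda.Stream()


# ...but a reference bound before patching escapes the alias, which is
# the failure mode the review comment warns about.
bound_stream = torch_like.cuda.Stream

torch_like.cuda = torch_like.npu  # the patch: alias cuda -> npu
```

So the wrapper only helps code that resolves `torch.cuda.*` dynamically after the patch; anything that captured a direct reference earlier still runs the CUDA path.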
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
@wangxiyuan hi, the vLLM version used by CI is too old, which causes a CI error:
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
@wangxiyuan @weijinqian0 hi, this PR is ready. Could you help review and merge it?
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
Signed-off-by: 01267596 <xiongkai123@cmbchina.com>
Signed-off-by: kx <1670186653@qq.com>
What this PR does / why we need it?
Referring to vllm-project/vllm#29184, this PR implements an NPU version of ngram speculative decoding and makes it compatible with the async scheduler.
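For context, the core of ngram (prompt-lookup) speculation fits in a few lines of pure Python. The parameter names mirror `prompt_lookup_min`, `prompt_lookup_max`, and `num_speculative_tokens` from the serve configs in this PR, but the function itself is an illustrative assumption, not vLLM's implementation:

```python
def ngram_propose(token_ids, prompt_lookup_min, prompt_lookup_max,
                  num_speculative_tokens):
    """Propose draft tokens by matching the trailing n-gram against
    earlier context and copying the tokens that followed it."""
    # Try the longest trailing n-gram first, down to the shortest.
    for n in range(prompt_lookup_max, prompt_lookup_min - 1, -1):
        if len(token_ids) <= n:
            continue
        tail = token_ids[-n:]
        # Scan right-to-left so the most recent earlier match wins.
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == tail:
                draft = token_ids[start + n:start + n + num_speculative_tokens]
                if draft:
                    return draft
    return []  # no match: fall back to normal decoding
```

With the config used below (`prompt_lookup_min/max` of 2, 3 speculative tokens), `ngram_propose([1, 2, 3, 4, 1, 2, 3], 2, 2, 3)` finds the earlier `[2, 3]` and proposes `[4, 1, 2]`.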
How was this patch tested?
ngram_gpu + async_scheduling script:
vllm serve /model/Qwen3-1.7B --port 8898 --dtype bfloat16
--tensor-parallel-size 1 --gpu-memory-utilization 0.8
--max-model-len 32768 --trust-remote-code
--no-enable-prefix-caching
--async-scheduling
--speculative_config '{"method": "ngram_gpu", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'
ngram script:
vllm serve /model/Qwen3-1.7B --port 8898 --dtype bfloat16
--tensor-parallel-size 1 --gpu-memory-utilization 0.8
--max-model-len 32768 --trust-remote-code
--no-enable-prefix-caching
--speculative_config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 2,"prompt_lookup_min": 2}'
test script:
vllm bench serve
--port 8898
--backend vllm
--model /model/Qwen3-1.7B
--endpoint /v1/completions
--dataset-name sonnet
--dataset-path /model/vllm/benchmarks/sonnet.txt
--request-rate 2.0
--sonnet-input-len 128
--sonnet-output-len 100
--sonnet-prefix-len 10
--num-prompts 40
--ignore-eos
--percentile-metrics "ttft,tpot,itl,e2el"
test results:
On Qwen3-1.7B, TTFT -67.33% (269.53 ms -> 88.05 ms)
On Qwen3-14B, TTFT -48.24% (284.37 ms -> 147.25 ms)