
[main][refactor] Align spec_decode with vllm#6913

Draft
drslark wants to merge 1 commit into vllm-project:main from drslark:main

Conversation

@drslark
Contributor

@drslark drslark commented Mar 2, 2026

What this PR does / why we need it?

According to #6881, this PR follows vllm-project/vllm#32887.

  1. In the previous vllm-ascend code, the class structure was:
    +-----------------------------------+
    |    EagleProposer (vllm)           |  <---  (Base Class)
    +-----------------------------------+
                ^
                | 
                |
    +-----------------------------------+
    |    EagleProposer (vllm-ascend)    |  <--- (Sub class)
    +-----------------------------------+

And we want to change the structure to follow vllm:

          +---------------------------------------+
          |     SpecDecodeBaseProposer (vllm)     |
          +---------------------------------------+
                            ^
                            |
          +----------------------------------------------+
          |     SpecDecodeBaseProposer (vllm-ascend)     | 
          +----------------------------------------------+
             ^                                         ^
             |                                         |
      +------+                                         +------+
      |                                                       |
+--------------------------------+                     +-------------------------------------+
|  EagleProposer (vllm-ascend)   |                     |  DraftModelProposer (vllm-ascend)   |
+--------------------------------+                     +-------------------------------------+
  2. But unfortunately, there are many isinstance(proposer, EagleProposer) checks in the vllm code, and we don't want to patch them.

  3. Also, we don't want to copy and paste most of the EagleProposer and DraftModelProposer code.

Based on these requirements, we decided to:

  1. Follow the inheritance structure of vllm.
  2. Patch EagleProposer and DraftModelProposer to use ABCMeta as their metaclass (no logic is modified!), and use Python's virtual-subclass mechanism.
  3. Clone EagleProposer and DraftModelProposer without any redundant code, but with vllm-ascend's own base class SpecDecodeBaseProposer.
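The virtual-subclass mechanism in step 2 can be sketched roughly as follows; the class names here are stand-ins, not the actual vllm classes:

```python
import abc

class VllmEagleProposer(metaclass=abc.ABCMeta):
    """Stand-in for vllm's EagleProposer once patched to use ABCMeta."""

class AscendEagleProposer:
    """Stand-in for vllm-ascend's own proposer; note: no real inheritance."""

# Register as a *virtual* subclass: isinstance/issubclass checks in vllm
# code now succeed, yet the MRO of AscendEagleProposer is unchanged.
VllmEagleProposer.register(AscendEagleProposer)

proposer = AscendEagleProposer()
print(isinstance(proposer, VllmEagleProposer))           # True
print(VllmEagleProposer in AscendEagleProposer.__mro__)  # False
```

This is why no logic needs to change in the registered class: the check passes without touching its inheritance chain.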

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on refactoring the speculative decoding mechanism to enhance its alignment and compatibility with the main vLLM codebase. It introduces a more structured approach to class inheritance and method organization, particularly for the proposer components. Additionally, it improves the handling of rejected tokens within the Triton kernel, which is crucial for the efficiency and correctness of speculative decoding. The changes also include dynamic class manipulation utilities to facilitate seamless integration and patching of upstream classes.

Highlights

  • Speculative Decoding Refactoring: The core speculative decoding logic has been refactored to align with the upstream vLLM project, primarily by renaming EagleProposer to SpecDecodeBaseProposer and encapsulating input preparation into a new set_inputs_first_pass method.
  • Dynamic Class Manipulation Utilities: New utility functions (add_abc_meta, make_cls, fix_new_class_closures) were introduced in vllm_ascend/utils.py to dynamically modify class structures, enabling better compatibility and inheritance with upstream vLLM classes.
  • Rejected Token Handling in Triton Kernel: The Triton kernel for prepare_inputs_padded and its corresponding test have been updated to explicitly handle and store num_rejected_tokens_gpu, improving the accuracy of speculative decoding.
  • Conditional Hidden State Passing: The _run_merged_draft method now conditionally passes hidden_states to the model based on a new pass_hidden_states_to_model flag, offering more flexible model invocation.
  • Lazy Import for Model Runner: The NPUModelRunner import in vllm_ascend/worker/worker.py was changed to a lazy import to ensure that patches are effectively applied.


Changelog
  • tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_prepare_inputs_padded.py
    • Updated test to include num_rejected_tokens_tri in Triton kernel call.
  • vllm_ascend/ops/triton/spec_decode/utils.py
    • Modified prepare_inputs_padded_kernel to accept and store num_rejected_tokens_gpu.
  • vllm_ascend/patch/worker/init.py
    • Imported the new patch_spec_decode module.
  • vllm_ascend/patch/worker/patch_spec_decode.py
    • Added a new file to dynamically patch vllm_eagle.EagleProposer with ABCMeta and fix class closures.
  • vllm_ascend/spec_decode/draft_model.py
    • Added a new file to create and register a DraftModelProposer by reusing VllmDraftModelProposer functions.
  • vllm_ascend/spec_decode/eagle_proposer.py
    • Refactored EagleProposer into SpecDecodeBaseProposer.
    • Extracted input preparation logic into a new set_inputs_first_pass method.
    • Updated model invocation logic in _run_merged_draft to conditionally pass hidden_states.
    • Dynamically created and registered EagleProposer using make_cls and fix_new_class_closures.
  • vllm_ascend/utils.py
    • Added utility functions add_abc_meta, make_cls, and fix_new_class_closures for dynamic class manipulation.
  • vllm_ascend/worker/model_runner_v1.py
    • Modified propose_draft_token_ids to pass num_rejected_tokens_gpu to the drafter.
  • vllm_ascend/worker/worker.py
    • Changed NPUModelRunner import to be lazy for patch effectiveness.
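One subtlety likely behind a helper such as fix_new_class_closures (the PR's real implementation may differ; this is an assumed sketch): a method that uses zero-argument super() carries a hidden `__class__` closure cell, so grafting it onto a dynamically created class requires rebuilding that cell:

```python
import types

class Base:
    def greet(self):
        return "base"

class Original(Base):
    def greet(self):
        # Zero-arg super() reads the hidden __class__ closure cell; after a
        # naive copy into another class it would still point at Original.
        return "original+" + super().greet()

def clone_with_fixed_closure(func, new_cls):
    """Copy func, replacing its __class__ cell so super() targets new_cls."""
    freevars = func.__code__.co_freevars
    closure = list(func.__closure__ or ())
    if "__class__" in freevars:
        closure[freevars.index("__class__")] = types.CellType(new_cls)
    return types.FunctionType(func.__code__, func.__globals__,
                              func.__name__, func.__defaults__,
                              tuple(closure))

# Dynamically create a new class and graft the fixed method onto it.
Clone = type("Clone", (Base,), {})
Clone.greet = clone_with_fixed_closure(Original.greet, Clone)
print(Clone().greet())  # original+base
```

Without the cell fix, `super()` inside the copied method would resolve against the original class hierarchy instead of the new one.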

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the speculative decoding implementation to align it with the upstream vLLM library. The changes are quite extensive, introducing a new SpecDecodeBaseProposer and using dynamic class creation to reuse upstream logic while injecting custom behavior. This alignment also involves adding support for returning the number of rejected tokens from the prepare_inputs_padded kernel and plumbing this information through the model runner.

My review identified a critical bug in vllm_ascend/spec_decode/eagle_proposer.py that would cause an UnboundLocalError under certain conditions. The rest of the changes appear to be consistent with the refactoring goal. Please address the identified issue.

Comment thread vllm_ascend/spec_decode/eagle_proposer.py
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@drslark drslark force-pushed the main branch 10 times, most recently from 4cbba31 to a7e97ee Compare March 3, 2026 02:45
Signed-off-by: drslark <slarksblood@qq.com>
@drslark drslark marked this pull request as draft March 3, 2026 03:42
@github-actions
Contributor

github-actions bot commented Mar 3, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.
