
[main][refactor] Align spec_decode with vllm#6913

Draft
drslark wants to merge 1 commit into vllm-project:main from drslark:main

Conversation

@drslark
Contributor

@drslark drslark commented Mar 2, 2026

What this PR does / why we need it?

According to #6881, this PR follows vllm-project/vllm#32887.

  1. In the previous vllm-ascend code, the class structure was:
    +-----------------------------------+
    |    EagleProposer (vllm)           |  <---  (Base Class)
    +-----------------------------------+
                ^
                | 
                |
    +-----------------------------------+
    |    EagleProposer (vllm-ascend)    |  <--- (Sub class)
    +-----------------------------------+

And we want to change the structure to follow vllm:

          +---------------------------------------+
          |     SpecDecodeBaseProposer (vllm)     |
          +---------------------------------------+
                            ^
                            |
          +----------------------------------------------+
          |     SpecDecodeBaseProposer (vllm-ascend)     | 
          +----------------------------------------------+
             ^                                         ^
             |                                         |
      +------+                                         +------+
      |                                                       |
+--------------------------------+                     +-------------------------------------+
|  EagleProposer (vllm-ascend)   |                     |  DraftModelProposer (vllm-ascend)   |
+--------------------------------+                     +-------------------------------------+
  2. But unfortunately, there are many isinstance(proposer, EagleProposer) checks in the vllm code, and we don't want to patch them.

  3. Also, we don't want to copy and paste most of the EagleProposer and DraftModelProposer code.

Based on these requirements, we decided to:

  1. Follow the inheritance structure of vllm.
  2. Patch EagleProposer and DraftModelProposer to use ABCMeta as their metaclass (no logic is modified!), and use Python's virtual-subclass mechanism.
  3. Clone EagleProposer and DraftModelProposer without any redundant code, but with vllm-ascend's own base class SpecDecodeBaseProposer.
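The virtual-subclass mechanism in step 2 can be sketched roughly as follows; the class names here are stand-ins, not the actual vllm classes:

```python
import abc

class VllmEagleProposer(metaclass=abc.ABCMeta):
    """Stand-in for vllm's EagleProposer once patched to use ABCMeta."""

class AscendEagleProposer:
    """Stand-in for vllm-ascend's own proposer; note: no real inheritance."""

# Register as a *virtual* subclass: isinstance/issubclass checks in vllm
# code now succeed, yet the MRO of AscendEagleProposer is unchanged.
VllmEagleProposer.register(AscendEagleProposer)

proposer = AscendEagleProposer()
print(isinstance(proposer, VllmEagleProposer))           # True
print(VllmEagleProposer in AscendEagleProposer.__mro__)  # False
```

This is why no logic needs to change in the registered class: the check passes without touching its inheritance chain.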

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on refactoring the speculative decoding mechanism to enhance its alignment and compatibility with the main vLLM codebase. It introduces a more structured approach to class inheritance and method organization, particularly for the proposer components. Additionally, it improves the handling of rejected tokens within the Triton kernel, which is crucial for the efficiency and correctness of speculative decoding. The changes also include dynamic class manipulation utilities to facilitate seamless integration and patching of upstream classes.

Highlights

  • Speculative Decoding Refactoring: The core speculative decoding logic has been refactored to align with the upstream vLLM project, primarily by renaming EagleProposer to SpecDecodeBaseProposer and encapsulating input preparation into a new set_inputs_first_pass method.
  • Dynamic Class Manipulation Utilities: New utility functions (add_abc_meta, make_cls, fix_new_class_closures) were introduced in vllm_ascend/utils.py to dynamically modify class structures, enabling better compatibility and inheritance with upstream vLLM classes.
  • Rejected Token Handling in Triton Kernel: The Triton kernel for prepare_inputs_padded and its corresponding test have been updated to explicitly handle and store num_rejected_tokens_gpu, improving the accuracy of speculative decoding.
  • Conditional Hidden State Passing: The _run_merged_draft method now conditionally passes hidden_states to the model based on a new pass_hidden_states_to_model flag, offering more flexible model invocation.
  • Lazy Import for Model Runner: The NPUModelRunner import in vllm_ascend/worker/worker.py was changed to a lazy import to ensure that patches are effectively applied.


Changelog
  • tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_prepare_inputs_padded.py
    • Updated test to include num_rejected_tokens_tri in Triton kernel call.
  • vllm_ascend/ops/triton/spec_decode/utils.py
    • Modified prepare_inputs_padded_kernel to accept and store num_rejected_tokens_gpu.
  • vllm_ascend/patch/worker/init.py
    • Imported the new patch_spec_decode module.
  • vllm_ascend/patch/worker/patch_spec_decode.py
    • Added a new file to dynamically patch vllm_eagle.EagleProposer with ABCMeta and fix class closures.
  • vllm_ascend/spec_decode/draft_model.py
    • Added a new file to create and register a DraftModelProposer by reusing VllmDraftModelProposer functions.
  • vllm_ascend/spec_decode/eagle_proposer.py
    • Refactored EagleProposer into SpecDecodeBaseProposer.
    • Extracted input preparation logic into a new set_inputs_first_pass method.
    • Updated model invocation logic in _run_merged_draft to conditionally pass hidden_states.
    • Dynamically created and registered EagleProposer using make_cls and fix_new_class_closures.
  • vllm_ascend/utils.py
    • Added utility functions add_abc_meta, make_cls, and fix_new_class_closures for dynamic class manipulation.
  • vllm_ascend/worker/model_runner_v1.py
    • Modified propose_draft_token_ids to pass num_rejected_tokens_gpu to the drafter.
  • vllm_ascend/worker/worker.py
    • Changed NPUModelRunner import to be lazy for patch effectiveness.
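One subtlety likely behind a helper such as fix_new_class_closures (the PR's real implementation may differ; this is an assumed sketch): a method that uses zero-argument super() carries a hidden `__class__` closure cell, so grafting it onto a dynamically created class requires rebuilding that cell:

```python
import types

class Base:
    def greet(self):
        return "base"

class Original(Base):
    def greet(self):
        # Zero-arg super() reads the hidden __class__ closure cell; after a
        # naive copy into another class it would still point at Original.
        return "original+" + super().greet()

def clone_with_fixed_closure(func, new_cls):
    """Copy func, replacing its __class__ cell so super() targets new_cls."""
    freevars = func.__code__.co_freevars
    closure = list(func.__closure__ or ())
    if "__class__" in freevars:
        closure[freevars.index("__class__")] = types.CellType(new_cls)
    return types.FunctionType(func.__code__, func.__globals__,
                              func.__name__, func.__defaults__,
                              tuple(closure))

# Dynamically create a new class and graft the fixed method onto it.
Clone = type("Clone", (Base,), {})
Clone.greet = clone_with_fixed_closure(Original.greet, Clone)
print(Clone().greet())  # original+base
```

Without the cell fix, `super()` inside the copied method would resolve against the original class hierarchy instead of the new one.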

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the speculative decoding implementation to align it with the upstream vLLM library. The changes are quite extensive, introducing a new SpecDecodeBaseProposer and using dynamic class creation to reuse upstream logic while injecting custom behavior. This alignment also involves adding support for returning the number of rejected tokens from the prepare_inputs_padded kernel and plumbing this information through the model runner.

My review identified a critical bug in vllm_ascend/spec_decode/eagle_proposer.py that would cause an UnboundLocalError under certain conditions. The rest of the changes appear to be consistent with the refactoring goal. Please address the identified issue.

Comment thread vllm_ascend/spec_decode/eagle_proposer.py
@github-actions
Contributor

github-actions bot commented Mar 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@drslark drslark force-pushed the main branch 10 times, most recently from 4cbba31 to a7e97ee Compare March 3, 2026 02:45
Signed-off-by: drslark <slarksblood@qq.com>
@drslark drslark marked this pull request as draft March 3, 2026 03:42
@github-actions
Contributor

github-actions bot commented Mar 3, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.
