
Add EAGLE-3 Speculative Decoding Support for Qwen3 MoE#26485

Merged
DarkLight1337 merged 4 commits into vllm-project:main from neuralmagic:feature/support-qwen3-moe-eagle3
Oct 11, 2025

Conversation

@rahul-tuli
Contributor

@rahul-tuli rahul-tuli commented Oct 9, 2025

This PR adds support for EAGLE-3 speculative decoding to the Qwen3MoeForCausalLM model, enabling faster inference with draft models like nm-testing/Mockup-qwen235-eagle3-fp16.

Changes

Modified Files

  • vllm/model_executor/models/qwen3_moe.py

Implementation Details

  1. Added SupportsEagle3 Interface

    • Imported and added SupportsEagle3 to Qwen3MoeForCausalLM class inheritance
    • Implements required methods: set_aux_hidden_state_layers() and get_eagle3_aux_hidden_state_layers()
  2. Updated Qwen3MoeModel

    • Added aux_hidden_state_layers attribute to track layers that output auxiliary hidden states
    • Modified forward() method to collect auxiliary hidden states at specified layers
    • Returns tuple of (hidden_states, aux_hidden_states) when auxiliary states are collected
  3. Updated Qwen3MoeForCausalLM

    • Implements get_eagle3_aux_hidden_state_layers() to return the auxiliary layer indices (layer 2, the middle layer, and the third-from-last layer)
    • Implements set_aux_hidden_state_layers() to configure which layers output auxiliary states
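The wiring described above can be sketched as a simplified, runnable stand-in. This is not the actual vLLM code: the class names carry a `Sketch` suffix, the decoder layers are replaced by a trivial list operation, and only the layer-index bookkeeping follows the PR description (collect the hidden state at each configured layer, return `(hidden_states, aux_hidden_states)` when any were collected, and pick layers 2, middle, and n−3 as the defaults).

```python
# Hedged sketch of the SupportsEagle3 wiring described above. Not the real
# vLLM classes: tensors are plain lists and each "decoder layer" just adds 1,
# so the example is self-contained and runnable.

class Qwen3MoeModelSketch:
    """Toy decoder stack that can emit auxiliary hidden states."""

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        # Which layers should emit auxiliary hidden states (empty = none).
        self.aux_hidden_state_layers: tuple[int, ...] = ()

    def forward(self, hidden_states):
        aux_hidden_states = []
        for layer_idx in range(self.num_layers):
            if layer_idx in self.aux_hidden_state_layers:
                # Collect the input to this layer as an auxiliary hidden state.
                aux_hidden_states.append(hidden_states)
            # Stand-in for a real decoder layer.
            hidden_states = [h + 1 for h in hidden_states]
        if aux_hidden_states:
            # Matches the PR: return a tuple when auxiliary states were collected.
            return hidden_states, aux_hidden_states
        return hidden_states


class Qwen3MoeForCausalLMSketch:
    def __init__(self, num_layers: int):
        self.model = Qwen3MoeModelSketch(num_layers)

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        self.model.aux_hidden_state_layers = layers

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Per the PR description: layer 2, the middle layer, and layer n - 3.
        n = self.model.num_layers
        return (2, n // 2, n - 3)


if __name__ == "__main__":
    llm = Qwen3MoeForCausalLMSketch(num_layers=12)
    print(llm.get_eagle3_aux_hidden_state_layers())  # (2, 6, 9)
    llm.set_aux_hidden_state_layers(llm.get_eagle3_aux_hidden_state_layers())
    final, aux = llm.model.forward([0.0])
    print(len(aux))  # 3
```

The EAGLE-3 drafter then consumes the three auxiliary hidden states alongside the final one; the specific choice of layers (early, middle, late) mirrors the existing Llama and Qwen3 dense implementations referenced below.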

Testing

Tested with Qwen3-235B-A22B MoE model and EAGLE-3 drafter:

from vllm import LLM, SamplingParams

# Initialize with EAGLE-3 speculative decoding
llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
    tensor_parallel_size=4,
    speculative_config={
        "model": "nm-testing/Mockup-qwen235-eagle3-fp16",
        "method": "eagle3",
        "num_speculative_tokens": 3,
    },
    max_model_len=16384,
)

# Generate with speculative decoding
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")

Related

This implementation follows the same pattern as existing EAGLE-3 support in:

  • Qwen2ForCausalLM
  • Qwen3ForCausalLM
  • LlamaForCausalLM

@mergify mergify bot added the qwen Related to Qwen models label Oct 9, 2025
@rahul-tuli rahul-tuli marked this pull request as ready for review October 9, 2025 13:30
@rahul-tuli rahul-tuli requested a review from sighingnow as a code owner October 9, 2025 13:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@DarkLight1337 DarkLight1337 requested a review from 22quinn October 9, 2025 13:36
@mgoin mgoin added speculative-decoding ready ONLY add when PR is ready to merge/full CI is needed labels Oct 9, 2025
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@rahul-tuli rahul-tuli force-pushed the feature/support-qwen3-moe-eagle3 branch from 0d94c76 to 05a6bb2 Compare October 10, 2025 10:50
Collaborator

@benchislett benchislett left a comment


LGTM

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 11, 2025 09:31
@DarkLight1337 DarkLight1337 merged commit d2a7153 into vllm-project:main Oct 11, 2025
54 checks passed
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
@zeroorhero

@rahul-tuli Hi, did speculative decoding actually make Qwen3 MoE faster? What acceptance rate did you see on average?

rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
@gyou2021

Is nm-testing/Mockup-qwen235-eagle3-fp16 the same as nm-testing/Mockup-qwen235-eagle3-fp16-speculators-converted? There is currently no nm-testing/Mockup-qwen235-eagle3-fp16 at https://huggingface.co/nm-testing.

@eldarkurtic
Contributor

As the name suggests, nm-testing models are temporary models for testing purposes. We will soon release an official qwen235 speculator under RedHatAI, which the community should use.

devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

  • qwen — Related to Qwen models
  • ready — ONLY add when PR is ready to merge/full CI is needed
  • speculative-decoding


8 participants