[Bugfix][Model] Fix Eagle3 speculative decoding for Qwen3Next-based models by NikitosKh · Pull Request #36527 · vllm-project/vllm

NikitosKh · 2026-03-09T18:04:37Z

Purpose

While training an Eagle3 draft model for Qwen3.5-9B (weights on HF), I discovered that Eagle3 speculative decoding is silently broken for all Qwen3Next-based models (Qwen3.5 and future models using this base).

The root cause is straightforward: Qwen3NextModel.forward() never captures auxiliary hidden states. The SupportsEagle3 protocol is declared on Qwen3_5ForConditionalGeneration (inherited from Qwen3VLForConditionalGeneration), so vLLM happily calls set_aux_hidden_state_layers() — but the inner model's forward() just ignores the setting. No error is raised; Eagle3 simply doesn't work.

Compare with Qwen2Model.forward() which correctly implements enumerate(islice(...)) with aux hidden state capture. Qwen3NextModel was missing the same logic.

Changes

qwen3_next.py (the actual fix):

Initialize self.aux_hidden_state_layers in Qwen3NextModel.__init__
Update Qwen3NextModel.forward() to capture hidden_states + residual at specified layer indices, mirroring the Qwen2Model pattern
Add SupportsEagle3 + set_aux_hidden_state_layers + get_eagle3_aux_hidden_state_layers to Qwen3NextForCausalLM

qwen3_5.py (propagate to the Qwen3.5-specific class):

Add SupportsEagle3 to Qwen3_5ForCausalLMBase with the same Eagle3 methods

tests/v1/spec_decode/test_qwen3next_eagle3_support.py (new):

9 tests covering protocol compliance, forward() behavior, and consistency with the Qwen2Model reference implementation — all fail before the fix, all pass after

Test Plan

# Unit tests (no GPU needed)
pytest tests/v1/spec_decode/test_qwen3next_eagle3_support.py -v

# E2E with trained Eagle3 draft model
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-9B \
  --speculative-config '{"method": "eagle3", "model": "BLR2/Qwen3.5-9B-Eagle3-ShareGPT", "num_speculative_tokens": 3}' \
  --trust-remote-code --port 8199 --max-model-len 4096

Test Result

9/9 unit tests pass. E2E server starts, loads the draft model, and serves requests with speculative decoding active.
The modest accept length is due to the draft model being trained on a small dataset.
Benchmark on a single H100 (Qwen3.5-9B + our Eagle3 draft):

Domain	Baseline tok/s	Eagle3 tok/s	Speedup
code	145.1	260.3	1.79x
math/reasoning	145.0	265.3	1.83x
QA	145.1	207.6	1.43x
summarization	144.9	226.5	1.56x
translation	145.1	200.3	1.38x
overall	145.1	232.2	1.60x

…odels Qwen3NextModel.forward() was missing auxiliary hidden state capture logic, making Eagle3 non-functional for all Qwen3Next-based models (including Qwen3.5). The SupportsEagle3 protocol was declared on Qwen3_5ForConditionalGeneration via inheritance, but the inner model forward() never captured the states — silently breaking Eagle3. Fix: - Add aux_hidden_state_layers init and capture logic to Qwen3NextModel.forward(), mirroring the Qwen2Model pattern - Add SupportsEagle3 to Qwen3NextForCausalLM and Qwen3_5ForCausalLMBase with set/get methods - Add 9 tests verifying protocol compliance, forward() behavior, and consistency with the Qwen2 reference implementation Tested with Qwen3.5-9B + Eagle3 draft model on vLLM, achieving 2.1x speedup over autoregressive baseline. Signed-off-by: NikitosKh <nikitak.khomich.work@gmail.com>

github-actions · 2026-03-09T18:04:48Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

gemini-code-assist

Code Review

This pull request aims to fix Eagle3 speculative decoding for Qwen3Next-based models by implementing auxiliary hidden state capturing in Qwen3NextModel. However, it introduces a critical bug in the initialization of auxiliary hidden state layers, specifically an invalid call to a type alias (tuple[int, ...]()). This will cause a runtime TypeError during the model's forward pass when Eagle3 speculative decoding is not enabled, leading to a denial of service for standard use cases. Additionally, the layer indexing logic in the forward pass is incompatible with Pipeline Parallelism, which may cause crashes or incorrect behavior in distributed configurations. The changes correctly follow the pattern from Qwen2Model but these two critical issues prevent the feature from working as intended.

gemini-code-assist · 2026-03-09T18:12:56Z

        else:
            self.norm = PPMissingLayer()

+        self.aux_hidden_state_layers = tuple[int, ...]()


The initialization of self.aux_hidden_state_layers using tuple[int, ...]() is invalid Python syntax and will cause a TypeError at runtime. This occurs in the __init__ method of Qwen3NextModel. If Eagle3 speculative decoding is not enabled, the forward pass will crash when attempting to perform an in check on a GenericAlias object, leading to a complete denial of service for Qwen3Next-based models without Eagle3 enabled. To initialize an empty tuple, use the literal ().

Suggested change

self.aux_hidden_state_layers = tuple[int, ...]()

self.aux_hidden_state_layers: tuple[int, ...] = ()

gemini-code-assist · 2026-03-09T18:12:57Z

+        for idx, layer in enumerate(
+            islice(self.layers, self.start_layer, self.end_layer)
+        ):
+            if idx in self.aux_hidden_state_layers:


The current layer iteration logic using enumerate(islice(...)) causes idx to be relative to the current shard's start layer. Since self.aux_hidden_state_layers stores absolute layer indices, this check will fail for shards where start_layer > 0. This incompatibility with Pipeline Parallelism means no auxiliary hidden states will be captured on those shards, likely causing the Eagle3 engine to crash or produce incorrect results. To correctly support Pipeline Parallelism, the loop structure needs to be adjusted to ensure idx represents the absolute layer index.

Suggested change

for idx, layer in enumerate(

islice(self.layers, self.start_layer, self.end_layer)

):

if idx in self.aux_hidden_state_layers:

for idx, layer in islice(

enumerate(self.layers), self.start_layer, self.end_layer

):

if idx in self.aux_hidden_state_layers:

mergify · 2026-03-11T11:11:09Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NikitosKh.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

NikitosKh · 2026-03-12T10:42:29Z

Closing since #36658 covered the same fix

NikitosKh requested a review from sighingnow as a code owner March 9, 2026 18:04

mergify Bot added qwen Related to Qwen models speculative-decoding v1 bug Something isn't working labels Mar 9, 2026

gemini-code-assist Bot reviewed Mar 9, 2026

View reviewed changes

mergify Bot added the needs-rebase label Mar 11, 2026

NikitosKh closed this Mar 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bugfix][Model] Fix Eagle3 speculative decoding for Qwen3Next-based models#36527

[Bugfix][Model] Fix Eagle3 speculative decoding for Qwen3Next-based models#36527
NikitosKh wants to merge 1 commit intovllm-project:mainfrom
NikitosKh:fix/eagle3-qwen3next-support

NikitosKh commented Mar 9, 2026 •

edited by github-actions Bot

Loading

Uh oh!

github-actions Bot commented Mar 9, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Mar 9, 2026

Uh oh!

gemini-code-assist Bot Mar 9, 2026

Uh oh!

mergify Bot commented Mar 11, 2026

Uh oh!

NikitosKh commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	self.aux_hidden_state_layers = tuple[int, ...]()
	self.aux_hidden_state_layers: tuple[int, ...] = ()

Uh oh!

Conversation

NikitosKh commented Mar 9, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Test Plan

Test Result

Uh oh!

github-actions Bot commented Mar 9, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented Mar 11, 2026

Uh oh!

NikitosKh commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NikitosKh commented Mar 9, 2026 •

edited by github-actions Bot

Loading