
[Bugfix][Model] Fix Eagle3 speculative decoding for Qwen3Next-based models #36527

Closed
NikitosKh wants to merge 1 commit into vllm-project:main from NikitosKh:fix/eagle3-qwen3next-support

Conversation


NikitosKh commented Mar 9, 2026

Purpose

While training an Eagle3 draft model for Qwen3.5-9B (weights on HF), I discovered that Eagle3 speculative decoding is silently broken for all Qwen3Next-based models (Qwen3.5 and future models using this base).

The root cause is straightforward: Qwen3NextModel.forward() never captures auxiliary hidden states. The SupportsEagle3 protocol is declared on Qwen3_5ForConditionalGeneration (inherited from Qwen3VLForConditionalGeneration), so vLLM happily calls set_aux_hidden_state_layers() — but the inner model's forward() just ignores the setting. No error is raised; Eagle3 simply doesn't work.

Compare with Qwen2Model.forward(), which correctly captures aux hidden states inside its enumerate(islice(...)) loop. Qwen3NextModel was missing the same logic.
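The Qwen2Model capture pattern being mirrored can be sketched in isolation. This is a minimal toy (integer hidden states, dummy layers); only aux_hidden_state_layers, start_layer, and end_layer mirror the vLLM attribute names, everything else is illustrative:

```python
from itertools import islice


class TinyModel:
    """Minimal stand-in for a decoder stack with aux hidden state capture."""

    def __init__(self, num_layers: int):
        # Each "layer" just adds its index to the running hidden state.
        self.layers = [lambda h, i=i: h + i for i in range(num_layers)]
        self.start_layer, self.end_layer = 0, num_layers
        self.aux_hidden_state_layers: tuple[int, ...] = ()

    def forward(self, hidden_states: int):
        aux_hidden_states = []
        for idx, layer in enumerate(
            islice(self.layers, self.start_layer, self.end_layer)
        ):
            # Snapshot the input to every layer whose index was requested.
            if idx in self.aux_hidden_state_layers:
                aux_hidden_states.append(hidden_states)
            hidden_states = layer(hidden_states)
        if aux_hidden_states:
            return hidden_states, aux_hidden_states
        return hidden_states


model = TinyModel(num_layers=4)
model.aux_hidden_state_layers = (1, 3)
out, aux = model.forward(0)  # aux holds the inputs to layers 1 and 3
```

With no aux layers configured, forward() returns only the final hidden state, so the non-Eagle3 path is unchanged.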

Changes

qwen3_next.py (the actual fix):

  • Initialize self.aux_hidden_state_layers in Qwen3NextModel.__init__
  • Update Qwen3NextModel.forward() to capture hidden_states + residual at specified layer indices, mirroring the Qwen2Model pattern
  • Add SupportsEagle3 + set_aux_hidden_state_layers + get_eagle3_aux_hidden_state_layers to Qwen3NextForCausalLM
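The two added methods are thin delegations to the inner model. A hedged sketch of the shape (stand-in classes, not the actual vLLM code; the layer-selection heuristic and the layer count of 48 are hypothetical):

```python
class _InnerModel:
    """Stand-in for Qwen3NextModel: just holds the configured layer indices."""

    def __init__(self):
        self.aux_hidden_state_layers: tuple[int, ...] = ()


class Qwen3NextForCausalLMSketch:
    """Illustrates the SupportsEagle3-style delegation; method names mirror vLLM."""

    def __init__(self):
        self.model = _InnerModel()

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        # Forward the request to the inner model, which does the capturing.
        self.model.aux_hidden_state_layers = layers

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        # Tap an early, middle, and late layer (a common Eagle3 choice;
        # the exact indices here are illustrative, not from the PR).
        num_layers = 48  # hypothetical layer count
        return (2, num_layers // 2, num_layers - 3)


m = Qwen3NextForCausalLMSketch()
m.set_aux_hidden_state_layers(m.get_eagle3_aux_hidden_state_layers())
```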

qwen3_5.py (propagate to the Qwen3.5-specific class):

  • Add SupportsEagle3 to Qwen3_5ForCausalLMBase with the same Eagle3 methods

tests/v1/spec_decode/test_qwen3next_eagle3_support.py (new):

  • 9 tests covering protocol compliance, forward() behavior, and consistency with the Qwen2Model reference implementation — all fail before the fix, all pass after
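The protocol-compliance style of check can be illustrated stand-alone (dummy classes; the real tests import the vLLM model classes and the actual SupportsEagle3 interface):

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsEagle3Sketch(Protocol):
    """Stand-in for vLLM's SupportsEagle3 interface (structural check only)."""

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None: ...
    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]: ...


class GoodModel:
    def __init__(self):
        self._layers: tuple[int, ...] = ()

    def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        self._layers = layers

    def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
        return (2, 24, 45)


class BrokenModel:
    """Declares nothing — analogous to the pre-fix Qwen3NextForCausalLM."""


# runtime_checkable protocols verify method presence at isinstance() time.
assert isinstance(GoodModel(), SupportsEagle3Sketch)
assert not isinstance(BrokenModel(), SupportsEagle3Sketch)
```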

Test Plan

# Unit tests (no GPU needed)
pytest tests/v1/spec_decode/test_qwen3next_eagle3_support.py -v

# E2E with trained Eagle3 draft model
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-9B \
  --speculative-config '{"method": "eagle3", "model": "BLR2/Qwen3.5-9B-Eagle3-ShareGPT", "num_speculative_tokens": 3}' \
  --trust-remote-code --port 8199 --max-model-len 4096

Test Result

9/9 unit tests pass. E2E server starts, loads the draft model, and serves requests with speculative decoding active.
The modest accept length is due to the draft model being trained on a small dataset.
Benchmark on a single H100 (Qwen3.5-9B + our Eagle3 draft):

| Domain         | Baseline tok/s | Eagle3 tok/s | Speedup |
|----------------|----------------|--------------|---------|
| code           | 145.1          | 260.3        | 1.79x   |
| math/reasoning | 145.0          | 265.3        | 1.83x   |
| QA             | 145.1          | 207.6        | 1.43x   |
| summarization  | 144.9          | 226.5        | 1.56x   |
| translation    | 145.1          | 200.3        | 1.38x   |
| overall        | 145.1          | 232.2        | 1.60x   |
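As a sanity check, the speedup column follows directly from the two throughput columns (numbers copied from the table above):

```python
# (baseline tok/s, Eagle3 tok/s) per domain, single H100.
rows = {
    "code": (145.1, 260.3),
    "math/reasoning": (145.0, 265.3),
    "QA": (145.1, 207.6),
    "summarization": (144.9, 226.5),
    "translation": (145.1, 200.3),
    "overall": (145.1, 232.2),
}
speedups = {domain: round(eagle / base, 2) for domain, (base, eagle) in rows.items()}
```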

…odels

Qwen3NextModel.forward() was missing auxiliary hidden state capture
logic, making Eagle3 non-functional for all Qwen3Next-based models
(including Qwen3.5). The SupportsEagle3 protocol was declared on
Qwen3_5ForConditionalGeneration via inheritance, but the inner model
forward() never captured the states — silently breaking Eagle3.
Fix:
- Add aux_hidden_state_layers init and capture logic to
  Qwen3NextModel.forward(), mirroring the Qwen2Model pattern
- Add SupportsEagle3 to Qwen3NextForCausalLM and
  Qwen3_5ForCausalLMBase with set/get methods
- Add 9 tests verifying protocol compliance, forward() behavior,
  and consistency with the Qwen2 reference implementation
Tested with Qwen3.5-9B + Eagle3 draft model on vLLM, achieving
2.1x speedup over autoregressive baseline.

Signed-off-by: NikitosKh <nikitak.khomich.work@gmail.com>
NikitosKh requested a review from sighingnow as a code owner March 9, 2026 18:04

github-actions Bot commented Mar 9, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

mergify Bot added labels: qwen (Related to Qwen models), speculative-decoding, v1, bug (Something isn't working) — Mar 9, 2026
Contributor

gemini-code-assist Bot left a comment


Code Review

This pull request aims to fix Eagle3 speculative decoding for Qwen3Next-based models by implementing auxiliary hidden state capturing in Qwen3NextModel. However, it introduces a critical bug in the initialization of auxiliary hidden state layers, specifically an invalid call to a type alias (tuple[int, ...]()). This will cause a runtime TypeError during the model's forward pass when Eagle3 speculative decoding is not enabled, leading to a denial of service for standard use cases. Additionally, the layer indexing logic in the forward pass is incompatible with Pipeline Parallelism, which may cause crashes or incorrect behavior in distributed configurations. The changes correctly follow the pattern from Qwen2Model but these two critical issues prevent the feature from working as intended.

else:
    self.norm = PPMissingLayer()

self.aux_hidden_state_layers = tuple[int, ...]()
Contributor


Severity: high (security)

The initialization of self.aux_hidden_state_layers using tuple[int, ...]() is invalid Python syntax and will cause a TypeError at runtime. This occurs in the __init__ method of Qwen3NextModel. If Eagle3 speculative decoding is not enabled, the forward pass will crash when attempting to perform an in check on a GenericAlias object, leading to a complete denial of service for Qwen3Next-based models without Eagle3 enabled. To initialize an empty tuple, use the literal ().

Suggested change:

- self.aux_hidden_state_layers = tuple[int, ...]()
+ self.aux_hidden_state_layers: tuple[int, ...] = ()

Comment on lines +1155 to +1158
for idx, layer in enumerate(
    islice(self.layers, self.start_layer, self.end_layer)
):
    if idx in self.aux_hidden_state_layers:
Contributor


Severity: high (security)

The current layer iteration logic using enumerate(islice(...)) causes idx to be relative to the current shard's start layer. Since self.aux_hidden_state_layers stores absolute layer indices, this check will fail for shards where start_layer > 0. This incompatibility with Pipeline Parallelism means no auxiliary hidden states will be captured on those shards, likely causing the Eagle3 engine to crash or produce incorrect results. To correctly support Pipeline Parallelism, the loop structure needs to be adjusted to ensure idx represents the absolute layer index.

Suggested change:

- for idx, layer in enumerate(
-     islice(self.layers, self.start_layer, self.end_layer)
- ):
-     if idx in self.aux_hidden_state_layers:
+ for idx, layer in islice(
+     enumerate(self.layers), self.start_layer, self.end_layer
+ ):
+     if idx in self.aux_hidden_state_layers:
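The difference between the two orderings is easy to demonstrate in isolation (a toy example, not the vLLM loop itself):

```python
from itertools import islice

layers = ["l0", "l1", "l2", "l3", "l4", "l5"]
start_layer, end_layer = 2, 5  # this pipeline shard owns layers 2..4

# Buggy ordering: slicing first means idx restarts at 0 on every shard,
# so it never matches the absolute indices stored in aux_hidden_state_layers.
relative = [idx for idx, _ in enumerate(islice(layers, start_layer, end_layer))]

# Fixed ordering: enumerate first, then slice, so idx stays absolute.
absolute = [idx for idx, _ in islice(enumerate(layers), start_layer, end_layer)]

print(relative, absolute)  # [0, 1, 2] [2, 3, 4]
```

On a single-GPU run (start_layer = 0) the two orderings coincide, which is why the bug would only surface under pipeline parallelism.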


mergify Bot commented Mar 11, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NikitosKh.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot added the needs-rebase label Mar 11, 2026
@NikitosKh
Author

Closing since #36658 covered the same fix

NikitosKh closed this Mar 12, 2026