
[Refactor] Consolidate SupportsEagle #36063

Merged
hmellor merged 5 commits into vllm-project:main from CentML:bchislett/eagle-abstraction
Mar 13, 2026

Conversation

Collaborator

@benchislett benchislett commented Mar 4, 2026

Purpose

  • Consolidate set_aux_hidden_state_layers and get_eagle3_aux_hidden_state_layers, which have near-identical implementations across models
  • Also capture aux_hidden_states for the output of the last layer, for completeness. Adds EagleModelMixin to facilitate some of the logic.
  • Add SupportsEagle tag to models which support Eagle3, for consistency.

The goal of this PR is to keep existing behaviours unchanged as much as possible. As such, some implementations' semantics are unchanged even if they might be considered buggy.

Specifically, #36151 outlines an issue with PP support due to incorrect iteration order over the layers. I consider this to be out-of-scope for this PR, as I do not want to change semantics as part of this refactor if I can help it. In the event that the PP fix has consequences, it would be easier to rollback as a standalone follow-up than as part of a broader refactor. I am open to discussion on this matter if we feel it would be easier to lump it all together.
This means that some implementations will use aux_hidden_states = self._maybe_add_hidden_state([], 0, hidden_states, residual) while others pass self.start_layer for the first layer. This can easily be changed in a follow-up PR.
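
For orientation, here is a minimal sketch of the consolidated helper, pieced together from the snippets quoted in the review threads below; the exact signatures and the residual-summing detail are assumptions, not the merged code:

class EagleModelMixin:
    aux_hidden_state_layers: tuple[int, ...] = ()

    def _set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        self.aux_hidden_state_layers = layers

    def _maybe_add_hidden_state(self, aux_hidden_states, layer_idx, hidden_states, residual):
        # Record the hidden state for layers the Eagle3 drafter consumes.
        # Folding the residual back in (an assumption here) matches the
        # value the next decoder layer would actually see.
        if layer_idx in self.aux_hidden_state_layers:
            aux_hidden_states.append(
                hidden_states if residual is None else hidden_states + residual
            )
        return aux_hidden_states

A model's forward loop would call this once per decoder layer, plus once more after the final layer per the second bullet above.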

Testing

Spec decoding E2E tests are all passing locally. Also manually checked that the (currently marked as skipped) Qwen3-VL EAGLE3 test works properly.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify Bot added the llama (Related to Llama models), qwen (Related to Qwen models), gpt-oss (Related to GPT-OSS models), and v1 labels Mar 4, 2026
@mergify mergify Bot added the kv-connector label Mar 4, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request is a great refactoring that consolidates the logic for Eagle speculative decoding support by introducing an EagleModelMixin and updating the SupportsEagle3 interface. This significantly reduces code duplication across multiple models. However, I've found a critical issue in the implementation of the forward pass for several models. The layer index passed to _maybe_add_hidden_state is incorrect when pipeline parallelism is used, as it uses a relative index instead of an absolute one. This will cause speculative decoding to fail in pipeline parallel setups. I've provided suggestions to fix this in the affected files.

Note: Security Review did not run due to the size of the PR.
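
As a standalone toy of the indexing concern raised here (illustrative numbers, not vLLM code): under pipeline parallelism each rank holds only a slice of the layers, so enumerating that slice from zero yields rank-relative indices that no longer match the absolute layer ids configured for aux capture.

aux_layers = {2, 5}            # absolute layer ids configured for Eagle3 capture
start_layer, end_layer = 4, 8  # this PP rank owns layers 4..7

# Relative indexing (enumerating the local slice from 0) only lines up on rank 0:
relative = [i for i in range(end_layer - start_layer) if i in aux_layers]
# Offsetting by start_layer recovers the absolute ids on every rank:
absolute = [start_layer + i for i in range(end_layer - start_layer)
            if start_layer + i in aux_layers]

print(relative)  # [2] -- but "relative layer 2" is really absolute layer 6
print(absolute)  # [5] -- layer 5 genuinely lives on this rank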

Comment thread vllm/model_executor/models/afmoe.py
Comment thread vllm/model_executor/models/apertus.py
Comment thread vllm/model_executor/models/arcee.py
Comment thread vllm/model_executor/models/llama.py
Comment thread vllm/model_executor/models/minicpm.py
Comment thread vllm/model_executor/models/qwen2.py
Comment thread vllm/model_executor/models/step1.py
Contributor

mergify Bot commented Mar 5, 2026

Hi @benchislett, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Contributor

@fynnsu fynnsu left a comment

This looks good, a few small comments/questions below.

residual = intermediate_tensors["residual"]

aux_hidden_states = []
aux_hidden_states = self._maybe_add_hidden_state([], 0, hidden_states, residual)
Contributor

I think this needs to use start_layer to handle the PP case.

Suggested change
aux_hidden_states = self._maybe_add_hidden_state([], 0, hidden_states, residual)
aux_hidden_states = self._maybe_add_hidden_state([], self.start_layer, hidden_states, residual)

Collaborator Author

Addressed in description, see also #36151. Let me know if you think it would be better to apply the fix to all the models in this PR.

Contributor

Sure. Fixing this alone likely won't be enough to get PP + spec decode working, since it seems we aren't transferring aux_hidden_states across PP ranks in the GPU model runner. So this will probably require a larger fix.

Collaborator Author

Renamed the bug accordingly.

Comment thread vllm/model_executor/models/afmoe.py
Comment on lines 812 to 813
assert hasattr(self.language_model, "set_aux_hidden_state_layers")
self.language_model.set_aux_hidden_state_layers(layers)
Contributor

Is any of this required for this model?

Won't the SupportsEagle3.set_aux_hidden_state_layers implementation already find the .language_model and call _set_aux_hidden_state_layers on its .model attr?

Same for get_eagle3_default_aux_hidden_state_layers

Collaborator Author

This one is special because it calls .set_aux_hidden_state_layers on the language_model instead of reaching into .language_model.model directly. Because of this, it's technically possible for the language_model parent to override .set_aux_hidden_state_layers and change the behaviour.

Since a primary goal of this PR is not to change behaviour, I chose to leave this one alone.
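
For readers following the thread, a hedged sketch of the two delegation paths being contrasted (illustrative class names; the real interface code may differ):

class GenericWrapper:
    def set_aux_hidden_state_layers(self, layers):
        # Default SupportsEagle3 path described above: reach through the
        # wrapper straight into the inner decoder.
        self.language_model.model._set_aux_hidden_state_layers(layers)

class AfmoeStyleWrapper:
    def set_aux_hidden_state_layers(self, layers):
        # afmoe's path: one extra hop through language_model, so a subclass
        # of the language model can override this method and change behaviour.
        self.language_model.set_aux_hidden_state_layers(layers)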

Comment thread vllm/model_executor/models/qwen2_5_vl.py Outdated
@github-project-automation github-project-automation Bot moved this from To Triage to In progress in gpt-oss Issues & Enhancements Mar 5, 2026
Contributor

mergify Bot commented Mar 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@mergify mergify Bot removed the needs-rebase label Mar 5, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Contributor

@fynnsu fynnsu left a comment

LGTM

Comment thread vllm/model_executor/models/interfaces.py
Comment thread vllm/model_executor/models/mimo_v2_flash.py
@github-project-automation github-project-automation Bot moved this from In progress to Ready in gpt-oss Issues & Enhancements Mar 13, 2026
@hmellor hmellor enabled auto-merge (squash) March 13, 2026 18:42
@github-actions github-actions Bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 13, 2026
@hmellor hmellor merged commit 8b34630 into vllm-project:main Mar 13, 2026
66 checks passed
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Mar 18, 2026
### What this PR does / why we need it?

1. Fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](vllm-project/vllm#35122)

2. Fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](vllm-project/vllm#36027)

3. Fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory related torch.cuda APIs](vllm-project/vllm#37031)

4. Fix [Support multiple KV groups in OffloadingSpec](vllm-project/vllm#36610), which removed self.offloaded_block_size and changed self.gpu_block_size from a scalar to a tuple of per-group block sizes, adding block_size_factor.

5. Fix [Consolidate SupportsEagle](vllm-project/vllm#36063), which renamed get_eagle3_aux_hidden_state_layers() to get_eagle3_default_aux_hidden_state_layers() and added a supports_eagle3() guard before calling it (see the sketch below).
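
A hedged sketch of the adaptation described in item 5 (the exact vllm-ascend call site may differ, and the import path assumes the guard lives alongside the other supports_* checks in interfaces.py):

from vllm.model_executor.models.interfaces import supports_eagle3

if supports_eagle3(model):
    layers = model.get_eagle3_default_aux_hidden_state_layers()
    model.set_aux_hidden_state_layers(layers)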

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
E2E


- vLLM version: v0.17.0
- vLLM main:
vllm-project/vllm@8a68046

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 25, 2026
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
lihaokun-2026 pushed a commit to lihaokun-2026/vllm-ascend that referenced this pull request Mar 29, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Apr 1, 2026
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>

Labels

gpt-oss (Related to GPT-OSS models), kv-connector, llama (Related to Llama models), qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed), v1

Projects

Status: Done


3 participants