[Refactor] Consolidate SupportsEagle #36063
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request is a great refactoring that consolidates the logic for Eagle speculative decoding support by introducing an `EagleModelMixin` and updating the `SupportsEagle3` interface. This significantly reduces code duplication across multiple models. However, I've found a critical issue in the implementation of the forward pass for several models. The layer index passed to `_maybe_add_hidden_state` is incorrect when pipeline parallelism is used, as it uses a relative index instead of an absolute one. This will cause speculative decoding to fail in pipeline parallel setups. I've provided suggestions to fix this in the affected files.
Note: Security Review did not run due to the size of the PR.
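The relative-vs-absolute indexing issue the review describes can be illustrated with a toy sketch. All names here (`ToyModel`, the arguments to `_maybe_add_hidden_state`, `start_layer`) are hypothetical stand-ins modeled on this discussion, not the actual vLLM code:

```python
# Hypothetical sketch: with pipeline parallelism, each rank owns an
# absolute slice of layers, so aux-hidden-state capture must use the
# absolute layer index, not the loop-relative one.

class ToyModel:
    def __init__(self, start_layer: int, aux_layers: set[int]):
        # With PP, this rank owns layers
        # [start_layer, start_layer + num_local_layers).
        self.start_layer = start_layer
        self.aux_hidden_state_layers = aux_layers

    def _maybe_add_hidden_state(self, aux, layer_idx, hidden_states, residual):
        # Record the hidden state if this absolute layer index was requested.
        if layer_idx in self.aux_hidden_state_layers:
            aux.append(hidden_states)
        return aux

    def forward(self, hidden_states, num_local_layers: int):
        aux = []
        for i in range(num_local_layers):
            # Buggy variant would pass the relative index `i`;
            # the correct absolute index is `self.start_layer + i`.
            aux = self._maybe_add_hidden_state(
                aux, self.start_layer + i, hidden_states, None
            )
        return aux

# On PP rank 1 owning layers 8..15, requesting the aux output of layer 10:
model = ToyModel(start_layer=8, aux_layers={10})
assert len(model.forward("h", num_local_layers=8)) == 1
```

With the relative index `i` instead, layer 10 would only ever be captured on the rank whose local loop happens to reach index 10, which is why the review flags this as breaking spec decode under PP.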
Hi @benchislett, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
fynnsu
left a comment
This looks good, a few small comments/questions below.
```diff
  residual = intermediate_tensors["residual"]
- aux_hidden_states = []
+ aux_hidden_states = self._maybe_add_hidden_state([], 0, hidden_states, residual)
```
I think this needs to use `self.start_layer` to handle the PP case.
Suggested change:

```diff
- aux_hidden_states = self._maybe_add_hidden_state([], 0, hidden_states, residual)
+ aux_hidden_states = self._maybe_add_hidden_state([], self.start_layer, hidden_states, residual)
```
Addressed in description, see also #36151. Let me know if you think it would be better to apply the fix to all the models in this PR.
Sure, it seems like fixing this likely won't be enough to get PP + spec decode working, since we aren't transferring `aux_hidden_states` across PP ranks in the GPU model runner. So this will probably require a larger fix.
Renamed the bug accordingly.
```python
assert hasattr(self.language_model, "set_aux_hidden_state_layers")
self.language_model.set_aux_hidden_state_layers(layers)
```
Is any of this required for this model?

Won't the `SupportsEagle3.set_aux_hidden_state_layers` implementation already find the `.language_model` and call `_set_aux_hidden_state_layers` on its `.model` attr?

Same for `get_eagle3_default_aux_hidden_state_layers`.
This one is special because it calls `.set_aux_hidden_state_layers` on the `language_model` instead of reaching into `.language_model.model` directly. Because of this, it's technically possible for the `language_model` parent to override `.set_aux_hidden_state_layers` and change the behaviour.
Since a primary goal of this PR is not to change behaviour, I chose to leave this one alone.
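The delegation distinction being discussed can be sketched as follows. The class and method names mirror this thread (`set_aux_hidden_state_layers`, `_set_aux_hidden_state_layers`, `.language_model`, `.model`), but the bodies are illustrative assumptions, not the actual vLLM implementation:

```python
# Hedged sketch of two delegation paths for a multimodal wrapper model.

class InnerModel:
    def _set_aux_hidden_state_layers(self, layers):
        self.aux_hidden_state_layers = layers

class LanguageModel:
    def __init__(self):
        self.model = InnerModel()

    def set_aux_hidden_state_layers(self, layers):
        # A subclass may override this method and change behaviour, which
        # is why the wrapper below keeps calling it explicitly.
        self.model._set_aux_hidden_state_layers(layers)

class MultimodalWrapper:
    def __init__(self):
        self.language_model = LanguageModel()

    # Generic mixin path: reach into .language_model.model directly,
    # bypassing any override on the language_model class.
    def set_aux_hidden_state_layers_generic(self, layers):
        self.language_model.model._set_aux_hidden_state_layers(layers)

    # This model's existing path: go through the language_model method,
    # preserving any override the parent class might provide.
    def set_aux_hidden_state_layers(self, layers):
        self.language_model.set_aux_hidden_state_layers(layers)

m = MultimodalWrapper()
m.set_aux_hidden_state_layers((20, 30))
assert m.language_model.model.aux_hidden_state_layers == (20, 30)
```

Both paths end up setting the same attribute here, but they differ exactly when `LanguageModel` (or a subclass) overrides `set_aux_hidden_state_layers`, which is the behaviour the PR author chose to preserve.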
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
### What this PR does / why we need it?

1. Fix "TypeError: get_attn_backend() remove variable": [Refactor `check_and_update_config`](vllm-project/vllm#35122)
2. Fix [Rename `compile_ranges_split_points` to `compile_ranges_endpoints`](vllm-project/vllm#36027)
3. Fix "RuntimeError: device_allocator not a DeviceAllocator": [Replace memory related torch.cuda APIs](vllm-project/vllm#37031)
4. Fix [Support multiple KV groups in OffloadingSpec](vllm-project/vllm#36610): removed `self.offloaded_block_size` and changed `self.gpu_block_size` from a scalar to a tuple of per-group block sizes, adding `block_size_factor`.
5. Fix [Consolidate SupportsEagle](vllm-project/vllm#36063): renamed `get_eagle3_aux_hidden_state_layers()` to `get_eagle3_default_aux_hidden_state_layers()` and added a `supports_eagle3()` guard before calling it.

### Does this PR introduce _any_ user-facing change?

NA

### How was this patch tested?

E2E

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@8a68046

Signed-off-by: leo-pony <nengjunma@outlook.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Purpose

Consolidates `set_aux_hidden_state_layers` and `get_eagle3_aux_hidden_state_layers`, which are the same pretty much everywhere. Also records `aux_hidden_states` for the output of the last layer, for completeness. Adds `EagleModelMixin` to facilitate some of the logic.

The goal of this PR is to keep existing behaviours unchanged as much as possible. As such, some implementations' semantics are unchanged even if they might be considered buggy.
Specifically, #36151 outlines an issue with PP support due to incorrect iteration order over the layers. I consider this to be out-of-scope for this PR, as I do not want to change semantics as part of this refactor if I can help it. In the event that the PP fix has consequences, it would be easier to rollback as a standalone follow-up than as part of a broader refactor. I am open to discussion on this matter if we feel it would be easier to lump it all together.
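The kind of consolidation described above can be sketched as a small mixin. The method names follow the PR discussion, but the bodies and the default layer choice are assumptions for illustration, not the actual vLLM implementation:

```python
# Illustrative sketch of an EagleModelMixin providing the shared helpers once.

class EagleModelMixin:
    aux_hidden_state_layers: tuple[int, ...] = ()

    def _set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
        self.aux_hidden_state_layers = layers

    def _get_eagle3_default_aux_hidden_state_layers(self, num_layers: int):
        # Assumed default: capture an early, a middle, and a late layer.
        return (2, num_layers // 2, num_layers - 3)

    def _maybe_add_hidden_state(self, aux, layer_idx, hidden_states, residual):
        # Append the hidden state (folding in the residual when present)
        # if this layer index was requested for EAGLE3 aux outputs.
        if layer_idx in self.aux_hidden_state_layers:
            if residual is not None:
                hidden_states = hidden_states + residual
            aux.append(hidden_states)
        return aux

m = EagleModelMixin()
m._set_aux_hidden_state_layers(m._get_eagle3_default_aux_hidden_state_layers(32))
assert m.aux_hidden_state_layers == (2, 16, 29)
```

Each model's forward loop then only needs a one-line call per layer instead of duplicating the capture logic.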
This means that some implementations will use `aux_hidden_states = self._maybe_add_hidden_state([], 0, hidden_states, residual)` and others will have `self.start_layer` for the first layer. This can be very easily changed in a follow-up PR.

Testing
Spec Decoding E2E tests all passing locally. Also manually checked that the (marked as skipped) Qwen3-VL EAGLE3 test works properly.