Add: Eagle3 support for Qwen3.5 #36658
Conversation
Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
gambletan
left a comment
Nice addition of Eagle3 support for Qwen3.5 and Qwen3Next.
One thing I noticed: in qwen3_next.py, the forward() method's return type annotation is changed to torch.Tensor | IntermediateTensors | tuple[torch.Tensor, list[torch.Tensor]], but the caller of this method needs to handle the new tuple return type. Is there a common dispatch layer (e.g., in the Eagle3 speculative decoding code) that already pattern-matches on isinstance(result, tuple) for other models? If not, this could break callers that expect only torch.Tensor | IntermediateTensors.
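For concreteness, the dispatch I have in mind looks roughly like this (illustrative sketch only, not a quote of the runner code; unpack_model_output is a made-up helper name):

def unpack_model_output(model_output):
    # Widened return contract: either hidden_states alone, or a
    # (hidden_states, aux_hidden_states) pair when Eagle3 collection is active.
    if isinstance(model_output, tuple):
        hidden_states, aux_hidden_states = model_output
    else:
        hidden_states, aux_hidden_states = model_output, None
    return hidden_states, aux_hidden_states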
Also, in qwen3_5.py line ~601, get_eagle3_aux_hidden_state_layers hardcodes (2, num_layers // 2, num_layers - 3). If the model has very few layers (e.g., a small variant with < 5 layers), num_layers - 3 could overlap with layer 2 or even be negative. A guard like assert num_layers >= 6 or similar would make this more robust against unexpected model configurations.
Minor: aux_hidden_state_layers is initialized as an empty tuple () in Qwen3NextModel.__init__ but the forward method checks if aux_hidden_states: (checking the list, not the config tuple). This works correctly but could be slightly confusing to future readers — a brief comment clarifying that aux_hidden_states is populated only when self.aux_hidden_state_layers is non-empty would help.
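Something like this near the return would be enough (suggested wording only):

# Note: aux_hidden_states is populated only when
# self.aux_hidden_state_layers is non-empty, so this check also
# signals whether Eagle3 collection was requested.
if aux_hidden_states:
    return hidden_states, aux_hidden_states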
Code Review
This pull request adds support for EAGLE-3 speculative decoding to Qwen3.5 models, which is a great enhancement for inference performance. The changes are well-structured and follow the existing patterns for Eagle3 support in other models within the vLLM codebase. The implementation correctly collects auxiliary hidden states with minimal overhead when the feature is not active. I have one suggestion to improve the robustness of the layer index generation to ensure it gracefully handles models with a small number of layers.
vllm/model_executor/models/qwen3_5.py (584-586)
The current implementation for generating auxiliary layer indices can produce out-of-bounds (e.g., negative or too large) or duplicate values when num_layers is small. While the current usage with the in operator in the forward pass implicitly filters these invalid indices, this approach is brittle and not explicit. For robustness and clarity, it's better to ensure that the returned tuple contains only unique, sorted, and valid layer indices. This prevents potential issues if this method is used in other contexts in the future where direct indexing might be assumed.
def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
    num_layers = len(self.model.layers)
    indices = {2, num_layers // 2, num_layers - 3}
    return tuple(sorted(i for i in indices if 0 <= i < num_layers))
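For example, with a hypothetical 4-layer model the raw indices are {2, 4 // 2, 4 - 3} = {1, 2}, so the guarded version returns (1, 2); with 2 layers the raw set {2, 1, -1} filters down to (1,). The original tuple form would instead contain duplicate or out-of-range values in both cases.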
class Qwen3_5ForCausalLMBase(
    nn.Module,
    HasInnerState,
    SupportsEagle3,
Please also add SupportsEagle. It's not currently used everywhere but I'm trying to get all the models to have both for consistency, at least for now. See #36063
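i.e., something like this for the base-class signature (sketch):

class Qwen3_5ForCausalLMBase(
    nn.Module,
    HasInnerState,
    SupportsEagle,
    SupportsEagle3,
):
    ...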
@gambletan none of that feedback is relevant. This PR's implementation is the canonical way of handling EAGLE3 support in vLLM.
This PR adds support for EAGLE-3 speculative decoding to Qwen3.5, enabling faster inference with draft models like BLR2/Qwen3.5-9B-Eagle3-ShareGPT.

Changes
Modified Files
- vllm/model_executor/models/qwen3_next.py
- vllm/model_executor/models/qwen3_5.py

Implementation Details
Updated Qwen3NextModel (qwen3_next.py)
- Added an aux_hidden_state_layers attribute to track which layers output auxiliary hidden states
- forward() now collects auxiliary hidden states at the specified global layer indices
- Returns (hidden_states, aux_hidden_states) when auxiliary states are collected; otherwise returns hidden_states unchanged (zero overhead when Eagle3 is not active)
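In outline, the collection pattern looks like the toy stand-in below (a minimal runnable sketch, not the real Qwen3NextModel; the actual forward() also threads positions, residuals, and intermediate tensors):

import torch
from torch import nn

class ToyModel(nn.Module):
    # Toy stand-in demonstrating the aux-hidden-state collection pattern.
    def __init__(self, num_layers: int = 6, hidden: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))
        self.aux_hidden_state_layers: tuple[int, ...] = ()

    def forward(self, hidden_states: torch.Tensor):
        aux_hidden_states: list[torch.Tensor] = []
        for idx, layer in enumerate(self.layers):
            # Record the hidden state at any requested global layer index.
            if idx in self.aux_hidden_state_layers:
                aux_hidden_states.append(hidden_states)
            hidden_states = layer(hidden_states)
        if aux_hidden_states:
            # Eagle3 active: hand back the collected states as well.
            return hidden_states, aux_hidden_states
        return hidden_states  # unchanged contract when Eagle3 is off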
Added SupportsEagle3 Interface to Qwen3_5ForCausalLMBase (qwen3_5.py)
- Added SupportsEagle3 to Qwen3_5ForCausalLMBase class inheritance — inherited automatically by both Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM
- Implemented set_aux_hidden_state_layers() and get_eagle3_aux_hidden_state_layers() returning layer indices (2, num_layers // 2, num_layers - 3)
- Initialized self.aux_hidden_state_layers = () in Qwen3_5Model.__init__ because it calls super(Qwen3NextModel, self).__init__(), skipping Qwen3NextModel.__init__
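The two interface methods are small; in sketch form (the setter body here is an assumption, following the pattern used by other Eagle3-enabled models):

def set_aux_hidden_state_layers(self, layers: tuple[int, ...]) -> None:
    # Assumed body: forward the requested layer indices to the inner model.
    self.model.aux_hidden_state_layers = layers

def get_eagle3_aux_hidden_state_layers(self) -> tuple[int, ...]:
    # Default layer indices as described above.
    num_layers = len(self.model.layers)
    return (2, num_layers // 2, num_layers - 3)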
Testing

Tested with Qwen/Qwen3.5-9B and EAGLE-3 drafter BLR2/Qwen3.5-9B-Eagle3-ShareGPT on mt-bench:

CUDA_VISIBLE_DEVICES=0 python examples/offline_inference/spec_decode.py \
    --model-dir Qwen/Qwen3.5-9B \
    --eagle-dir BLR2/Qwen3.5-9B-Eagle3-ShareGPT \
    --method eagle3 \
    --num-spec-tokens 3 \
    --dataset-name hf \
    --dataset-path philschmid/mt-bench \
    --num-prompts 80 \
    --enable-chunked-prefill \
    --temp 0 \
    --print-output

Related
This implementation follows the same pattern as existing EAGLE-3 support in:
- Qwen2ForCausalLM
- Qwen3ForCausalLM
- LlamaForCausalLM
- Offline inference script