[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue #25883
Conversation
…ance Eagle3 drafters were incorrectly inheriting the verifier's quantization configuration instead of using their own, causing a KeyError when loading unquantized drafter weights with quantized verifiers. This implements a clean inheritance pattern where:
- Base `LlamaDecoderLayer` has a configurable `get_quant_config()` method
- Eagle3 `LlamaDecoderLayer` overrides it to use the drafter's quantization config
- Uses the existing `VllmConfig._get_quantization_config()` infrastructure

Fixes speculative decoding with quantized verifier + unquantized drafter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: rtuli@redhat.com
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Code Review
This pull request addresses a critical bug in Eagle3 speculative decoding where the drafter model incorrectly inherited the verifier's quantization configuration, leading to failures when mixing quantized verifiers and unquantized drafters. The fix is well-designed, employing the Template Method pattern by introducing a `get_quant_config` method in the base `LlamaDecoderLayer`. This allows the `LlamaDecoderLayer` subclass in `llama_eagle3.py` to override it and correctly fetch the drafter's quantization configuration. The changes are logical and correctly implemented, and the new smoke test for a quantized model is a valuable safeguard against regressions. Overall, this is a solid contribution that improves the robustness of speculative decoding.
benchislett
left a comment
LGTM. Maybe we should consider a broader refactor to include self.get_quant_config(vllm_config) in more models? @rahul-tuli could you create a github issue to track this for future work?
Sounds good @benchislett I'll create the issue and link it here!
…llm-project#25883) Signed-off-by: Rahul Tuli <rtuli@redhat.com>
…25883) Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
…llm-project#25883) Signed-off-by: Rahul Tuli <rtuli@redhat.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Fixes Eagle3 speculative decoding failure when using quantized verifier with unquantized drafter models by implementing proper quantization config inheritance.
Related Issue
Fixes #25882
Problem
Eagle3 drafters were incorrectly inheriting the verifier's quantization configuration instead of using their own, causing
`KeyError: 'layers.0.mlp.down_proj.weight'` when loading unquantized drafter weights with quantized verifiers.

Root Cause
In `vllm/model_executor/models/llama_eagle3.py:38`, the Eagle3 `LlamaDecoderLayer` was built using the verifier's quantization config. This caused the drafter to expect quantized weight params when the drafter's checkpoint contained unquantized weights.
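The failure mode can be illustrated with a small, hypothetical sketch (the parameter names are illustrative, not vLLM's exact registrations): a layer built with a quantized config registers packed/scale parameters, so the plain weight name from the unquantized drafter checkpoint has no matching entry at load time.

```python
# Hypothetical sketch of the failure mode. A layer constructed with the
# verifier's (quantized) config registers quantized parameter names ...
expected_params = {
    "layers.0.mlp.down_proj.weight_packed",  # illustrative quantized names
    "layers.0.mlp.down_proj.weight_scale",
}
params = dict.fromkeys(expected_params)

# ... but the unquantized drafter checkpoint ships a plain weight tensor.
checkpoint_weights = {"layers.0.mlp.down_proj.weight": "tensor"}

for name, tensor in checkpoint_weights.items():
    try:
        params[name]  # weight loader looks up the checkpoint name
    except KeyError as exc:
        print(f"KeyError: {exc}")  # → KeyError: 'layers.0.mlp.down_proj.weight'
```

With the fix, the drafter layer registers unquantized parameter names, so the lookup succeeds.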
When Introduced
The issue was introduced by commit `a5354b3ed` ([Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models) on September 27, 2025, which updated Eagle3 to use the new vLLM v1 API but didn't properly handle quantization config inheritance.

Solution
Implements a clean inheritance pattern using the Template Method design pattern:
- Base `LlamaDecoderLayer` - Added configurable `get_quant_config()` method
- Eagle3 `LlamaDecoderLayer` - Overrides it to use the drafter's quantization config
- Uses the existing `VllmConfig._get_quantization_config()` for consistency

Code Changes
- `vllm/model_executor/models/llama.py`
- `vllm/model_executor/models/llama_eagle3.py`

Script to reproduce linked in the issue!
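The Template Method pattern described above can be sketched as follows. This is a minimal stand-in, not vLLM's actual classes: the config object and the `"verifier"`/`"drafter"` selector are simplifications for illustration.

```python
# Minimal sketch of the inheritance pattern (hypothetical simplification of
# vLLM's classes): the base layer asks a hook for its quant config, and the
# Eagle3 subclass overrides the hook to return the drafter's own config.

class VllmConfig:
    """Stand-in for vllm.config.VllmConfig."""
    def __init__(self, verifier_quant, drafter_quant):
        self.verifier_quant = verifier_quant
        self.drafter_quant = drafter_quant

    def _get_quantization_config(self, which):
        # Illustrative selector; the real method resolves the config itself.
        return self.verifier_quant if which == "verifier" else self.drafter_quant


class LlamaDecoderLayer:
    def __init__(self, vllm_config):
        # Template method: the hook decides which quant config applies.
        self.quant_config = self.get_quant_config(vllm_config)

    def get_quant_config(self, vllm_config):
        # Base (verifier) layers use the target model's quantization config.
        return vllm_config._get_quantization_config("verifier")


class Eagle3LlamaDecoderLayer(LlamaDecoderLayer):
    def get_quant_config(self, vllm_config):
        # Eagle3 drafter layers use the drafter's own (possibly absent) config.
        return vllm_config._get_quantization_config("drafter")


cfg = VllmConfig(verifier_quant="w4a16", drafter_quant=None)
print(LlamaDecoderLayer(cfg).quant_config)        # w4a16
print(Eagle3LlamaDecoderLayer(cfg).quant_config)  # None
```

The design keeps the base layer's constructor untouched: subclasses change only which config is fetched, not how the layer is built.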
Before Fix
After Fix
Configuration Tested
- Verifier: `RedHatAI/Qwen3-8B-quantized.w4a16` (quantized)
- Drafter: `nm-testing/Speculator-Qwen3-8B-Eagle3-converted-071` (unquantized)
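For reference, the tested pairing might be launched with vLLM's offline API roughly as below; treat the exact `speculative_config` keys and the token count as assumptions, since the reproduction script itself is linked in the issue rather than shown here.

```python
# Hedged sketch: a speculative-decoding config pairing the quantized verifier
# with the unquantized Eagle3 drafter. Key names follow vLLM's
# speculative_config dict; num_speculative_tokens is an illustrative value.
spec_config = {
    "method": "eagle3",
    "model": "nm-testing/Speculator-Qwen3-8B-Eagle3-converted-071",
    "num_speculative_tokens": 3,
}

# Actual launch (requires vLLM and a GPU), commented out here:
# from vllm import LLM
# llm = LLM(model="RedHatAI/Qwen3-8B-quantized.w4a16",
#           speculative_config=spec_config)
# outputs = llm.generate(["Hello, world"])
print(spec_config["method"])
```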
- `vllm/model_executor/models/llama.py` - Added configurable quantization method
- `vllm/model_executor/models/llama_eagle3.py` - Eagle3 override implementation
- `tests/speculative_decoding/speculators/test_eagle3.py` - Added smoke test