
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue#25883

Merged
mgoin merged 1 commit into vllm-project:main from neuralmagic:fix/eagle3-quantization-config
Sep 29, 2025

Conversation

@rahul-tuli
Contributor

@rahul-tuli rahul-tuli commented Sep 29, 2025

Fixes an Eagle3 speculative decoding failure when using a quantized verifier with an unquantized drafter model, by implementing proper quantization config inheritance.

Related Issue

Fixes #25882

Problem

Eagle3 drafters were incorrectly inheriting the verifier's quantization configuration instead of using their own, causing KeyError: 'layers.0.mlp.down_proj.weight' when loading unquantized drafter weights with quantized verifiers.

Root Cause

In vllm/model_executor/models/llama_eagle3.py:38, the Eagle3 LlamaDecoderLayer was using:

quant_config = vllm_config.quant_config  # BUG: Uses verifier's quantization config

This caused the drafter to expect quantized weight params when the drafter's checkpoint contained unquantized weights.
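The failure mode can be illustrated with a minimal, self-contained sketch (the helper functions and parameter names here are hypothetical stand-ins, not vLLM APIs): a layer built under a quantized config registers packed parameter names, so the plain `.weight` key from an unquantized drafter checkpoint has no match and the loader raises `KeyError`.

```python
def build_param_names(quantized: bool) -> set[str]:
    """Parameter names a layer registers, depending on its quant config.

    Hypothetical illustration: a W4A16-style scheme packs weights and
    adds scales, replacing the plain ".weight" parameter.
    """
    if quantized:
        return {"mlp.down_proj.weight_packed", "mlp.down_proj.weight_scale"}
    return {"mlp.down_proj.weight"}


def load_weights(params: set[str], checkpoint_keys: list[str]) -> None:
    """Look up each checkpoint key in the registered parameters."""
    for key in checkpoint_keys:
        if key not in params:
            raise KeyError(key)


# The drafter checkpoint is unquantized...
ckpt = ["mlp.down_proj.weight"]

# ...but before the fix, the drafter layer was built with the verifier's
# quantized config, so the lookup fails:
try:
    load_weights(build_param_names(quantized=True), ckpt)
except KeyError as e:
    print("KeyError:", e)

# With the drafter's own (unquantized) config, loading succeeds:
load_weights(build_param_names(quantized=False), ckpt)
```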

When Introduced

The issue was introduced by commit a5354b3ed ([Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models) on September 27, 2025, which updated Eagle3 to use the new vLLM v1 API but did not properly handle quantization config inheritance.

Solution

The fix implements a clean inheritance pattern based on the Template Method design pattern:

  1. Base LlamaDecoderLayer - Added configurable get_quant_config() method
  2. Eagle3 LlamaDecoderLayer - Overrides to use drafter's quantization config
  3. Existing Infrastructure - Uses VllmConfig._get_quantization_config() for consistency

Code Changes

vllm/model_executor/models/llama.py

def get_quant_config(self, vllm_config: VllmConfig) -> Optional[QuantizationConfig]:
    """Get quantization config for this layer. Override in subclasses."""
    return vllm_config.quant_config

def __init__(self, vllm_config: VllmConfig, ...):
    # Use configurable method instead of direct access
    quant_config = self.get_quant_config(vllm_config)

vllm/model_executor/models/llama_eagle3.py

def get_quant_config(self, vllm_config: VllmConfig) -> Optional[QuantizationConfig]:
    """Use drafter's quantization config instead of verifier's."""
    draft_model_config = vllm_config.speculative_config.draft_model_config
    draft_load_config = vllm_config.load_config

    return VllmConfig.get_quantization_config(
        draft_model_config, draft_load_config
    ) if draft_model_config else None
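The pattern in the two diffs above can be sketched end to end with simplified stand-ins (these dataclasses are illustrative only, not vLLM's real `VllmConfig` or layer classes): the base layer exposes a `get_quant_config()` hook with the old behavior, and the Eagle3 subclass overrides it to resolve the drafter's own config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizationConfig:
    name: str


@dataclass
class VllmConfig:
    quant_config: Optional[QuantizationConfig]        # verifier's config
    draft_quant_config: Optional[QuantizationConfig]  # drafter's config


class LlamaDecoderLayer:
    def get_quant_config(self, cfg: VllmConfig) -> Optional[QuantizationConfig]:
        # Template Method hook: default keeps the original semantics.
        return cfg.quant_config

    def __init__(self, cfg: VllmConfig):
        self.quant_config = self.get_quant_config(cfg)


class Eagle3DecoderLayer(LlamaDecoderLayer):
    def get_quant_config(self, cfg: VllmConfig) -> Optional[QuantizationConfig]:
        # Override: the drafter resolves its own config, not the verifier's.
        return cfg.draft_quant_config


# Quantized verifier, unquantized drafter:
cfg = VllmConfig(QuantizationConfig("w4a16"), None)
assert LlamaDecoderLayer(cfg).quant_config.name == "w4a16"
assert Eagle3DecoderLayer(cfg).quant_config is None
```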

Script to reproduce linked in the issue!

Before Fix

KeyError: 'layers.0.mlp.down_proj.weight'

After Fix

The drafter loads its own unquantized weights and speculative decoding runs end to end.

Configuration Tested

  • Verifier: RedHatAI/Qwen3-8B-quantized.w4a16 (quantized)
  • Drafter: nm-testing/Speculator-Qwen3-8B-Eagle3-converted-071 (unquantized)
  • Method: Eagle3 with 3 speculative tokens
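The configuration above can be reproduced with a server invocation along these lines (a hedged sketch: the `--speculative-config` flag and its JSON keys are assumed from current vLLM CLI conventions; verify against your vLLM version):

```shell
# Quantized verifier + unquantized Eagle3 drafter; before this fix,
# weight loading failed with the KeyError shown above.
vllm serve RedHatAI/Qwen3-8B-quantized.w4a16 \
  --speculative-config '{
    "method": "eagle3",
    "model": "nm-testing/Speculator-Qwen3-8B-Eagle3-converted-071",
    "num_speculative_tokens": 3
  }'
```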

Files Changed

  • vllm/model_executor/models/llama.py - Added configurable quantization method
  • vllm/model_executor/models/llama_eagle3.py - Eagle3 override implementation
  • tests/speculative_decoding/speculators/test_eagle3.py - Added smoke test


Eagle3 drafters were incorrectly inheriting the verifier's quantization
configuration instead of using their own, causing KeyError when loading
unquantized drafter weights with quantized verifiers.

This implements a clean inheritance pattern where:
- Base LlamaDecoderLayer has configurable get_quant_config() method
- Eagle3 LlamaDecoderLayer overrides to use drafter's quantization config
- Uses existing VllmConfig._get_quantization_config() infrastructure

Fixes speculative decoding with quantized verifier + unquantized drafter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: rtuli@redhat.com

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@mergify mergify bot added llama Related to Llama models speculative-decoding labels Sep 29, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug in Eagle3 speculative decoding where the drafter model incorrectly inherited the verifier's quantization configuration, leading to failures when mixing quantized verifiers and unquantized drafters. The fix is well-designed, employing the Template Method pattern by introducing a get_quant_config method in the base LlamaDecoderLayer. This allows the LlamaDecoderLayer subclass in llama_eagle3.py to override it and correctly fetch the drafter's quantization configuration. The changes are logical and correctly implemented. The addition of a new smoke test case for a quantized model is a valuable addition to prevent regressions. Overall, this is a solid contribution that improves the robustness of speculative decoding.

Collaborator

@benchislett benchislett left a comment


LGTM. Maybe we should consider a broader refactor to include self.get_quant_config(vllm_config) in more models? @rahul-tuli could you create a github issue to track this for future work?

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 29, 2025
@rahul-tuli
Contributor Author

> LGTM. Maybe we should consider a broader refactor to include self.get_quant_config(vllm_config) in more models? @rahul-tuli could you create a github issue to track this for future work?

Sounds good @benchislett I'll create the issue and link it here!

@mgoin mgoin merged commit 145ac73 into vllm-project:main Sep 29, 2025
54 checks passed
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…25883)

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
…llm-project#25883)

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
rahul-tuli added a commit to neuralmagic/vllm that referenced this pull request Oct 13, 2025
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

llama Related to Llama models ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Eagle3 speculative decoding fails with quantized verifier + unquantized drafter

4 participants