
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue#25883

Merged
mgoin merged 1 commit into vllm-project:main from neuralmagic:fix/eagle3-quantization-config
Sep 29, 2025

Conversation

@rahul-tuli
Contributor

@rahul-tuli rahul-tuli commented Sep 29, 2025

Fixes an Eagle3 speculative decoding failure when using a quantized verifier with an unquantized drafter model, by implementing proper quantization config inheritance.

Related Issue

Fixes #25882

Problem

Eagle3 drafters were incorrectly inheriting the verifier's quantization configuration instead of using their own, causing KeyError: 'layers.0.mlp.down_proj.weight' when loading unquantized drafter weights with quantized verifiers.

Root Cause

In vllm/model_executor/models/llama_eagle3.py:38, the Eagle3 LlamaDecoderLayer was using:

quant_config = vllm_config.quant_config  # BUG: Uses verifier's quantization config

This caused the drafter to expect quantized weight params when the drafter's checkpoint contained unquantized weights.
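The failure mode can be illustrated with a minimal, self-contained sketch (the helper functions and parameter names here are hypothetical stand-ins, not vLLM APIs): a layer built under a quantized config registers packed parameter names, so the plain `.weight` key from an unquantized drafter checkpoint has no match and the loader raises `KeyError`.

```python
def build_param_names(quantized: bool) -> set[str]:
    """Parameter names a layer registers, depending on its quant config.

    Hypothetical illustration: a W4A16-style scheme packs weights and
    adds scales, replacing the plain ".weight" parameter.
    """
    if quantized:
        return {"mlp.down_proj.weight_packed", "mlp.down_proj.weight_scale"}
    return {"mlp.down_proj.weight"}


def load_weights(params: set[str], checkpoint_keys: list[str]) -> None:
    """Look up each checkpoint key in the registered parameters."""
    for key in checkpoint_keys:
        if key not in params:
            raise KeyError(key)


# The drafter checkpoint is unquantized...
ckpt = ["mlp.down_proj.weight"]

# ...but before the fix, the drafter layer was built with the verifier's
# quantized config, so the lookup fails:
try:
    load_weights(build_param_names(quantized=True), ckpt)
except KeyError as e:
    print("KeyError:", e)

# With the drafter's own (unquantized) config, loading succeeds:
load_weights(build_param_names(quantized=False), ckpt)
```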

When Introduced

The issue was introduced by commit a5354b3ed ([Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models) on September 27, 2025, which updated Eagle3 to use the new vLLM v1 API but did not properly handle quantization config inheritance.

Solution

The fix implements a clean inheritance pattern based on the Template Method design pattern:

  1. Base LlamaDecoderLayer - Added configurable get_quant_config() method
  2. Eagle3 LlamaDecoderLayer - Overrides to use drafter's quantization config
  3. Existing Infrastructure - Uses VllmConfig._get_quantization_config() for consistency

Code Changes

vllm/model_executor/models/llama.py

def get_quant_config(self, vllm_config: VllmConfig) -> Optional[QuantizationConfig]:
    """Get quantization config for this layer. Override in subclasses."""
    return vllm_config.quant_config

def __init__(self, vllm_config: VllmConfig, ...):
    # Use configurable method instead of direct access
    quant_config = self.get_quant_config(vllm_config)

vllm/model_executor/models/llama_eagle3.py

def get_quant_config(self, vllm_config: VllmConfig) -> Optional[QuantizationConfig]:
    """Use drafter's quantization config instead of verifier's."""
    draft_model_config = vllm_config.speculative_config.draft_model_config
    draft_load_config = vllm_config.load_config

    return VllmConfig.get_quantization_config(
        draft_model_config, draft_load_config
    ) if draft_model_config else None
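The pattern in the two diffs above can be sketched end to end with simplified stand-ins (these dataclasses are illustrative only, not vLLM's real `VllmConfig` or layer classes): the base layer exposes a `get_quant_config()` hook with the old behavior, and the Eagle3 subclass overrides it to resolve the drafter's own config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantizationConfig:
    name: str


@dataclass
class VllmConfig:
    quant_config: Optional[QuantizationConfig]        # verifier's config
    draft_quant_config: Optional[QuantizationConfig]  # drafter's config


class LlamaDecoderLayer:
    def get_quant_config(self, cfg: VllmConfig) -> Optional[QuantizationConfig]:
        # Template Method hook: default keeps the original semantics.
        return cfg.quant_config

    def __init__(self, cfg: VllmConfig):
        self.quant_config = self.get_quant_config(cfg)


class Eagle3DecoderLayer(LlamaDecoderLayer):
    def get_quant_config(self, cfg: VllmConfig) -> Optional[QuantizationConfig]:
        # Override: the drafter resolves its own config, not the verifier's.
        return cfg.draft_quant_config


# Quantized verifier, unquantized drafter:
cfg = VllmConfig(QuantizationConfig("w4a16"), None)
assert LlamaDecoderLayer(cfg).quant_config.name == "w4a16"
assert Eagle3DecoderLayer(cfg).quant_config is None
```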

Script to reproduce linked in the issue!

Before Fix

KeyError: 'layers.0.mlp.down_proj.weight'

After Fix

The drafter loads its own unquantized weights and speculative decoding runs end to end.

Configuration Tested

  • Verifier: RedHatAI/Qwen3-8B-quantized.w4a16 (quantized)
  • Drafter: nm-testing/Speculator-Qwen3-8B-Eagle3-converted-071 (unquantized)
  • Method: Eagle3 with 3 speculative tokens
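The configuration above can be reproduced with a server invocation along these lines (a hedged sketch: the `--speculative-config` flag and its JSON keys are assumed from current vLLM CLI conventions; verify against your vLLM version):

```shell
# Quantized verifier + unquantized Eagle3 drafter; before this fix,
# weight loading failed with the KeyError shown above.
vllm serve RedHatAI/Qwen3-8B-quantized.w4a16 \
  --speculative-config '{
    "method": "eagle3",
    "model": "nm-testing/Speculator-Qwen3-8B-Eagle3-converted-071",
    "num_speculative_tokens": 3
  }'
```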

Files Changed

  • vllm/model_executor/models/llama.py - Added configurable quantization method
  • vllm/model_executor/models/llama_eagle3.py - Eagle3 override implementation
  • tests/speculative_decoding/speculators/test_eagle3.py - Added smoke test


Eagle3 drafters were incorrectly inheriting the verifier's quantization
configuration instead of using their own, causing KeyError when loading
unquantized drafter weights with quantized verifiers.

This implements a clean inheritance pattern where:
- Base LlamaDecoderLayer has configurable get_quant_config() method
- Eagle3 LlamaDecoderLayer overrides to use drafter's quantization config
- Uses existing VllmConfig._get_quantization_config() infrastructure

Fixes speculative decoding with quantized verifier + unquantized drafter.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: rtuli@redhat.com

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@mergify mergify bot added llama Related to Llama models speculative-decoding labels Sep 29, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a critical bug in Eagle3 speculative decoding where the drafter model incorrectly inherited the verifier's quantization configuration, leading to failures when mixing quantized verifiers and unquantized drafters. The fix is well-designed, employing the Template Method pattern by introducing a get_quant_config method in the base LlamaDecoderLayer. This allows the LlamaDecoderLayer subclass in llama_eagle3.py to override it and correctly fetch the drafter's quantization configuration. The changes are logical and correctly implemented. The addition of a new smoke test case for a quantized model is a valuable addition to prevent regressions. Overall, this is a solid contribution that improves the robustness of speculative decoding.

Collaborator

@benchislett benchislett left a comment


LGTM. Maybe we should consider a broader refactor to include self.get_quant_config(vllm_config) in more models? @rahul-tuli could you create a github issue to track this for future work?

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 29, 2025
@rahul-tuli
Contributor Author

> LGTM. Maybe we should consider a broader refactor to include self.get_quant_config(vllm_config) in more models? @rahul-tuli could you create a github issue to track this for future work?

Sounds good @benchislett I'll create the issue and link it here!

@mgoin mgoin merged commit 145ac73 into vllm-project:main Sep 29, 2025
54 checks passed
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…25883)

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
…llm-project#25883)

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
rahul-tuli added a commit to neuralmagic/vllm that referenced this pull request Oct 13, 2025
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

llama Related to Llama models ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Eagle3 speculative decoding fails with quantized verifier + unquantized drafter

4 participants