[Bugfix][Speculative Decoding] Extend Eagle quantization config fix to llama_eagle.py #26590

Merged
robertgshaw2-redhat merged 3 commits into vllm-project:main from neuralmagic:fix-26042
Oct 13, 2025

Conversation

@rahul-tuli
Contributor

@rahul-tuli rahul-tuli commented Oct 10, 2025

This PR extends the quantization config fix from #25883 to llama_eagle.py.

Background

PR #25883 fixed an issue where Eagle3 drafter models were incorrectly using the verifier model's quantization config instead of their own. This caused problems when the drafter and verifier models had different quantization configurations.

The fix introduced a get_quant_config() method in LlamaDecoderLayer that can be overridden by Eagle subclasses to use the draft model's quantization config.

Changes

This PR applies the same pattern to additional Eagle drafters:

llama_eagle.py

  • Adds a get_quant_config() method to the LlamaDecoderLayer subclass in llama_eagle.py
  • The method retrieves the quantization config from the draft model instead of the verifier model
  • Uses VllmConfig.get_quantization_config() to obtain the draft model's quantization config correctly
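The override pattern described above can be sketched as follows. This is an illustrative stand-in, not the actual vLLM code: `StubVllmConfig`, `DecoderLayer`, and `EagleDecoderLayer` are simplified stubs, and the `draft=` keyword on `get_quantization_config` is a hypothetical simplification of the real `VllmConfig` API.

```python
# Illustrative sketch of the get_quant_config() override pattern from
# #25883, using stub classes. The real vLLM types (VllmConfig,
# LlamaDecoderLayer, etc.) carry far more state; only the dispatch
# pattern is shown here.
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubVllmConfig:
    """Stand-in for vllm.config.VllmConfig (hypothetical fields)."""
    verifier_quant: Optional[str] = None
    draft_quant: Optional[str] = None

    def get_quantization_config(self, *, draft: bool) -> Optional[str]:
        # Hypothetical helper: return the draft model's quant config
        # when asked, otherwise the verifier's.
        return self.draft_quant if draft else self.verifier_quant


class DecoderLayer:
    """Base layer: resolves to the verifier model's quantization config."""

    def get_quant_config(self, vllm_config: StubVllmConfig) -> Optional[str]:
        return vllm_config.get_quantization_config(draft=False)


class EagleDecoderLayer(DecoderLayer):
    """Eagle drafter layer: overrides to use the draft model's config."""

    def get_quant_config(self, vllm_config: StubVllmConfig) -> Optional[str]:
        return vllm_config.get_quantization_config(draft=True)


# An FP8 verifier paired with an unquantized Eagle drafter: before the
# fix, the drafter would have picked up the verifier's "fp8" config.
cfg = StubVllmConfig(verifier_quant="fp8", draft_quant=None)
print(DecoderLayer().get_quant_config(cfg))       # verifier config
print(EagleDecoderLayer().get_quant_config(cfg))  # draft config
```

The key point is that layer construction only calls `get_quant_config()`, so subclassing is enough to redirect Eagle layers to the draft model's config without touching the shared Llama layer code.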

Impact

The fix ensures that Eagle drafter models correctly use their own quantization configuration, preventing quantization mismatches when used with differently quantized verifier models.

Notes


Fixes: extends the fix from #25883
Related: #25883

Verification

CUDA_VISIBLE_DEVICES=0 python examples/offline_inference/spec_decode.py \
  --method "eagle" \
  --tp 1 \
  --print-output \
  --model-dir "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic" \
  --eagle-dir "yuhuili/EAGLE-LLaMA3.1-Instruct-8B" \
  --dataset_name "hf" \
  --dataset_path "philschmid/mt-bench" \
  --num-spec-tokens 5 2>&1 | tee local/acceptance-rates-eagle.txt

Output:

--------------------------------------------------
--------------------------------------------------
total_num_output_tokens: 210113
num_drafts: 87017
num_draft_tokens: 435085
num_accepted_tokens: 123333
mean acceptance length: 2.42
--------------------------------------------------
acceptance at token 0: 0.68
acceptance at token 1: 0.38
acceptance at token 2: 0.20
acceptance at token 3: 0.10
acceptance at token 4: 0.06
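
As a sanity check, the summary metrics follow from the raw counters above. This derivation reflects my reading of the stats (each drafting step contributes one bonus token from the target model plus the accepted draft tokens); it is not taken from the script itself.

```python
# Recompute the summary metrics from the counters printed above.
num_drafts = 87017
num_draft_tokens = 435085
num_accepted_tokens = 123333

# Each drafting step yields 1 target-model token plus however many
# draft tokens were accepted, so the mean acceptance length is:
mean_acceptance_length = 1 + num_accepted_tokens / num_drafts
print(f"{mean_acceptance_length:.2f}")  # 2.42

# Overall per-token acceptance rate across all proposed draft tokens:
print(f"{num_accepted_tokens / num_draft_tokens:.2f}")  # 0.28
```

The recomputed 2.42 matches the reported mean acceptance length, which is a quick way to confirm the drafter is behaving sanely after the config change.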

@mergify mergify bot added the llama (Related to Llama models) and speculative-decoding labels Oct 10, 2025
Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@rahul-tuli rahul-tuli marked this pull request as ready for review October 13, 2025 13:16
@rahul-tuli rahul-tuli changed the title from "Extend: fix from #25883 to llama_eagle.py" to "[Bugfix][Speculative Decoding] Extend Eagle quantization config fix to llama_eagle.py" Oct 13, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Rahul Tuli <rtuli@redhat.com>
@jmkuebler
Contributor

Thx @rahul-tuli, the fix LGTM and solves #26402.

I wouldn't expect this to correctly handle cases where the Eagle head is actually quantized. For example, I believe the ignore list would need to be written taking into account that the layers are registered as further layers in the original model, etc. But that is beyond the pressing issue.

Thanks!

@markmc markmc added the ready (ONLY add when PR is ready to merge/full CI is needed) label Oct 13, 2025
@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) October 13, 2025 15:14
Member

@yewentao256 yewentao256 left a comment


LGTM, thanks for the work!

@shreyas269
Contributor

I tested these changes with quantized Eagle head. LGTM!

@robertgshaw2-redhat robertgshaw2-redhat merged commit e3b90c1 into vllm-project:main Oct 13, 2025
55 checks passed
