
[EAGLE] [Quantization] turn quantization off for draft model initialization#26411

Closed
jmkuebler wants to merge 2 commits into vllm-project:main from jmkuebler:fix_eagle_with_quatn

Conversation

@jmkuebler
Contributor

@jmkuebler jmkuebler commented Oct 8, 2025

Purpose

Fixes #26402

As an alternative, I also considered passing a deep copy of the VllmConfig to the drafter, but that results in errors, because the drafter makes some modifications that would not be propagated back to the original config.
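The deepcopy failure mode can be illustrated with a toy sketch — the `Config` dataclass below is a hypothetical stand-in for `VllmConfig`, not vLLM's actual API. A deep copy isolates the drafter's modifications, so they never reach the shared config:

```python
from copy import deepcopy
from dataclasses import dataclass

# Hypothetical stand-in for VllmConfig; field names are illustrative only.
@dataclass
class Config:
    quantization: str = "compressed-tensors"
    hidden_size: int = 4096

target_cfg = Config()

# If the drafter received a deep copy, its modifications would stay local
# and never reach the target model's config:
draft_cfg = deepcopy(target_cfg)
draft_cfg.hidden_size = 1024  # drafter-side adjustment

print(target_cfg.hidden_size)  # 4096 -- the change is not flushed back
```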

Test Plan

I ran the same commands as detailed in #26402 and verified that drafting now works correctly even when the target model is quantized.

Test Result

Fixes the issue.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

I've reviewed the changes and found a critical issue that could lead to a runtime error. Please see my comment below for details and a suggested fix.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
@jmkuebler jmkuebler changed the title from "[EAGLE] [Quantization} turn quantization off for draft model initialization" to "[EAGLE] [Quantization] turn quantization off for draft model initialization" on Oct 8, 2025
@benchislett
Collaborator

Will this break support for draft models which are quantized? I don't know of any that are used in practice, but I can't be sure that a quantized EAGLE head isn't being used.

@jmkuebler
Contributor Author

@benchislett

> Will this break support for draft models which are quantized? I don't know of any that are used in practice, but I can't be sure that a quantized EAGLE head isn't being used.

Pretty sure the answer is no.

As it currently stands, the draft model is initialized with the target model's quantization config, so any quantization config in the draft model checkpoint would have no effect as of now anyway.
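The general shape of the fix — clearing the quantization config only for the duration of draft-model construction, then restoring it — can be sketched as follows. `quantization_disabled` and `ModelConfigStub` are hypothetical illustrations, not vLLM code:

```python
from contextlib import contextmanager

class ModelConfigStub:
    """Hypothetical stand-in for the shared model config."""
    def __init__(self, quant_config):
        self.quant_config = quant_config

@contextmanager
def quantization_disabled(config):
    # Temporarily clear the (shared) quant config, restoring it afterwards
    # so the target model's quantization settings are untouched.
    saved = config.quant_config
    config.quant_config = None
    try:
        yield config
    finally:
        config.quant_config = saved

cfg = ModelConfigStub("fp8")
with quantization_disabled(cfg):
    assert cfg.quant_config is None   # draft model is built unquantized
assert cfg.quant_config == "fp8"      # target model's config is restored
```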

@jmkuebler
Contributor Author

@benchislett note that #25883 introduced a quant_config overwrite specifically for the EAGLE3 case. That approach still works with my changes.

@robertgshaw2-redhat
Collaborator

@rahul-tuli can you take a look at this?

Member

@tlrmchlsmth tlrmchlsmth left a comment


IMO there should be one consistent way to handle this issue -- either this PR should follow #25883, or we should use this approach: land it as-is and undo #25883.

@rahul-tuli
Contributor

> IMO there should be one consistent way to handle this issue -- either this PR should follow #25883, or we should use this approach: land it as-is and undo #25883.

I agree

@jmkuebler
Contributor Author

jmkuebler commented Oct 8, 2025

I think the approach I take should propagate to all EAGLE variants (and we could thus undo the previous fix). With the other approach, we would have to fix it in each one; it would probably also require a fix for Llama-4 EAGLE if we follow the per-eagle patching.

@rahul-tuli
Contributor

> I think the approach I take should propagate to all EAGLE variants (and we could thus undo the previous fix). With the other approach, we would have to fix it in each one; it would probably also require a fix for Llama-4 EAGLE if we follow the per-eagle patching.

You are correct, in the sense that the previous fix would have to be applied to each *eagle.py file. The only issue I have with this PR right now is the same as @benchislett's: it ignores the drafter's quant config. I'm trying to think of a way such that we don't have to do that. (I get that right now most EAGLE3 heads are not quantized; if we can't come up with a reasonable approach, then we'll go this route.)

@jmkuebler
Contributor Author

@rahul-tuli but I don't think the drafter's quant config is ever used to overwrite the target model's quant_config -- because that would then actually change the target model's config as well. So my change will never overwrite a draft model's quant config.

We propagate the draft model config separately to the model; any application of the draft model's quant config would presumably happen within the specific drafter model class. In that case, my change is not a problem either.

If you prefer, I can also patch llama_eagle.py similarly to what you did. I don't have a strong opinion here.

@jmkuebler
Contributor Author

To be a bit more crisp: self.vllm_config is the config of the target model -- not a copy of it. So if we stored the draft model's quant config in there, we would be doing something very wrong anyway.
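A minimal illustration of the shared-reference point (hypothetical stub class, not vLLM code): because the drafter and target share one config object, any quant config the drafter stored there would leak into the target model.

```python
# Minimal illustration (hypothetical stub, not vLLM code): the drafter holds
# a reference to the very same config object as the target model.
class VllmConfigStub:
    def __init__(self):
        self.quant_config = "target-quant"

target_config = VllmConfigStub()
drafter_config = target_config  # same object -- not a copy

# Any quant config the drafter stored here would leak into the target model:
drafter_config.quant_config = "draft-quant"
print(target_config.quant_config)  # draft-quant
```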

@jmkuebler
Contributor Author

@rahul-tuli wdyt how to proceed?

@rahul-tuli
Contributor

I looked for a better way to do this, but so far the approach from #25883 seems the most reasonable. We will have some duplicate code, which is not great; I will put up a PR for that.

@shreyas269
Contributor

> @rahul-tuli but I don't think the drafter's quant config is ever used to overwrite the target model's quant_config -- because that would then actually change the target model's config as well. So my change will never overwrite a draft model's quant config.

I don't think this is correct. We can have a separate drafter config with @rahul-tuli's approach from #25883, where we override (not overwrite) the quant config. To avoid the duplication, the better way is to define a get_eagle_draft_quant_config helper in utils and then override the get_quant_config method in LlamaDecoderLayer.

Also, we should have the flexibility to have a different quant config for drafter and verifier. There are use cases where we need it.

Also, the current implementation in llama_eagle.py (even with quant_configs) is missing some key functionality related to quantization. I have a working fix for this; just waiting on git hooks passing and approval from my company's open-source office.
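For what it's worth, the de-duplication suggested here might look roughly like this sketch; `get_eagle_draft_quant_config` and `DraftModelConfigStub` are hypothetical names based on the proposal above, not existing vLLM code.

```python
class DraftModelConfigStub:
    """Hypothetical draft-model config carrying its own quantization field."""
    def __init__(self, quantization=None):
        self.quantization = quantization

def get_eagle_draft_quant_config(target_quant_config, draft_model_config):
    """Pick the quant config for the EAGLE draft head (illustrative only).

    Prefer a quantization config declared by the draft checkpoint itself;
    otherwise fall back to no quantization rather than silently inheriting
    the target model's config.
    """
    return getattr(draft_model_config, "quantization", None)

# Unquantized draft head: the target's "fp8" config is NOT inherited.
assert get_eagle_draft_quant_config("fp8", DraftModelConfigStub()) is None
# Quantized draft head: its own config wins.
assert get_eagle_draft_quant_config("fp8", DraftModelConfigStub("awq")) == "awq"
```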

@rahul-tuli
Contributor

>> @rahul-tuli but I don't think the drafter's quant config is ever used to overwrite the target model's quant_config -- because that would then actually change the target model's config as well. So my change will never overwrite a draft model's quant config.
>
> I don't think this is correct. We can have a separate drafter config with @rahul-tuli's approach from #25883, where we override (not overwrite) the quant config. To avoid the duplication, the better way is to define a get_eagle_draft_quant_config helper in utils and then override the get_quant_config method in LlamaDecoderLayer.
>
> Also, we should have the flexibility to have a different quant config for drafter and verifier. There are use cases where we need it.
>
> Also, the current implementation in llama_eagle.py (even with quant_configs) is missing some key functionality related to quantization. I have a working fix for this; just waiting on git hooks passing and approval from my company's open-source office.

I made a fix PR here: #26590. Could you review? Happy to make any changes.

@jmkuebler
Contributor Author

jmkuebler commented Oct 13, 2025

Closing this PR in favor of #26590.
Thanks @rahul-tuli!

@jmkuebler jmkuebler closed this Oct 13, 2025
Development

Successfully merging this pull request may close these issues.

[Bug]: EAGLE incompatible w/ Compressed Tensors Quantized Target Model

6 participants