
[EAGLE] [Quantization] turn quantization off for draft model initialization#26411

Closed
jmkuebler wants to merge 2 commits into vllm-project:main from jmkuebler:fix_eagle_with_quatn

Conversation

@jmkuebler
Contributor

@jmkuebler jmkuebler commented Oct 8, 2025

Purpose

Fixes #26402

As an alternative, I also considered passing a deep copy of the VllmConfig to the drafter, but that results in errors, because the drafter makes some modifications that would not be propagated back to the original config.
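The deepcopy failure mode can be illustrated with a toy sketch — the `Config` dataclass below is a hypothetical stand-in for `VllmConfig`, not vLLM's actual API. A deep copy isolates the drafter's modifications, so they never reach the shared config:

```python
from copy import deepcopy
from dataclasses import dataclass

# Hypothetical stand-in for VllmConfig; field names are illustrative only.
@dataclass
class Config:
    quantization: str = "compressed-tensors"
    hidden_size: int = 4096

target_cfg = Config()

# If the drafter received a deep copy, its modifications would stay local
# and never reach the target model's config:
draft_cfg = deepcopy(target_cfg)
draft_cfg.hidden_size = 1024  # drafter-side adjustment

print(target_cfg.hidden_size)  # 4096 -- the change is not flushed back
```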

Test Plan

I ran the same commands as detailed in #26402 and verified that drafting now works correctly even when the target model is quantized.

Test Result

Fixes the issue.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

I've reviewed the changes and found a critical issue that could lead to a runtime error. Please see my comment below for details and a suggested fix.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Signed-off-by: Jonas Kuebler <kuebj@amazon.com>
@jmkuebler jmkuebler changed the title from "[EAGLE] [Quantization} turn quantization off for draft model initialization" to "[EAGLE] [Quantization] turn quantization off for draft model initialization" on Oct 8, 2025
@benchislett
Collaborator

Will this break support for draft models which are quantized? I don't know of any that are used in practice, but I can't be sure that a quantized EAGLE head isn't being used.

@jmkuebler
Contributor Author

@benchislett

> Will this break support for draft models which are quantized? I don't know of any that are used in practice, but I can't be sure that a quantized EAGLE head isn't being used.

Pretty sure the answer is no.

As it currently stands, the draft model is initialized with the target model's quantization config, so any quantization config in the draft model checkpoint would have no effect as of now anyway.
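The general shape of the fix — clearing the quantization config only for the duration of draft-model construction, then restoring it — can be sketched as follows. `quantization_disabled` and `ModelConfigStub` are hypothetical illustrations, not vLLM code:

```python
from contextlib import contextmanager

class ModelConfigStub:
    """Hypothetical stand-in for the shared model config."""
    def __init__(self, quant_config):
        self.quant_config = quant_config

@contextmanager
def quantization_disabled(config):
    # Temporarily clear the (shared) quant config, restoring it afterwards
    # so the target model's quantization settings are untouched.
    saved = config.quant_config
    config.quant_config = None
    try:
        yield config
    finally:
        config.quant_config = saved

cfg = ModelConfigStub("fp8")
with quantization_disabled(cfg):
    assert cfg.quant_config is None   # draft model is built unquantized
assert cfg.quant_config == "fp8"      # target model's config is restored
```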

@jmkuebler
Contributor Author

@benchislett note that #25883 introduced a quant_config overwrite specifically for the EAGLE3 case. That approach still works with my changes.

@robertgshaw2-redhat
Collaborator

@rahul-tuli can you take a look at this?

Member

@tlrmchlsmth tlrmchlsmth left a comment


IMO there should be one consistent way to handle this issue -- either this PR should follow #25883, or we should use this approach: land it as-is and undo #25883.

@rahul-tuli
Contributor

> IMO there should be one consistent way to handle this issue -- either this PR should follow #25883, or we should use this approach: land it as-is and undo #25883.

I agree

@jmkuebler
Contributor Author

jmkuebler commented Oct 8, 2025

I think the approach I take should propagate to all EAGLE variants (and we could thus undo the previous fix). With the other approach, we would have to fix it in each one; it would probably also require a fix for Llama-4 EAGLE if we follow the per-eagle patching.

@rahul-tuli
Contributor

> I think the approach I take should propagate to all EAGLE variants (and we could thus undo the previous fix). With the other approach, we would have to fix it in each one; it would probably also require a fix for Llama-4 EAGLE if we follow the per-eagle patching.

You are correct, in the sense that the previous fix would have to be applied to each *eagle.py file. The only issue I have with this PR right now is the same as @benchislett's: it ignores the drafter's quant config. I'm trying to think of a way such that we don't have to do that. (I get that right now most EAGLE3 heads are not quantized; if we can't come up with a reasonable approach, then we'll go this route.)

@jmkuebler
Contributor Author

@rahul-tuli but I don't think the drafter's quant config is ever used to overwrite the target model's quant_config -- because that would then actually change the target model's config as well. So my change will never overwrite a draft model's quant config.

We propagate the draft model config separately to the model; any application of the draft model's quant config would presumably happen within the specific drafter model class. In that case, my change is not a problem either.

If you prefer, I can also patch llama_eagle.py similarly to what you did. I don't have a strong opinion here.

@jmkuebler
Contributor Author

To be a bit more crisp: self.vllm_config is the config of the target model -- not a copy of it. So if we stored the draft model's quant config in there, we would be doing something very wrong anyway.
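A minimal illustration of the shared-reference point (hypothetical stub class, not vLLM code): because the drafter and target share one config object, any quant config the drafter stored there would leak into the target model.

```python
# Minimal illustration (hypothetical stub, not vLLM code): the drafter holds
# a reference to the very same config object as the target model.
class VllmConfigStub:
    def __init__(self):
        self.quant_config = "target-quant"

target_config = VllmConfigStub()
drafter_config = target_config  # same object -- not a copy

# Any quant config the drafter stored here would leak into the target model:
drafter_config.quant_config = "draft-quant"
print(target_config.quant_config)  # draft-quant
```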

@jmkuebler
Contributor Author

@rahul-tuli wdyt how to proceed?

@rahul-tuli
Contributor

I looked for a better way to do this, but so far the approach from #25883 seems the most reasonable. We will have some duplicate code, which is not great; I will put up a PR for that.

@shreyas269
Contributor

> @rahul-tuli but I don't think the drafter's quant config is ever used to overwrite the target model's quant_config -- because that would then actually change the target model's config as well. So my change will never overwrite a draft model's quant config.

I don't think this is correct. We can have a separate drafter config with @rahul-tuli's approach from #25883, where we override (not overwrite) the quant config. To avoid the duplication, the better way is to define a get_eagle_draft_quant_config helper in utils and then override the get_quant_config method in LlamaDecoderLayer.

Also, we should have the flexibility to have a different quant config for drafter and verifier. There are use cases where we need it.

Also, the current implementation in llama_eagle.py (even with quant_configs) is missing some key functionality related to quantization. I have a working fix for this; just waiting on git hooks passing and approval from my company's open-source office.
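For what it's worth, the de-duplication suggested here might look roughly like this sketch; `get_eagle_draft_quant_config` and `DraftModelConfigStub` are hypothetical names based on the proposal above, not existing vLLM code.

```python
class DraftModelConfigStub:
    """Hypothetical draft-model config carrying its own quantization field."""
    def __init__(self, quantization=None):
        self.quantization = quantization

def get_eagle_draft_quant_config(target_quant_config, draft_model_config):
    """Pick the quant config for the EAGLE draft head (illustrative only).

    Prefer a quantization config declared by the draft checkpoint itself;
    otherwise fall back to no quantization rather than silently inheriting
    the target model's config.
    """
    return getattr(draft_model_config, "quantization", None)

# Unquantized draft head: the target's "fp8" config is NOT inherited.
assert get_eagle_draft_quant_config("fp8", DraftModelConfigStub()) is None
# Quantized draft head: its own config wins.
assert get_eagle_draft_quant_config("fp8", DraftModelConfigStub("awq")) == "awq"
```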

@rahul-tuli
Contributor

>> @rahul-tuli but I don't think the drafter's quant config is ever used to overwrite the target model's quant_config -- because that would then actually change the target model's config as well. So my change will never overwrite a draft model's quant config.
>
> I don't think this is correct. We can have a separate drafter config with @rahul-tuli's approach from #25883, where we override (not overwrite) the quant config. To avoid the duplication, the better way is to define a get_eagle_draft_quant_config helper in utils and then override the get_quant_config method in LlamaDecoderLayer.
>
> Also, we should have the flexibility to have a different quant config for drafter and verifier. There are use cases where we need it.
>
> Also, the current implementation in llama_eagle.py (even with quant_configs) is missing some key functionality related to quantization. I have a working fix for this; just waiting on git hooks passing and approval from my company's open-source office.

I made a fix PR here: #26590. Could you review? Happy to make any changes.

@jmkuebler
Contributor Author

jmkuebler commented Oct 13, 2025

Closing this PR in favor of #26590.
Thanks @rahul-tuli!

@jmkuebler jmkuebler closed this Oct 13, 2025
Development

Successfully merging this pull request may close these issues.

[Bug]: EAGLE incompatible w/ Compressed Tensors Quantized Target Model

6 participants