[Quantization] [Eagle] Add complete quantization support to the draft model in Eagle #28435
Conversation
… model in Eagle Signed-off-by: Shreyas Kulkarni <shreyas.gp269@gmail.com>
Code Review
This pull request introduces comprehensive quantization support for Eagle and Eagle3 draft models, which is a significant enhancement. The changes are well-structured, including the refactoring of get_draft_quant_config into a shared utility to reduce code duplication. The replacement of torch.nn.Linear with ReplicatedLinear is correctly implemented to enable quantization for the fully connected layers. The addition of a dedicated test file with smoke tests for the new quantization features is also a great contribution that improves the robustness of the codebase.
I have one suggestion regarding code duplication in the load_weights methods of llama_eagle.py and llama_eagle3.py. Extracting the logic for handling KV cache scales into a shared function would further improve maintainability.
Overall, this is a well-executed PR that successfully adds an important feature.
```python
# Handle kv cache quantization scales
if self.quant_config is not None and (
    scale_name := self.quant_config.get_cache_scale(name)
):
    # Loading kv cache quantization scales
    param = params_dict[scale_name]
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    loaded_weight = (
        loaded_weight if loaded_weight.dim() == 0 else loaded_weight[0]
    )
    weight_loader(param, loaded_weight)
    loaded_params.add(scale_name)
    continue
# Remapping the name FP8 kv-scale
if "scale" in name:
    name = maybe_remap_kv_scale_name(name, params_dict)
    if name is None:
        continue
```
This block of code for handling KV cache quantization scales and remapping FP8 kv-scale names is duplicated in vllm/model_executor/models/llama_eagle3.py (lines 220-237). To improve maintainability and avoid potential inconsistencies in the future, consider refactoring this logic into a shared utility function in vllm/model_executor/models/utils.py. You've already done this for get_draft_quant_config, and a similar approach would be beneficial here.
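A shared helper along the lines suggested might look like the sketch below. The function name and the convention of returning `None` to mean "weight consumed or skip it" are assumptions, not vLLM's actual API; in the real code, `remap_fn` would be `maybe_remap_kv_scale_name` and `default_loader` would be `default_weight_loader` from vLLM's weight-loading utilities.

```python
# Hedged sketch of a shared KV-cache scale helper; names are hypothetical.
def maybe_load_kv_cache_scale(name, loaded_weight, params_dict, quant_config,
                              loaded_params, remap_fn, default_loader):
    """Return the (possibly remapped) checkpoint name to keep loading with,
    or None if the weight was consumed as a cache scale or should be skipped.
    """
    if quant_config is not None and (
        scale_name := quant_config.get_cache_scale(name)
    ):
        param = params_dict[scale_name]
        loader = getattr(param, "weight_loader", default_loader)
        # Scalar scales are sometimes checkpointed as 1-element tensors.
        if loaded_weight.dim() != 0:
            loaded_weight = loaded_weight[0]
        loader(param, loaded_weight)
        loaded_params.add(scale_name)
        return None  # weight fully handled
    if "scale" in name:
        return remap_fn(name, params_dict)  # may return None -> caller skips
    return name
```

Both `load_weights` loops could then replace the duplicated block with a call to this helper followed by `if name is None: continue`.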
CC @mgoin @Isotr0py @yewentao256 for quantization and @luccafong @benchislett for eagle
Isotr0py
left a comment
The quantization part looks reasonable to me. See if others are fine with quantization for eagle. (I don't have much knowledge about spec decode)
Seems that we should put this test under tests/models/quantization or tests/quantization?
@Isotr0py, Thanks for the review! I placed the test in tests/model_executor because this is testing the model executor's quantization integration for the Eagle model architecture, rather than end-to-end inference with quantized models.
That said, I'm happy to move it if you think tests/models/quantization or tests/quantization would be better for discoverability.
Hey folks, just a quick follow-up on this PR. The reviewers were tagged previously (tagging them again below), and I’m happy to incorporate any changes needed. Whenever someone has time to take a look, I’d appreciate it! :) CC @mgoin @Isotr0py @yewentao256 @luccafong @benchislett @rahul-tuli
mgoin
left a comment
Looks good to me, although I think the test can be improved to use a real model
I don't love all this Mocking and working with low-level classes and configs. I'd like to rather replace this with a single model e2e smoke test so we see that the config parsing, weight loading, and such all work together with a known model
I agree that using a single real model would be cleaner. The only reason I didn’t do that here is that our internal model isn’t publicly shareable, so I couldn’t include it in the test. We could get a public drafter model released, but that could take time to go through approval since I'm contributing via Capital One.
Given that constraint, this felt like the most straightforward solution. Happy to update this if you have a suggestion for a better approach.
Signed-off-by: Shreyas Kulkarni <shreyas.gp269@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Shreyas Kulkarni <shreyas.gp269@gmail.com>
The PR looks good considering the constraints on using a real model; a follow-up PR with a real model would be nice. Could you also fix the pre-commit checks?
@shreyas269 please fix the pre-commit as I cannot do it for you due to branch permissions, thank you!
Signed-off-by: Shreyas Kulkarni <shreyas.gp269@gmail.com>
@mgoin, @rahul-tuli, fixed the pre-commit. |
… model in Eagle (vllm-project#28435) Signed-off-by: Shreyas Kulkarni <shreyas.gp269@gmail.com>
Purpose
This PR adds comprehensive quantization support for Eagle and Eagle3 draft models in speculative decoding, including full KV cache quantization support. Previously, Eagle draft models could not use quantized weights in their fully connected layers or quantized KV caches.
Recently, #26590 was merged to properly obtain the draft model's quantization config, but it doesn't address the case where the entire draft model is quantized and we want to read the input and weight scales of the `fc` layer along with the KV cache quantization scales. This PR addresses the following:

- Moves `get_draft_quant_config` into `utils` to avoid duplication of code in `llama_eagle.py` and `llama_eagle3.py`.
- Uses the `ReplicatedLinear` class to make the `fc` layer in drafters quantizable and to handle input and weight quantization scales (in llama with eagle/eagle3).

This PR is a duplicate of #27434 with an additional smoke test.
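The reason the `fc` layer had to move off `torch.nn.Linear` can be illustrated with a minimal pure-Python stand-in (this is not vLLM's code; the class below only mirrors the shape of `ReplicatedLinear`'s constructor): a plain `nn.Linear` has no notion of a quantization config, while vLLM's quantizable linear layers ask the config for a quant method when building their weights and scales.

```python
# Simplified stand-ins, not vLLM classes; they illustrate why threading
# quant_config through the fc layer enables quantization.
class FakeFp8Config:
    def get_quant_method(self, layer, prefix):
        # Real vLLM configs return a quant-method object that creates
        # quantized weights and input/weight scales for the layer.
        return "fp8-linear-method"

class ReplicatedLinearSketch:
    def __init__(self, input_size, output_size, bias=False,
                 quant_config=None, prefix=""):
        self.input_size, self.output_size = input_size, output_size
        # Fall back to unquantized weights when no quant_config is supplied.
        self.quant_method = (quant_config.get_quant_method(self, prefix)
                             if quant_config is not None else "unquantized")

# Before the PR: self.fc = torch.nn.Linear(...) -> never quantized.
# After (shape of the change): the drafter's fc layer receives quant_config.
fc = ReplicatedLinearSketch(8192, 4096, bias=False,
                            quant_config=FakeFp8Config(),
                            prefix="model.fc")
```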
Test Plan
Tested with a base llama3 instruct model with a quantized Eagle draft model (one decoder layer + one FC layer) with static fp8 quantization. The quantization of the base/verifier and Eagle draft model was performed using ModelOpt.
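For reference, a setup like the one described above could be exercised with vLLM's offline API roughly as follows; the drafter path is a placeholder for the ModelOpt-quantized EAGLE checkpoint, and the token count is illustrative:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # base/verifier model
    speculative_config={
        "method": "eagle",
        "model": "/path/to/fp8-quantized-eagle-drafter",  # placeholder path
        "num_speculative_tokens": 2,
    },
)
```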
The non-quantized models work exactly the same as before (no changes to behavior).
- Added `test_eagle_quantization.py` with unit tests for draft model quantization.
- Tested `get_draft_quant_config()` with and without draft models.

Test Result
Before:
`KeyError: 'fc.input_scale'`

After:
Acceptance rate of drafter:
The `test_eagle_quantization` tests and all pre-commit hooks pass successfully.

Essential Elements of an Effective PR Description Checklist

- `supported_models.md` and `examples` for a new model.