[Eagle] [Quantization] Add complete quantization support to the draft model in Eagle #27434
Conversation
Code Review
This pull request adds comprehensive quantization support for Eagle and Eagle3 draft models, which is a valuable feature. The implementation is solid, including the use of ReplicatedLinear for quantizable layers and the refactoring of get_draft_quant_config to reduce code duplication. My review identifies one area for improvement: there's a new block of duplicated code for handling KV cache scales in the load_weights methods of both llama_eagle.py and llama_eagle3.py. Extracting this into a shared utility would enhance the long-term maintainability of the code.
```python
# Handle kv cache quantization scales
if self.quant_config is not None and (
    scale_name := self.quant_config.get_cache_scale(name)
):
    # Loading kv cache quantization scales
    param = params_dict[scale_name]
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    loaded_weight = (
        loaded_weight if loaded_weight.dim() == 0 else loaded_weight[0]
    )
    weight_loader(param, loaded_weight)
    loaded_params.add(scale_name)
    continue
# Remapping the name FP8 kv-scale
if "scale" in name:
    name = maybe_remap_kv_scale_name(name, params_dict)
    if name is None:
        continue
```
This block of code for handling KV cache quantization scales and remapping FP8 scale names is duplicated in vllm/model_executor/models/llama_eagle3.py. To improve maintainability and avoid potential bugs from inconsistent updates, this logic should be extracted into a shared utility function, perhaps in vllm.model_executor.model_loader.weight_utils. This would follow the same good practice you've already applied by refactoring get_draft_quant_config.
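One way the suggested extraction could look is sketched below. The helper name `load_kv_cache_scale` and the `Fake*` stand-ins are illustrative (only `get_cache_scale` and `default_weight_loader` appear in the diff itself); in vLLM the parameters and weights would be real torch tensors, not these minimal fakes. The helper returns `True` when it consumes the tensor, so both `load_weights` methods can simply `continue`:

```python
# Stand-ins so the sketch is self-contained; in vLLM these would be real
# torch parameters and quantization configs.
class FakeScalar:
    def __init__(self, value=None):
        self.value = value

    def dim(self):
        return 0  # behaves like a 0-dim (scalar) tensor


class FakeQuantConfig:
    def get_cache_scale(self, name):
        # Illustrative mapping, e.g.
        # "layers.0.self_attn.k_scale" -> "layers.0.self_attn.attn.k_scale"
        if name.endswith((".k_scale", ".v_scale")):
            head, _, tail = name.rpartition(".")
            return f"{head}.attn.{tail}"
        return None


def default_weight_loader(param, loaded_weight):
    param.value = loaded_weight.value


def load_kv_cache_scale(name, loaded_weight, quant_config, params_dict, loaded_params):
    """Hypothetical shared utility mirroring the duplicated block.

    Returns True if `loaded_weight` was consumed as a KV-cache scale,
    so the caller can `continue` to the next checkpoint tensor.
    """
    if quant_config is None:
        return False
    scale_name = quant_config.get_cache_scale(name)
    if scale_name is None:
        return False
    param = params_dict[scale_name]
    weight_loader = getattr(param, "weight_loader", default_weight_loader)
    # Some checkpoints store the scale as a 1-element tensor; unwrap it.
    loaded_weight = loaded_weight if loaded_weight.dim() == 0 else loaded_weight[0]
    weight_loader(param, loaded_weight)
    loaded_params.add(scale_name)
    return True
```

With this in place, both `llama_eagle.py` and `llama_eagle3.py` would reduce the duplicated block to a single call plus the existing `maybe_remap_kv_scale_name` fallback.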
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
dsikka
left a comment
Is this only targeting kv cache quantization or are other quant schemes meant to be supported as well?
@dsikka, along with KV cache quantization, I'd say it is also targeting FC layer quantization in Llama-based Eagle drafters. Currently, if we pass quant configs to the drafter, only the decoder layer is quantized.
rahul-tuli
left a comment
Thanks for getting to this! The PR looks good to me, could you fix pre-commit, sign-off commits and add smoke tests for quantized eagle/eagle3 heads?
Signed-off-by: Shreyas Kulkarni <shreyas.gp269@gmail.com>
Closing this in favor of its duplicate #28435, which has a DCO sign-off and smoke tests.
Purpose
This PR adds comprehensive quantization support for Eagle and Eagle3 draft models in speculative decoding, including full KV cache quantization support. Previously, Eagle draft models could not use quantized weights in fully connected layers, or quantized KV caches.
Recently, #26590 was merged to properly obtain the draft model's quantization config, but it doesn't address the case where the entire draft model is quantized and we want to read input and weight scales of the `fc` layer along with KV cache quantization scales. This PR addresses the following:

- Refactors `get_draft_quant_config` into `utils` to avoid duplication of code in `llama_eagle.py` and `llama_eagle3.py`.
- Uses the `ReplicatedLinear` class to make the `fc` layer in drafters quantizable and to handle input and weight quantization scales (in Llama with Eagle/Eagle3).

Test Plan
Tested with a base llama3 instruct model with a quantized Eagle draft model (one decoder layer + one FC layer) with static fp8 quantization. The quantization of the base/verifier and Eagle draft model was performed using ModelOpt.
The non-quantized models work exactly the same as before (no changes to behavior).
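For reference, a setup like the one described could be launched along these lines. This is a hypothetical sketch, not a command from the PR: the model paths and token count are placeholders, and it assumes a recent vLLM build where `--speculative-config` accepts a JSON string.

```shell
# Hypothetical example: serve a Llama verifier with an FP8-quantized
# Eagle3 drafter (paths are placeholders, not real checkpoints).
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --speculative-config '{"method": "eagle3",
                           "model": "/path/to/fp8-quantized-eagle3-drafter",
                           "num_speculative_tokens": 3}'
```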
Test Result
Before:

```
KeyError: 'fc.input_scale'
```

After:
Acceptance rate of drafter:
Essential Elements of an Effective PR Description Checklist

- Update `supported_models.md` and `examples` for a new model.