[Bugfix] Fix quantization skip modules logic #13562

jeejeelee wants to merge 12 commits into vllm-project:main
Conversation
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
```python
# BitsAndBytes
if (isinstance(quant_config, BitsAndBytesConfig)
        and quant_config.llm_int8_skip_modules):
    quant_config.llm_int8_skip_modules = [
        hf_to_vllm_mapper._map_name(module)
        for module in quant_config.llm_int8_skip_modules
    ]
# AWQ
elif (isinstance(quant_config, AWQConfig)
        and quant_config.modules_to_not_convert):
    quant_config.modules_to_not_convert = [
        hf_to_vllm_mapper._map_name(module)
        for module in quant_config.modules_to_not_convert
    ]
# TODO: Support more quantization types.
```
Maybe we should introduce a common `ignored_modules` or `ignored_prefixes` attribute on `QuantizationConfig`, like `packed_modules_mapping`: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/base_config.py#L60-L66

Then each quant config can convert its method-specific `llm_int8_skip_modules`, `modules_to_not_convert`, etc. into a canonical format in `ignored_modules`. This would also allow us to generalize the `is_layer_skipped` function.
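To make the suggestion concrete, here is a minimal sketch of what a generalized `is_layer_skipped` over a common `ignored_modules` field could look like. The class and function names below are illustrative stand-ins, not vLLM's actual API.

```python
# Hypothetical sketch: a canonical, method-agnostic `ignored_modules` list on
# the base config lets one is_layer_skipped serve every quantization method,
# instead of each method checking llm_int8_skip_modules / modules_to_not_convert.
from dataclasses import dataclass, field


@dataclass
class QuantizationConfigSketch:
    # canonical list of ignored module name prefixes
    ignored_modules: list[str] = field(default_factory=list)


def is_layer_skipped(config: QuantizationConfigSketch, prefix: str) -> bool:
    # A layer is skipped when its name matches an ignored module exactly
    # or lives under an ignored parent prefix.
    return any(prefix == m or prefix.startswith(m + ".")
               for m in config.ignored_modules)


cfg = QuantizationConfigSketch(ignored_modules=["visual.merger"])
print(is_layer_skipped(cfg, "visual.merger.mlp"))        # True
print(is_layer_skipped(cfg, "model.layers.0.qkv_proj"))  # False
```

Each method-specific constructor would then only be responsible for translating its own attribute into this canonical list.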
I'd support an implementation like this as well. The current implementation could fail to properly map module names in nested models:

```python
modules_to_not_convert = ["SubModel.A"]
SubModel.hf_to_vllm_mapper = Mapper(orig_to_new_prefix={"A": "B"})
```

Note that "SubModel.A" will not match because "SubModel.A" does not start with "A".
This is a fairly minor issue, but something to keep in mind.
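A tiny runnable illustration of that mismatch, using a stand-in `map_name` helper (illustrative only, not vLLM's `WeightsMapper` API):

```python
# Minimal stand-in for a prefix mapper, to make the mismatch concrete.
def map_name(name: str, orig_to_new_prefix: dict[str, str]) -> str:
    for old, new in orig_to_new_prefix.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name  # unchanged when no prefix matches


# The submodel's mapper only knows names relative to itself ("A" -> "B"),
# so the fully qualified "SubModel.A" falls through unmapped.
print(map_name("SubModel.A", {"A": "B"}))  # SubModel.A  (not remapped)
print(map_name("A.proj", {"A": "B"}))      # B.proj
```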
Another implementation could look like this:

- Add a mutable `ignored_modules` attribute to `QuantizationConfig`
- At construction time, use the method-specific constructor to populate the `ignored_modules` attribute from disk
- At initialize time, within `SupportsQuant`, use the given model prefix and mapper to update the `ignored_modules` list with the proper model-specific mapping:

```python
ignored_modules = [prefix + hf_to_vllm_mapper[module - prefix] for module in ignored_modules]
```

This has the advantage of further standardizing around the `QuantizationConfig` base, as well as supporting mapping with nested models.
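The pseudocode above could be fleshed out roughly as follows; `update_ignored_modules` is a hypothetical helper, and the prefix-map handling is a sketch rather than vLLM's actual mapper logic:

```python
# Strip the model prefix, remap the remainder with the hf->vllm prefix map,
# then re-attach the prefix. Entries outside the submodel are left untouched.
def update_ignored_modules(ignored: list[str], prefix: str,
                           orig_to_new_prefix: dict[str, str]) -> list[str]:
    updated = []
    for module in ignored:
        if not module.startswith(prefix):
            updated.append(module)  # not under this submodel
            continue
        rest = module[len(prefix):]
        for old, new in orig_to_new_prefix.items():
            if rest.startswith(old):
                rest = new + rest[len(old):]
                break
        updated.append(prefix + rest)
    return updated


# With the prefix handled explicitly, the nested case now maps correctly.
print(update_ignored_modules(["SubModel.A"], "SubModel.", {"A": "B"}))
# ['SubModel.B']
```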
@jeejeelee Here's a WIP of what that might look like: #14635
```python
def _configure_packed_modules_mapping():
    """
    Pass packed_modules_mapping by reference to quant_config so that
    quant_config can properly match fused modules

    Note that model attributes are passed by reference to quant_config,
    enabling them to be updated by model_class.__new__ (ex. chatglm, qwen)
    """
    packed_mapping = getattr(model_class, "packed_modules_mapping", None)
    if packed_mapping is not None:
        # pass packed_modules_mapping by reference to quant_config
        quant_config.packed_modules_mapping = packed_mapping
    else:
        logger.warning(
            "The model class %s has not defined `packed_modules_mapping`, "
            "this may lead to incorrect mapping of quantized or ignored "
            "modules", model_class.__name__)
```
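The docstring's "passed by reference" point can be demonstrated in isolation. The class names below are illustrative stand-ins for the model class and quant config:

```python
# Because the dict is handed over by reference, later in-place updates on the
# model side (e.g. in model_class.__new__) stay visible to the quant config.
class QuantConfigSketch:
    packed_modules_mapping: dict = {}


class ModelClassSketch:
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    }


quant_config = QuantConfigSketch()
# pass packed_modules_mapping by reference, as in the snippet above
quant_config.packed_modules_mapping = ModelClassSketch.packed_modules_mapping

# a later in-place update on the model side is seen by quant_config
ModelClassSketch.packed_modules_mapping["gate_up_proj"] = [
    "gate_proj", "up_proj"
]
print("gate_up_proj" in quant_config.packed_modules_mapping)  # True
```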
Why is this needed after we added SupportsQuant (#13104)? I thought getting the `packed_modules_mapping` from the model to the quant config was the main purpose of that. cc @kylesayrs
The `_configure_packed_modules_mapping` function needs to remain in place until `SupportsQuant` has been added to all applicable models.
Closed due to #14635
Motivation
Some models, such as Qwen2.5-VL, have modified their layer hierarchy compared to their original `transformers` implementation. This change causes quantization skip modules to become ineffective, leading to incorrect initialization of linear methods.

Reproduce code
TODO

- Investigate other quantization methods (e.g. AWQ)
- Optimize the implementation logic