Add FP8 quantization ignored_layers support in llama #6592
cli99 wants to merge 2 commits into vllm-project:main

Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge). To run full CI, you can do one of these: 🚀
Thanks for this, but we should not put this type of logic in the model definition. Instead, let's have `Fp8Config.get_quant_method` take the layer name. This will avoid any changes to the model files. I can whip this up quickly if you want.
@robertgshaw2-neuralmagic, if you can add the change to `Fp8Config.get_quant_method` to take `layer_name`, that would be great. Thanks.
Suggested implementation: #6657
Summary:
https://github.com/neuralmagic/AutoFP8 supports "ignored_layers" in quantization, and the saved-out "quantization_config" carries that information, e.g. a config that does not quantize the self-attention module in the first layer.
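As an illustration, a quantization config of this shape might look like the following (the field names follow the AutoFP8 README; the specific layer name is a made-up example, not taken from this PR):

```python
# Illustrative AutoFP8-style quantization_config; the exact layer name
# below is a hypothetical example for a Llama checkpoint.
quantization_config = {
    "quant_method": "fp8",
    "activation_scheme": "static",
    "ignored_layers": ["model.layers.0.self_attn"],
}

# Modules whose names fall under an ignored prefix keep full precision;
# everything else is quantized to FP8.
print(quantization_config["ignored_layers"])
```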
However, vLLM currently does not respect the "ignored_layers" field and applies uniform quantization to all modules in all layers. #6515 added non-uniform quantization support through compressed-tensors. This PR adds support for "ignored_layers" in Llama models and leverages the `prefix` param added in #6515.