
[Misc] feat: add eetq quantization #3614

Closed · wants to merge 5 commits

Conversation


@dtlzhuangz commented Mar 25, 2024

EETQ is an int8 per-channel weight-only quantization method developed by NetEase Fuxi AI Lab. Its high-performance GEMM kernels are derived from FasterTransformer and TensorRT-LLM. We have integrated it into vLLM and implemented support for the Llama and Baichuan models, including their tensor-parallel variants; other models will be implemented in the future.
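For readers unfamiliar with the scheme, here is a minimal PyTorch sketch of the per-channel int8 weight-only idea. It shows reference semantics only: EETQ's CUTLASS kernels fuse the dequantization into the GEMM, and the function names below are hypothetical.

import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    # Symmetric scheme: one scale per output channel (row); weights are
    # stored as int8 while activations stay in fp16 (hence "weight only").
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    qweight = torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)
    return qweight, scales.to(torch.float16)

def w8_a16_matmul_reference(x: torch.Tensor, qweight: torch.Tensor, scales: torch.Tensor):
    # Reference semantics of a w8a16 GEMM: dequantize, then matmul in fp16.
    return x @ (qweight.to(torch.float16) * scales).t()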

  • It is faster than GPTQ-int8 when the input length is small; the image below shows our test results with input-len = 512 and output-len = 64 on an RTX 4090.
  • It does not need any calibration data.
  • The accuracy degradation is negligible.
  • You can quantize your model easily via the snippet below (a serving sketch follows the benchmark image):

from eetq import AutoEETQForCausalLM
from transformers import AutoTokenizer

model_name = "/path/to/your/model"       # original fp16 checkpoint
quant_path = "/path/to/quantized/model"  # where the int8 checkpoint is written
tp = 1                                   # tensor-parallel degree to quantize for

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoEETQForCausalLM.from_pretrained(model_name)
model.quantize(quant_path, tp)           # per-channel int8 quantization, no calibration
tokenizer.save_pretrained(quant_path)    # keep the tokenizer alongside the weights

  • We find it hard to implement online quantization because the model weights are stored inside the parallel linear classes. (I have tried to modify the code, but I could not do it in an elegant way. It would be great if you could help us solve this problem.)
  • Since the compile time of the CUTLASS kernels is long (about 15 minutes), we recommend installing eetq from source: https://github.com/NetEase-FuXi/EETQ. (If eetq is not installed, other functionality is unaffected.)
  • 8bit quantization #3261
  • int8 supported? #2455

[benchmark image: EETQ vs. GPTQ-int8 throughput, input-len = 512, output-len = 64, RTX 4090]
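As referenced above, a hedged sketch of how the quantized checkpoint could then be served with this PR applied. The quantization="eetq" value is an assumption about how the method is registered, and the path is a placeholder.

from vllm import LLM, SamplingParams

# Assumes this PR is applied and eetq is installed from source; the
# "eetq" method name and the checkpoint path are assumptions.
llm = LLM(model="/path/to/quantized/model", quantization="eetq", tensor_parallel_size=1)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)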

    output = w8_a16_gemm(x, qweight, weight_scales)
else:
    raise ImportError("You have not installed EETQ. Please refer to https://github.com/NetEase-FuXi/EETQ/tree/main")

A contributor commented on this diff:

It may need to add the bias if the model has one.
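A minimal sketch of the fix this comment suggests, assuming the surrounding method receives an optional bias tensor (the variable names mirror the excerpt above):

output = w8_a16_gemm(x, qweight, weight_scales)
if bias is not None:  # apply the layer's bias when the model defines one
    output = output + bias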

@twaka (Contributor) commented Apr 17, 2024

Thank you for bringing this up! I think this is a great addition to the quantization methods.
I personally tested it with Qwen, and the results looked promising once the bias part was fixed.
