
[Misc] feat: add eetq quantization #3614

Closed · wants to merge 5 commits

Conversation


@dtlzhuangz commented Mar 25, 2024

EETQ is an int8 per-channel weight-only quantization method developed by NetEase Fuxi AI Lab. Its high-performance GEMM kernels are derived from FasterTransformer and TensorRT-LLM. We have integrated it into vLLM and implemented support for the Llama and Baichuan models, including their tensor-parallel variants; other models will be implemented in the future.
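For readers unfamiliar with the scheme, here is a minimal PyTorch sketch of the per-channel int8 weight-only idea. It shows reference semantics only: EETQ's CUTLASS kernels fuse the dequantization into the GEMM, and the function names below are hypothetical.

import torch

def quantize_per_channel_int8(weight: torch.Tensor):
    # Symmetric scheme: one scale per output channel (row); weights are
    # stored as int8 while activations stay in fp16 (hence "weight only").
    scales = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    qweight = torch.clamp(torch.round(weight / scales), -128, 127).to(torch.int8)
    return qweight, scales.to(torch.float16)

def w8_a16_matmul_reference(x: torch.Tensor, qweight: torch.Tensor, scales: torch.Tensor):
    # Reference semantics of a w8a16 GEMM: dequantize, then matmul in fp16.
    return x @ (qweight.to(torch.float16) * scales).t()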

  • It is faster than GPTQ-int8 when the input length is small; the image below shows our test results with input-len = 512 and output-len = 64 on an RTX 4090.
  • It does not need any calibration data.
  • The accuracy degradation is negligible.
  • You can quantize your model easily via the snippet below (a serving sketch follows the benchmark image):

from eetq import AutoEETQForCausalLM
from transformers import AutoTokenizer

model_name = "/path/to/your/model"       # original fp16 checkpoint
quant_path = "/path/to/quantized/model"  # where the int8 checkpoint is written
tp = 1                                   # tensor-parallel degree to quantize for

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoEETQForCausalLM.from_pretrained(model_name)
model.quantize(quant_path, tp)           # per-channel int8 quantization, no calibration
tokenizer.save_pretrained(quant_path)    # keep the tokenizer alongside the weights

  • We find it hard to implement online quantization because the model weights are stored inside the parallel linear classes. (I have tried to modify the code, but I could not do it in an elegant way. It would be great if you could help us solve this problem.)
  • Since the compile time of the CUTLASS kernels is long (about 15 minutes), we recommend installing eetq from source: https://github.com/NetEase-FuXi/EETQ. (If eetq is not installed, other functionality is unaffected.)
  • 8bit quantization #3261
  • int8 supported? #2455

[benchmark image: EETQ vs. GPTQ-int8 throughput, input-len = 512, output-len = 64, RTX 4090]
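As referenced above, a hedged sketch of how the quantized checkpoint could then be served with this PR applied. The quantization="eetq" value is an assumption about how the method is registered, and the path is a placeholder.

from vllm import LLM, SamplingParams

# Assumes this PR is applied and eetq is installed from source; the
# "eetq" method name and the checkpoint path are assumptions.
llm = LLM(model="/path/to/quantized/model", quantization="eetq", tensor_parallel_size=1)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)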

    output = w8_a16_gemm(x, qweight, weight_scales)
else:
    raise ImportError("You have not installed EETQ. Please refer to https://github.com/NetEase-FuXi/EETQ/tree/main")

A contributor commented on this diff:

It may need to add the bias if the model has one.
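A minimal sketch of the fix this comment suggests, assuming the surrounding method receives an optional bias tensor (the variable names mirror the excerpt above):

output = w8_a16_gemm(x, qweight, weight_scales)
if bias is not None:  # apply the layer's bias when the model defines one
    output = output + bias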

@twaka (Contributor) commented Apr 17, 2024

Thank you for bringing this up! I think this is a great addition to the quantization methods.
I personally tested it with Qwen, and the results looked promising once the bias part was fixed.
