Included methods:
| Methods | Quantize | PPL Eval | Task Eval | Save |
|---|---|---|---|---|
| OmniQuant | yes | yes | TODO | yes |
| AffineQuant | yes | yes | TODO | yes |
| LRQuant | yes | yes | TODO | yes |
| RPTQ | TODO | TODO | TODO | yes |
| Slim-Plus | yes | yes | TODO | yes |
| I-LLM | yes | yes | TODO | yes |
| DuQuant | yes | yes | TODO | yes |
conda create -n quant_omniquant python=3.10 -y
conda activate quant_omniquant
git clone https://github.com/SSshuishui/quant_omniquant_series.git
cd quant_omniquant_series
pip install --upgrade pip
pip install -e .
We also leverage the kernel from AutoGPTQ to achieve real quantization, so you should also install the bug-fixed AutoGPTQ as follows:
git clone https://github.com/ChenMnZ/AutoGPTQ-bugfix
cd AutoGPTQ-bugfix
pip install -v .
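If you want a quick sanity check that the patched kernels are importable before running real quantization, a minimal Python check (assuming the fork keeps the upstream `auto_gptq` package name) is:

```python
# Verify the bug-fixed AutoGPTQ build can be imported.
from auto_gptq import AutoGPTQForCausalLM  # noqa: F401
print("AutoGPTQ import OK")
```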
We provide full scripts to run OmniQuant in ./scripts/. We use LLaMA as an example here:
- Obtain the channel-wise scales and shifts required for initialization; you can generate them yourself:
python generate_act_scale_shift.py --model /PATH/TO/LLaMA2/
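Conceptually, this step records per-channel activation statistics over a small calibration set: max-abs values become the scales and channel means become the shifts used to initialize the equivalent transformation. The snippet below is an illustrative sketch only (toy calibration text, made-up output file name), not the actual generate_act_scale_shift.py logic:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/PATH/TO/LLaMA2/"  # same placeholder as in the commands above
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

act_scales, act_shifts = {}, {}

def stat_hook(name):
    # Track per-input-channel max-abs (scale) and running mean (shift).
    def hook(module, inputs, output):
        x = inputs[0].detach().float().reshape(-1, inputs[0].shape[-1])
        amax, mean = x.abs().amax(dim=0), x.mean(dim=0)
        act_scales[name] = torch.maximum(act_scales[name], amax) if name in act_scales else amax
        act_shifts[name] = 0.9 * act_shifts[name] + 0.1 * mean if name in act_shifts else mean
    return hook

handles = [m.register_forward_hook(stat_hook(n))
           for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

# A real run loops over ~128 calibration samples instead of one toy sentence.
batch = tokenizer("Quantization calibration sample.", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**batch)

for h in handles:
    h.remove()
torch.save({"scales": act_scales, "shifts": act_shifts}, "act_scales_shifts_example.pt")
```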
- Weight-only quantization
# W3A16
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --lwc
# W3A16g128
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc
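For reference, W3A16g128 means weights are rounded to 3-bit integers with one scale and zero-point per group of 128 input channels while activations stay in 16-bit. A minimal fake-quantization sketch of that arithmetic (not the repository's kernel):

```python
import torch

def fake_quant_weight(w: torch.Tensor, n_bits: int = 3, group_size: int = 128) -> torch.Tensor:
    # Asymmetric uniform quantization with one (scale, zero_point) per group.
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min, w_max = g.amin(dim=-1, keepdim=True), g.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-5) / qmax
    zero_point = (-w_min / scale).round()
    q = (g / scale + zero_point).round().clamp(0, qmax)
    return ((q - zero_point) * scale).reshape(out_features, in_features)

w = torch.randn(4096, 4096)
print("mean abs quantization error:", (w - fake_quant_weight(w)).abs().mean().item())
```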
- Weight-activation quantization
# W4A4
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
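The --let flag enables the learnable equivalent transformation. Its core trick fits in a few lines: divide activations by a per-channel scale and fold the inverse into the weights, which leaves the layer output unchanged while flattening the activation outliers that make 4-bit activation quantization hard. The scale below is hand-picked for illustration; OmniQuant learns it, together with the LWC clipping factors, per transformer block:

```python
import torch

x = torch.randn(8, 4096) * torch.linspace(0.1, 20.0, 4096)  # a few outlier channels
w = torch.randn(11008, 4096)

s = x.abs().amax(dim=0).clamp(min=1e-5).sqrt()  # per-input-channel scale
x_t, w_t = x / s, w * s                          # fold the scaling into the weights

# Output is unchanged up to float32 rounding, but activations are flattened.
print("max output difference:", (x @ w.T - x_t @ w_t.T).abs().max().item())
print("activation range before/after:", x.abs().max().item(), x_t.abs().max().item())
```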
- Evaluation: take LLaMA-7B with W3A16g128 quantization as an example:
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 0 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/Pretrained/Parameters
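For orientation, --eval_ppl reports sliding-window perplexity on standard corpora such as WikiText2. A stand-alone sketch of that computation follows; dataset handling and window size in the repository may differ:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/PATH/TO/LLaMA2/"  # placeholder, as in the commands above
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seqlen, nlls = 2048, []
for i in range(ids.shape[1] // seqlen):
    window = ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
    with torch.no_grad():
        loss = model(window, labels=window).loss  # mean token NLL over the window
    nlls.append(loss.float() * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```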
AffineQuant:
- Weight-only quantization
# W3A16
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2 \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --lwc --use_ln_matrix --sf 1e-2
# W3A16g128
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2/llama2-8b \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --use_ln_matrix --sf 1e-2
- Weight-activation quantization
# W4A4
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2/llama2-8b \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
--tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
- Evaluation: take LLaMA-7B with W3A16g128 quantization as an example:
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2/ \
--epochs 0 --log_dir ./log/test \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --let --use_ln_matrix --sf 1e-2 \
--resume /PATH/TO/Pretrained/Parameters
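The AffineQuant-specific flags (--use_ln_matrix, --use_matrix, --sf) configure the affine transformation matrix that generalizes the per-channel scaling of LET. Stripped to its core, the idea is: multiply activations by an invertible matrix A and fold its inverse into the weights, so the output is preserved while the tensors being quantized are reshaped. The matrix below is random and unoptimized, purely to demonstrate the equivalence; AffineQuant learns it during block-wise calibration:

```python
import torch

torch.manual_seed(0)
d_in, d_out = 512, 1024
x = torch.randn(8, d_in) * torch.linspace(0.1, 10.0, d_in)  # channel outliers
w = torch.randn(d_out, d_in)

a = torch.eye(d_in) + 1e-2 * torch.randn(d_in, d_in)  # near-identity, invertible
x_t = x @ a                                            # transformed activations
w_t = w @ torch.linalg.inv(a).T                        # inverse folded into weights

# x @ a @ inv(a) @ w.T == x @ w.T, so the layer output is preserved.
print("max output difference:", (x @ w.T - x_t @ w_t.T).abs().max().item())
```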
LRQuant:
# W4A4 ppl
python main.py \
--method lrquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let
# W4A4 zero-shot
python main.py \
--method lrquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
# W4A4 tta
python main.py \
--method lrquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --tta
RPTQ:
python main.py \
--method rptq \
--model /PATH/TO/LLaMA2/ \
--eval_ppl --wbits 4 --abits 4 \
--tasks lambada_openai,piqa,arc_easy,arc_challenge,openbookqa,boolq
Only quantize the KV cache:
python main.py \
--method rptq \
--model /PATH/TO/LLaMA2/ \
--log_dir ./log/llama2-8b-kv \
--wbits 4 --abits 4 --only_quant_kv \
--eval_ppl --tasks lambada_openai,piqa,arc_easy,arc_challenge,openbookqa,boolq
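RPTQ's reordering idea, sketched under toy assumptions below: permute activation channels so that each quantization group contains channels of similar range, letting one shared scale fit all of them. The real implementation clusters channels from calibration statistics and fuses the permutation into neighbouring layers rather than reordering at runtime:

```python
import torch

torch.manual_seed(0)
x = torch.randn(256, 1024) * (torch.rand(1024) * 10 + 0.1)  # channels with mixed ranges

def fake_quant_grouped(t, n_bits=4, group_size=128):
    # One shared symmetric scale per group of `group_size` channels.
    qmax = 2 ** (n_bits - 1) - 1
    g = t.reshape(t.shape[0], -1, group_size)
    scale = g.abs().amax(dim=(0, 2), keepdim=True) / qmax
    return ((g / scale).round().clamp(-qmax - 1, qmax) * scale).reshape_as(t)

order = x.abs().amax(dim=0).argsort()  # group channels of similar range together
x_reordered = x[:, order]

print("MSE without reorder:", (x - fake_quant_grouped(x)).pow(2).mean().item())
print("MSE with reorder:   ", (x_reordered - fake_quant_grouped(x_reordered)).pow(2).mean().item())
```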
Slim-Plus:
# W2A16G128
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 2 --abits 16 --group_size 128 --lwc \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 2 --abits 16 --group_size 128 --lwc --aug_loss
# W3A16G128
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 3 --abits 16 --group_size 128 --lwc \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 3 --abits 16 --group_size 128 --lwc --aug_loss
I-LLM:
# FSBR (Fully-Smooth Block-Reconstruction)
python main.py \
--method illm \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 20 \
--wbits 4 --abits 4 --lwc --let \
--fully_quant
python main.py \
--method illm \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 0 \
--wbits 4 --abits 4 --lwc --let \
--fully_quant --illm
DuQuant:
1. Pre-compute the rotation matrices: python get_rot.py
2. Run DuQuant:
python main.py \
--method duquant \
--model /PATH/TO/LLaMA2/ \
--block_size 128 \
--max_rotation_step 256 \
--epochs 0 \
--wbits 4 \
--abits 4 \
--lwc \
--alpha 0.6 \
--smooth \
--lac 0.9 \
--swc 0.8 \
--eval_ppl \
--tasks arc_easy,arc_challenge,hellaswag,winogrande,boolq,piqa
- DuQuant + lwc
python main.py \
--method duquant \
--model /PATH/TO/LLaMA2/ \
--block_size 128 \
--max_rotation_step 256 \
--epochs 20 \
--wbits 4 \
--abits 4 \
--lwc \
--alpha 0.5 \
--smooth \
--lac 0.9 \
--eval_ppl \
--tasks arc_easy,arc_challenge,hellaswag,winogrande,boolq,piqa
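get_rot.py is tied to the rotation matrices DuQuant applies before quantization (--block_size and --max_rotation_step above control their construction). The toy sketch below assumes a single random block-diagonal orthogonal rotation, whereas DuQuant builds outlier-aware rotations plus a zigzag permutation; it shows the key invariant that rotating activations and folding the same rotation into the weights preserves the output while spreading outlier mass across channels within each block:

```python
import torch

torch.manual_seed(0)
d, block = 512, 128
x = torch.randn(8, d)
x[:, 5] *= 50.0                                   # a massive outlier channel
w = torch.randn(1024, d)

blocks = [torch.linalg.qr(torch.randn(block, block))[0] for _ in range(d // block)]
r = torch.block_diag(*blocks)                     # orthogonal, block-diagonal rotation

x_t = x @ r
w_t = w @ r                                       # fold the rotation into the weights

print("max output difference:", (x @ w.T - x_t @ w_t.T).abs().max().item())  # ~0: R @ R.T = I
print("activation max before/after:", x.abs().max().item(), x_t.abs().max().item())
```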
More detailed and optional arguments:
- `--model`: the local model path or a Hugging Face model identifier.
- `--wbits`: weight quantization bits.
- `--abits`: activation quantization bits.
- `--group_size`: group size for weight quantization. If not set, per-channel weight quantization is used by default.
- `--lwc`: activate the Learnable Weight Clipping (LWC).
- `--let`: activate the Learnable Equivalent Transformation (LET).
- `--lwc_lr`: learning rate of the LWC parameters, 1e-2 by default.
- `--let_lr`: learning rate of the LET parameters, 5e-3 by default.
- `--epochs`: number of training epochs. Set it to 0 to evaluate pre-trained OmniQuant checkpoints.
- `--nsamples`: number of calibration samples, 128 by default.
- `--eval_ppl`: evaluate the perplexity of the quantized model.
- `--tasks`: zero-shot tasks to evaluate.
- `--resume`: load pre-trained OmniQuant parameters.
- `--multigpu`: run inference of larger models on multiple GPUs.
- `--real_quant`: real quantization, which reduces memory usage. Note that due to the limitations of the AutoGPTQ kernels, real weight-only quantization only reduces memory, with slower inference speed.
- `--save_dir`: directory in which to save the quantized model for further exploration.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
AffineQuant: Affine Transformation Quantization for Large Language Models
LRQuant: Learnable and Robust Post-Training Quantization for Large Language Models
RPTQ: Reorder-Based Post-Training Quantization for Large Language Models