Included methods:
| Methods | Quantize | PPL Eval | Task Eval | Save |
|---|---|---|---|---|
| OmniQuant | yes | yes | TODO | yes |
| AffineQuant | yes | yes | TODO | yes |
| LRQuant | yes | yes | TODO | yes |
| RPTQ | TODO | TODO | TODO | yes |
| Slim-Plus | yes | yes | TODO | yes |
| I-LLM | yes | yes | TODO | yes |
| DuQuant | yes | yes | TODO | yes |
conda create -n quant_omniquant python=3.10 -y
conda activate quant_omniquant
git clone https://github.com/SSshuishui/quant_omniquant_series.git
cd quant_omniquant_series
pip install --upgrade pip
pip install -e .
We also leverage the kernel from AutoGPTQ to achieve real quantization, so you should also install the bug-fixed AutoGPTQ as follows:
git clone https://github.com/ChenMnZ/AutoGPTQ-bugfix
cd AutoGPTQ-bugfix
pip install -v .
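If you want a quick sanity check that the patched kernels are importable before running real quantization, a minimal Python check (assuming the fork keeps the upstream `auto_gptq` package name) is:

```python
# Verify the bug-fixed AutoGPTQ build can be imported.
from auto_gptq import AutoGPTQForCausalLM  # noqa: F401
print("AutoGPTQ import OK")
```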
We provide full scripts to run OmniQuant in ./scripts/. We use LLaMA as an example here:
- Obtain the channel-wise scales and shifts required for initialization; you can generate them yourself:
python generate_act_scale_shift.py --model /PATH/TO/LLaMA2/
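Conceptually, this step records per-channel activation statistics over a small calibration set: max-abs values become the scales and channel means become the shifts used to initialize the equivalent transformation. The snippet below is an illustrative sketch only (toy calibration text, made-up output file name), not the actual generate_act_scale_shift.py logic:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/PATH/TO/LLaMA2/"  # same placeholder as in the commands above
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

act_scales, act_shifts = {}, {}

def stat_hook(name):
    # Track per-input-channel max-abs (scale) and running mean (shift).
    def hook(module, inputs, output):
        x = inputs[0].detach().float().reshape(-1, inputs[0].shape[-1])
        amax, mean = x.abs().amax(dim=0), x.mean(dim=0)
        act_scales[name] = torch.maximum(act_scales[name], amax) if name in act_scales else amax
        act_shifts[name] = 0.9 * act_shifts[name] + 0.1 * mean if name in act_shifts else mean
    return hook

handles = [m.register_forward_hook(stat_hook(n))
           for n, m in model.named_modules() if isinstance(m, torch.nn.Linear)]

# A real run loops over ~128 calibration samples instead of one toy sentence.
batch = tokenizer("Quantization calibration sample.", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**batch)

for h in handles:
    h.remove()
torch.save({"scales": act_scales, "shifts": act_shifts}, "act_scales_shifts_example.pt")
```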
- Weight-only quantization
# W3A16
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --lwc
# W3A16g128
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc
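For reference, W3A16g128 means weights are rounded to 3-bit integers with one scale and zero-point per group of 128 input channels while activations stay in 16-bit. A minimal fake-quantization sketch of that arithmetic (not the repository's kernel):

```python
import torch

def fake_quant_weight(w: torch.Tensor, n_bits: int = 3, group_size: int = 128) -> torch.Tensor:
    # Asymmetric uniform quantization with one (scale, zero_point) per group.
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    w_min, w_max = g.amin(dim=-1, keepdim=True), g.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    scale = (w_max - w_min).clamp(min=1e-5) / qmax
    zero_point = (-w_min / scale).round()
    q = (g / scale + zero_point).round().clamp(0, qmax)
    return ((q - zero_point) * scale).reshape(out_features, in_features)

w = torch.randn(4096, 4096)
print("mean abs quantization error:", (w - fake_quant_weight(w)).abs().mean().item())
```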
- Weight-activation quantization
# W4A4
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
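The --let flag enables the learnable equivalent transformation. Its core trick fits in a few lines: divide activations by a per-channel scale and fold the inverse into the weights, which leaves the layer output unchanged while flattening the activation outliers that make 4-bit activation quantization hard. The scale below is hand-picked for illustration; OmniQuant learns it, together with the LWC clipping factors, per transformer block:

```python
import torch

x = torch.randn(8, 4096) * torch.linspace(0.1, 20.0, 4096)  # a few outlier channels
w = torch.randn(11008, 4096)

s = x.abs().amax(dim=0).clamp(min=1e-5).sqrt()  # per-input-channel scale
x_t, w_t = x / s, w * s                          # fold the scaling into the weights

# Output is unchanged up to float32 rounding, but activations are flattened.
print("max output difference:", (x @ w.T - x_t @ w_t.T).abs().max().item())
print("activation range before/after:", x.abs().max().item(), x_t.abs().max().item())
```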
- Evaluation: take LLaMA-7B with W3A16g128 quantization as an example:
python main.py \
--method omniquant \
--model /PATH/TO/LLaMA2/ \
--epochs 0 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc \
--resume /PATH/TO/Pretrained/Parameters
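For orientation, --eval_ppl reports sliding-window perplexity on standard corpora such as WikiText2. A stand-alone sketch of that computation follows; dataset handling and window size in the repository may differ:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/PATH/TO/LLaMA2/"  # placeholder, as in the commands above
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seqlen, nlls = 2048, []
for i in range(ids.shape[1] // seqlen):
    window = ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
    with torch.no_grad():
        loss = model(window, labels=window).loss  # mean token NLL over the window
    nlls.append(loss.float() * seqlen)
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"WikiText2 perplexity: {ppl.item():.2f}")
```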
AffineQuant:
- Weight-only quantization
# W3A16
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2 \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --lwc --use_ln_matrix --sf 1e-2
# W3A16g128
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2/llama2-8b \
--epochs 20 \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --use_ln_matrix --sf 1e-2
- Weight-activation quantization
# W4A4
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2/llama2-8b \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --aug_loss --use_matrix --sf 0.1 \
--tasks hendrycksTest,piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
- Evaluation: take LLaMA-7B with W3A16g128 quantization as an example:
python main.py \
--method affinequant \
--model /PATH/TO/LLaMA2/ \
--epochs 0 --log_dir ./log/test \
--eval_ppl --wbits 3 --abits 16 --group_size 128 --lwc --let --use_ln_matrix --sf 1e-2 \
--resume /PATH/TO/Pretrained/Parameters
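The AffineQuant-specific flags (--use_ln_matrix, --use_matrix, --sf) configure the affine transformation matrix that generalizes the per-channel scaling of LET. Stripped to its core, the idea is: multiply activations by an invertible matrix A and fold its inverse into the weights, so the output is preserved while the tensors being quantized are reshaped. The matrix below is random and unoptimized, purely to demonstrate the equivalence; AffineQuant learns it during block-wise calibration:

```python
import torch

torch.manual_seed(0)
d_in, d_out = 512, 1024
x = torch.randn(8, d_in) * torch.linspace(0.1, 10.0, d_in)  # channel outliers
w = torch.randn(d_out, d_in)

a = torch.eye(d_in) + 1e-2 * torch.randn(d_in, d_in)  # near-identity, invertible
x_t = x @ a                                            # transformed activations
w_t = w @ torch.linalg.inv(a).T                        # inverse folded into weights

# x @ a @ inv(a) @ w.T == x @ w.T, so the layer output is preserved.
print("max output difference:", (x @ w.T - x_t @ w_t.T).abs().max().item())
```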
LRQuant:
# W4A4 ppl
python main.py \
--method lrquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let
# W4A4 zero-shot
python main.py \
--method lrquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--wbits 4 --abits 4 --lwc --let \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
# W4A4 tta
python main.py \
--method lrquant \
--model /PATH/TO/LLaMA2/ \
--epochs 20 \
--eval_ppl --wbits 4 --abits 4 --lwc --let --tta
RPTQ:
python main.py \
--method rptq \
--model /PATH/TO/LLaMA2/ \
--eval_ppl --wbits 4 --abits 4 \
--tasks lambada_openai,piqa,arc_easy,arc_challenge,openbookqa,boolq
Only quantize the KV cache:
python main.py \
--method rptq \
--model /PATH/TO/LLaMA2/ \
--log_dir ./log/llama2-8b-kv \
--wbits 4 --abits 4 --only_quant_kv \
--eval_ppl --tasks lambada_openai,piqa,arc_easy,arc_challenge,openbookqa,boolq
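RPTQ's reordering idea, sketched under toy assumptions below: permute activation channels so that each quantization group contains channels of similar range, letting one shared scale fit all of them. The real implementation clusters channels from calibration statistics and fuses the permutation into neighbouring layers rather than reordering at runtime:

```python
import torch

torch.manual_seed(0)
x = torch.randn(256, 1024) * (torch.rand(1024) * 10 + 0.1)  # channels with mixed ranges

def fake_quant_grouped(t, n_bits=4, group_size=128):
    # One shared symmetric scale per group of `group_size` channels.
    qmax = 2 ** (n_bits - 1) - 1
    g = t.reshape(t.shape[0], -1, group_size)
    scale = g.abs().amax(dim=(0, 2), keepdim=True) / qmax
    return ((g / scale).round().clamp(-qmax - 1, qmax) * scale).reshape_as(t)

order = x.abs().amax(dim=0).argsort()  # group channels of similar range together
x_reordered = x[:, order]

print("MSE without reorder:", (x - fake_quant_grouped(x)).pow(2).mean().item())
print("MSE with reorder:   ", (x_reordered - fake_quant_grouped(x_reordered)).pow(2).mean().item())
```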
Slim-Plus:
# W2A16G128
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 2 --abits 16 --group_size 128 --lwc \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 2 --abits 16 --group_size 128 --lwc --aug_loss
# W3A16G128
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 3 --abits 16 --group_size 128 --lwc \
--tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande
python main.py \
--method slim++ \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 50 \
--wbits 3 --abits 16 --group_size 128 --lwc --aug_loss
I-LLM:
# FSBR (Fully-Smooth Block-Reconstruction)
python main.py \
--method illm \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 20 \
--wbits 4 --abits 4 --lwc --let \
--fully_quant
python main.py \
--method illm \
--model /PATH/TO/LLaMA2/ \
--eval_ppl \
--epochs 0 \
--wbits 4 --abits 4 --lwc --let \
--fully_quant --illm
DuQuant:
1. Pre-compute the rotation matrices: python get_rot.py
2. Run DuQuant:
python main.py \
--method duquant \
--model /PATH/TO/LLaMA2/ \
--block_size 128 \
--max_rotation_step 256 \
--epochs 0 \
--wbits 4 \
--abits 4 \
--lwc \
--alpha 0.6 \
--smooth \
--lac 0.9 \
--swc 0.8 \
--eval_ppl \
--tasks arc_easy,arc_challenge,hellaswag,winogrande,boolq,piqa
- DuQuant + lwc
python main.py \
--method duquant \
--model /PATH/TO/LLaMA2/ \
--block_size 128 \
--max_rotation_step 256 \
--epochs 20 \
--wbits 4 \
--abits 4 \
--lwc \
--alpha 0.5 \
--smooth \
--lac 0.9 \
--eval_ppl \
--tasks arc_easy,arc_challenge,hellaswag,winogrande,boolq,piqa
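get_rot.py is tied to the rotation matrices DuQuant applies before quantization (--block_size and --max_rotation_step above control their construction). The toy sketch below assumes a single random block-diagonal orthogonal rotation, whereas DuQuant builds outlier-aware rotations plus a zigzag permutation; it shows the key invariant that rotating activations and folding the same rotation into the weights preserves the output while spreading outlier mass across channels within each block:

```python
import torch

torch.manual_seed(0)
d, block = 512, 128
x = torch.randn(8, d)
x[:, 5] *= 50.0                                   # a massive outlier channel
w = torch.randn(1024, d)

blocks = [torch.linalg.qr(torch.randn(block, block))[0] for _ in range(d // block)]
r = torch.block_diag(*blocks)                     # orthogonal, block-diagonal rotation

x_t = x @ r
w_t = w @ r                                       # fold the rotation into the weights

print("max output difference:", (x @ w.T - x_t @ w_t.T).abs().max().item())  # ~0: R @ R.T = I
print("activation max before/after:", x.abs().max().item(), x_t.abs().max().item())
```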
More detailed and optional arguments:
- `--model`: the local model path or a Hugging Face model identifier.
- `--wbits`: weight quantization bits.
- `--abits`: activation quantization bits.
- `--group_size`: group size for weight quantization. If not set, per-channel weight quantization is used by default.
- `--lwc`: activate the Learnable Weight Clipping (LWC).
- `--let`: activate the Learnable Equivalent Transformation (LET).
- `--lwc_lr`: learning rate of the LWC parameters, 1e-2 by default.
- `--let_lr`: learning rate of the LET parameters, 5e-3 by default.
- `--epochs`: number of training epochs. Set it to 0 to evaluate pre-trained OmniQuant checkpoints.
- `--nsamples`: number of calibration samples, 128 by default.
- `--eval_ppl`: evaluate the perplexity of the quantized model.
- `--tasks`: zero-shot tasks to evaluate.
- `--resume`: load pre-trained OmniQuant parameters.
- `--multigpu`: run inference of larger models on multiple GPUs.
- `--real_quant`: real quantization, which reduces memory usage. Note that due to the limitations of the AutoGPTQ kernels, real weight-only quantization only reduces memory, with slower inference speed.
- `--save_dir`: directory in which to save the quantized model for further exploration.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
AffineQuant: Affine Transformation Quantization for Large Language Models
LRQuant: Learnable and Robust Post-Training Quantization for Large Language Models
RPTQ: Reorder-Based Post-Training Quantization for Large Language Models