Fluctuation-based Adaptive Structured Pruning for Large Language Models [arXiv]
Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
Institute of Automation, Chinese Academy of Sciences
- No training required: Our method can obtain a better compressed LLM without any retraining.
- Adaptive compression structure: Each module and layer gets its own adaptive pruning ratio.
- Efficient compression: 3 to 5 minutes on a single GPU, with no additional time required.
- Better performance: Better performance on a variety of language benchmarks, with additional gains in specific task datasets.
- Quick Start
- Configuration Instruction
- Language Modeling Evaluation
- Zero-shot Evaluation
- Acknowledgement
- Citation
Installation instructions can be found in INSTALL.md.
bash script/llama_7b.sh $GPU_ID
This script compresses the LLaMA-7B model with FLAP, pruning ~20% of its parameters. The pre-trained model and the dataset are downloaded automatically, so you do not need to fetch them manually. The first run takes some extra time for these downloads.
LLaMA-7B pruning with ~20% parameters pruned:
python main.py \
--model decapoda-research/llama-7b-hf \
--prune_method flap \
--pruning_ratio 0.2 \
--remove_heads -1 \
--metrics WIFV \
--structure AL-AM \
--nsamples 1024 \
--save_model "llm_weights/flap_p0.2_WIFV_ALAM_llama_7b/" \
--eval
Arguments:
- `--model`: The identifier of the LLaMA model on the Hugging Face model hub. The model name is passed to `AutoModelForCausalLM.from_pretrained` to load the pre-trained LLM. For example, to use LLaMA with 7 billion parameters, pass `decapoda-research/llama-7b-hf` to `--model`.
- `--cache_dir`: Directory for loading or storing LLM weights. The default is `llm_weights`.
- `--prune_method`: The pruning method to use. Three methods are implemented, as referenced in the paper: [`flap`, `wanda_sp`, `mag_sp`]. The default is `flap`.
- `--pruning_ratio`: The fraction of weights to be pruned (for example, `0.2` prunes ~20% of the parameters).
- `--remove_heads`: The number of attention heads to remove. Only used with the `UL-MM` and `AL-MM` structures, to manually set the ratio between self-attention and MLP pruning.
- `--metrics`: The pruning metric to use, as referenced in the paper: [`IFV`, `WIFV`, `WIFN`, `N/A`]. The default is `WIFV`.
- `--structure`: The global compressed model structure to use, as referenced in the paper: [`UL-UM`, `UL-MM`, `AL-MM`, `AL-AM`]. The default is `AL-AM`.
- `--unstr`: Whether to truly prune the model or only mask the weights. The default is `False`.
- `--eval`: Whether to evaluate the model on Wikitext2 and compute its perplexity. The default is `False`.
- `--save_model`: The directory where the pruned model will be stored.
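To make the option list concrete, here is a minimal `argparse` sketch that mirrors the documented flags, choices, and defaults. It is an illustration only and not the repository's actual `main.py`: the function name `build_parser`, the `EXAMPLE_ARGS` list, the use of `store_true` for `--unstr` and `--eval`, and the `None` defaults for options with no documented default are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented options (not main.py)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented FLAP options")
    parser.add_argument("--model", type=str, required=True,
                        help="Hugging Face model identifier")
    parser.add_argument("--cache_dir", type=str, default="llm_weights",
                        help="Directory for loading or storing LLM weights")
    parser.add_argument("--prune_method", type=str, default="flap",
                        choices=["flap", "wanda_sp", "mag_sp"])
    # No default is documented for the next three options, so the sketch
    # leaves them unset (assumption).
    parser.add_argument("--pruning_ratio", type=float, default=None,
                        help="Fraction of weights to prune, e.g. 0.2")
    parser.add_argument("--remove_heads", type=int, default=None,
                        help="Heads to remove (UL-MM / AL-MM only)")
    parser.add_argument("--nsamples", type=int, default=None,
                        help="Appears in the example command; meaning not documented here")
    parser.add_argument("--metrics", type=str, default="WIFV",
                        choices=["IFV", "WIFV", "WIFN", "N/A"])
    parser.add_argument("--structure", type=str, default="AL-AM",
                        choices=["UL-UM", "UL-MM", "AL-MM", "AL-AM"])
    # Modeled as boolean switches that default to False (assumption).
    parser.add_argument("--unstr", action="store_true")
    parser.add_argument("--eval", action="store_true")
    parser.add_argument("--save_model", type=str, default=None,
                        help="Directory where the pruned model is stored")
    return parser


# The arguments from the LLaMA-7B example command, as a list of strings.
EXAMPLE_ARGS = [
    "--model", "decapoda-research/llama-7b-hf",
    "--prune_method", "flap",
    "--pruning_ratio", "0.2",
    "--remove_heads", "-1",
    "--metrics", "WIFV",
    "--structure", "AL-AM",
    "--nsamples", "1024",
    "--save_model", "llm_weights/flap_p0.2_WIFV_ALAM_llama_7b/",
    "--eval",
]

args = build_parser().parse_args(EXAMPLE_ARGS)
```

Declaring the `choices` lists means an unsupported method, metric, or structure name is rejected at parse time rather than partway through a pruning run.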
After pruning and post-training, we follow lm-evaluation-harness for evaluation.
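For readers unfamiliar with the perplexity that `--eval` reports on Wikitext2, the standard definition is the exponential of the mean per-token negative log-likelihood. The sketch below illustrates that definition only; it is not the repository's evaluation code, and the function name `perplexity` and its input format (a list of per-token negative log-likelihoods in nats) are assumptions made for the example.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the given tokens.

    `token_nlls` holds one negative log-likelihood per token, measured
    in nats (natural logarithm). Hypothetical helper for illustration.
    """
    if len(token_nlls) == 0:
        raise ValueError("need at least one token to compute perplexity")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token has NLL = ln(4)
# per token, so its perplexity equals 4 (up to floating-point rounding).
uniform_ppl = perplexity([math.log(4)] * 10)
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why it is a common way to compare a pruned model against its dense counterpart.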
A brief summary of quantitative language modeling performance for the LLaMA family:
A brief summary of quantitative zero-shot performance for LLaMA-7B:
More results can be found in the paper.
- The logo was generated by DALL·E 3.
- The README.md: our README.md is modeled on that of LLM-Pruner; thanks to its authors for providing a readable and beautifully formatted README document.
- The evaluation of the LLM: lm-evaluation-harness.
- LLaMA: https://github.com/facebookresearch/llama.
- Vicuna: https://github.com/lm-sys/FastChat.
If you find this project useful, please cite
@misc{an2023fluctuationbased,
title={Fluctuation-based Adaptive Structured Pruning for Large Language Models},
author={Yongqi An and Xu Zhao and Tao Yu and Ming Tang and Jinqiao Wang},
year={2023},
eprint={2312.11983},
archivePrefix={arXiv},
primaryClass={cs.CL}
}