
Shortened LLM by Nota AI

Official codebase for Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods [ArXiv] [ICLR 2024 Workshop on ME-FoMo] [Blog Post].

  • We perform one-shot pruning by removing unimportant Transformer blocks from LLMs. Compared to recent baselines, our depth pruning achieves faster inference while yielding comparable or superior performance (see the criterion sketch after this list).
  • In retraining pruned models for quality recovery, continued pretraining (CPT) on a large corpus markedly outperforms LoRA-based tuning, particularly at severe pruning ratios.
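
As a rough illustration of the PPL criterion, the sketch below removes one Transformer block at a time and checks how much calibration perplexity degrades. This is a minimal sketch, not the repository's scoring code; the model name and calibration text are placeholders, and a CUDA GPU is assumed.

```python
# Sketch of a PPL-style block-importance score: remove one Transformer block
# at a time and measure how much calibration perplexity degrades. Illustrative
# only; the calibration text is a placeholder, not the paper's data.
import math
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "lmsys/vicuna-7b-v1.3"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

enc = tokenizer("One-shot depth pruning removes entire Transformer blocks.",
                return_tensors="pt").to("cuda")

@torch.no_grad()
def perplexity():
    # causal-LM loss with labels = inputs; exponentiate to get PPL
    return math.exp(model(**enc, labels=enc["input_ids"]).loss.item())

all_blocks = model.model.layers
scores = {}
for i in range(len(all_blocks)):
    # temporarily rebuild the decoder without block i
    model.model.layers = nn.ModuleList(
        [b for j, b in enumerate(all_blocks) if j != i])
    scores[i] = perplexity()
model.model.layers = all_blocks  # restore the full model

# blocks whose removal hurts PPL the least are the one-shot pruning candidates
print(sorted(scores, key=scores.get)[:6])
```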


Installation

conda create -n shortened-llm python=3.9
conda activate shortened-llm
git clone https://github.com/Nota-NetsPresso/shortened-llm.git
cd shortened-llm
pip install -r requirement.txt
Note on package versions:
  • Parts of the following repositories are included for evaluation:
    • src/LLMPruner: horseee/LLM-Pruner version 213ffa4
    • src/lm_eval: EleutherAI/lm-evaluation-harness version 3326c54
  • Torch version used in our experiments: 2.0.1 for RTX3090 & A100; 2.1.1 for H100.
(optional) GPTQ Support:
  • Post-training quantization can be further applied to our pruned models.
  • We applied GPTQ to the pruned & retrained models; a minimal quantization sketch follows the install commands below.
  • To install the required packages, we recommend installing from source:
    git clone https://github.com/AutoGPTQ/AutoGPTQ.git
    cd AutoGPTQ
    git checkout v0.7.1
    pip install -vvv -e .
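
The sketch below shows the general shape of GPTQ quantization with the AutoGPTQ API. It is a minimal illustration, not the repository's script (the repository's own flow is in script/quantize_gptq_vicuna-7b.sh); the checkpoint name comes from the tables below, and the calibration text is a placeholder.

```python
# Minimal GPTQ sketch with AutoGPTQ v0.7.x. Illustrative only: the
# calibration example is a placeholder, not the repository's actual data.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

name = "nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl"
tokenizer = AutoTokenizer.from_pretrained(name)
examples = [tokenizer("Placeholder calibration text.", return_tensors="pt")]

quant_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(name, quant_config)
model.quantize(examples)  # run GPTQ over the calibration batch
model.save_quantized("cpt_st-vicuna-5.5b-gptq-4bit")
```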

Models from Aggressive Pruning & CPT Retraining (arXiv-v2):

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 45% | PPL | nota-ai/cpt_st-vicuna-v1.3-3.7b-ppl |
| Vicuna-v1.3-7B | 60% | PPL | nota-ai/cpt_st-vicuna-v1.3-2.7b-ppl |
| Vicuna-v1.3-7B | 80% | PPL | nota-ai/cpt_st-vicuna-v1.3-1.5b-ppl |
Results for these models were obtained with EleutherAI/lm-evaluation-harness version 3326c54; see the repository for the full results table.
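
These checkpoints are standard Hugging Face causal LMs, so (assuming nothing beyond the transformers API, and picking one model name from the table above) they can be used directly:

```python
# Load a pruned & CPT-retrained checkpoint from the table above and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("Depth pruning speeds up inference because",
                   return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```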

Models from Moderate Pruning & LoRA Retraining (arXiv-v1):

| Source Model | Pruning Ratio | Pruning Criterion | 🤗 Hugging Face Link |
|---|---|---|---|
| LLaMA-1-7B | 20% | PPL | nota-ai/st-llama-1-5.5b-ppl |
| LLaMA-1-7B | 20% | Taylor+ | nota-ai/st-llama-1-5.5b-taylor |
| Vicuna-v1.3-7B | 20% | PPL | nota-ai/st-vicuna-v1.3-5.5b-ppl |
| Vicuna-v1.3-7B | 20% | Taylor+ | nota-ai/st-vicuna-v1.3-5.5b-taylor |
| Vicuna-v1.3-13B | 21% | PPL | nota-ai/st-vicuna-v1.3-10.5b-ppl |
| Vicuna-v1.3-13B | 21% | Taylor+ | nota-ai/st-vicuna-v1.3-10.5b-taylor |
Results for these models were obtained with EleutherAI/lm-evaluation-harness version 3326c54; see the repository for the full results table.

Examples

The scripts perform (1) block pruning ➔ (2) LoRA-based retraining ➔ (3) zero-shot evaluation; a minimal sketch of the block-removal step follows this list.

  • Pruning criterion: for each model below, the first script uses PPL and the second uses Taylor+.
  • LLaMA-1-7b (based on LlamaForCausalLM)
    bash script/prune_llama-7b_crit-ppl.sh
    bash script/prune_llama-7b_crit-taylor.sh
  • Llama-2-7b (based on LlamaForCausalLM)
    bash script/prune_llama2-7b_crit-ppl.sh
    bash script/prune_llama2-7b_crit-taylor.sh
  • Llama-3-8B (based on LlamaForCausalLM)
    bash script/prune_llama3-8b_crit-ppl.sh
    bash script/prune_llama3-8b_crit-taylor.sh
  • Vicuna-7b-v1.3 (based on LlamaForCausalLM)
    bash script/prune_vicuna-7b_crit-ppl.sh
    bash script/prune_vicuna-7b_crit-taylor.sh
  • Vicuna-13b-v1.3 (based on LlamaForCausalLM)
    bash script/prune_vicuna-13b_crit-ppl.sh
    bash script/prune_vicuna-13b_crit-taylor.sh
  • CatPPT-base (based on MistralForCausalLM)
    bash script/prune_CatPPT_crit-ppl.sh
    bash script/prune_CatPPT_crit-taylor.sh
  • Gemma-2b (based on GemmaForCausalLM)
    bash script/prune_gemma-2b_crit-ppl_yesBOS.sh
    bash script/prune_gemma-2b_crit-taylor_yesBOS.sh
  • Gemma-7b (based on GemmaForCausalLM)
    bash script/prune_gemma-7b_crit-ppl_yesBOS.sh
    bash script/prune_gemma-7b_crit-taylor_yesBOS.sh
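
For intuition, here is a minimal sketch of what the block-removal step does on a LLaMA-family model. It is illustrative only, not the repository's exact implementation: the block indices and output directory name are placeholders.

```python
# Minimal sketch of the one-shot block-removal step on a LLaMA-family model:
# rebuild the decoder with the unimportant blocks removed. Illustrative only;
# the block indices below are placeholders, not scores from the paper.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.3", torch_dtype=torch.float16)

blocks_to_drop = {21, 22, 23, 24, 25, 26}  # e.g., blocks ranked least important
model.model.layers = nn.ModuleList(
    [block for i, block in enumerate(model.model.layers)
     if i not in blocks_to_drop])
model.config.num_hidden_layers = len(model.model.layers)

# retraining (LoRA or continued pretraining) then recovers generation quality
model.save_pretrained("st-vicuna-v1.3-pruned")
```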

Other Scripts

  • To test other pruning ratios, use:

    bash script/prune.sh
  • To obtain baselines using the magnitude pruning criterion, use:

    bash script/prune_llama-7b_crit-magnitude.sh
    bash script/prune_vicuna-7b_crit-magnitude.sh
    bash script/prune_vicuna-13b_crit-magnitude.sh
  • To measure (1) PPL on WikiText2 & PTB, and (2) accuracy on seven commonsense reasoning tasks (with EleutherAI/lm-evaluation-harness version 3326c54), use:

    bash script/evaluate.sh
  • (Optional) Any post-training quantization method can be applied to our pruned models. The example script quantizes our pruned models using GPTQ and measures their performance with script/evaluate.sh:

    bash script/quantize_gptq_vicuna-7b.sh
  • To measure latency & throughput (a minimal timing sketch follows this list), use:

    bash script/measure_time.sh
  • To measure VRAM requirements, use:

    bash script/measure_vram.sh
  • To measure GPU compute utilization, use:

    bash script/measure_gpuutil.sh
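
A minimal timing sketch along these lines is shown below. It is illustrative, not the internals of script/measure_time.sh; it assumes a CUDA GPU, and the checkpoint name comes from the tables above.

```python
# Illustrative latency/throughput measurement for greedy decoding; not the
# internals of script/measure_time.sh. Assumes a CUDA GPU.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "nota-ai/cpt_st-vicuna-v1.3-5.5b-ppl"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16).cuda().eval()

inputs = tokenizer("The universe is", return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)   # warm-up
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed:.2f} s, {new_tokens / elapsed:.1f} tokens/s")
```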

Gradio Demo: Width✄ vs. Depth✄

The demo compares the use of LLM-Pruner (Ma et al., 2023; width pruning) and Shortened LLaMA (Ours; depth pruning) for the LLaMA-1-7B model:

pip install transformers==4.33.1 # to run LLM-Pruner's model
python src/app.py
A screenshot of the demo (run on an A100 80GB GPU) is included in the repository.

License

  • All rights related to this repository and the compressed models are reserved by Nota Inc.
  • The intended use is strictly limited to research and non-commercial projects.

Acknowledgments

  • This repository includes code adapted from horseee/LLM-Pruner and EleutherAI/lm-evaluation-harness (see the notes under Installation).

Citation

@article{kim2024shortened,
  title={Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={arXiv preprint arXiv:2402.02834},      
  year={2024},
  url={https://arxiv.org/abs/2402.02834}
}
@article{kim2024mefomo,
  title={Shortened LLaMA: A Simple Depth Pruning for Large Language Models},
  author={Kim, Bo-Kyeong and Kim, Geonmin and Kim, Tae-Ho and Castells, Thibault and Choi, Shinkook and Shin, Junho and Song, Hyoung-Kyu},
  journal={ICLR Workshop on Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)},
  year={2024},
  url={https://openreview.net/forum?id=18VGxuOdpu}
}