Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models
Rocktim Jyoti Das*, Mingjie Sun*, Liqun Ma, Zhiqiang Shen*
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi.
Carnegie Mellon University, Pittsburgh.
- Introduction
- Install
- GBLM-Pruned Weights
- Model Zoo
- Usage
- Zero-Shot Harness Evaluation
- Acknowledgement
- Issues
- License
- Citation
Gradient information has been overlooked by prior neural network pruning methods. Even the original Optimal Brain Surgeon (OBS) work dropped the first-order term when deriving its pruning framework, under the assumption that gradients vanish at a minimum and therefore carry no useful information. In this work, we revisit and refine the OBS framework by taking the first-order term into account. Based on this analysis, we propose a gradient-based pruning metric.
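To make the idea concrete, here is a minimal sketch of a gradient-informed pruning score: weight magnitude scaled by an accumulated per-weight gradient magnitude, with the lowest-scoring weights zeroed out. The function names and the exact form of the score are illustrative assumptions, not the paper's precise metric.

```python
import numpy as np

def gblm_style_score(weight, grad_norm):
    # Hypothetical gradient-informed score (a sketch, not the paper's exact
    # metric): weight magnitude scaled by accumulated gradient magnitude.
    return np.abs(weight) * grad_norm

def prune_layer(weight, grad_norm, sparsity=0.5):
    # Zero out the fraction `sparsity` of weights with the smallest scores
    # (unstructured pruning).
    score = gblm_style_score(weight, grad_norm)
    k = int(score.size * sparsity)
    if k == 0:
        return weight.copy()
    # k-th smallest score becomes the pruning threshold.
    threshold = np.partition(score.ravel(), k - 1)[k - 1]
    mask = score > threshold
    return weight * mask
```

Weights whose scores fall at or below the threshold are set to zero, so a weight with a small magnitude can still survive if its gradient signal is large.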
The installation instructions are provided here.
Please check out our Model Zoo for all public GBLM-Pruner compressed model checkpoints, and the instructions of how to use the weights.
Our method requires computing gradient magnitudes to form the pruning metric. The gradients for a model can be computed as follows:
```sh
bash run_grad_compute.sh
```
Overview of the arguments in the bash file:
- `--model`: The identifier or path of the LLaMA model.
- `--llama_version`: Version of the LLaMA model in use (1 for LLaMA-1, 2 for LLaMA-2).
- `--nsamples`: Number of calibration samples.
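The gradient-accumulation step performed by the script can be sketched as follows: per-weight gradient magnitudes are aggregated across the calibration samples, here with an L2 norm over samples. The aggregation norm is an assumption; the script may use a different reduction.

```python
import numpy as np

def accumulate_grad_magnitude(grad_batches):
    # Aggregate per-weight gradient magnitudes over calibration samples,
    # here as an L2 norm across samples (an assumption about the reduction).
    sq_sum = None
    for g in grad_batches:
        sq_sum = g ** 2 if sq_sum is None else sq_sum + g ** 2
    return np.sqrt(sq_sum)
```

In practice each `g` would be the gradient of the loss on one calibration sample with respect to a layer's weight matrix, obtained via a backward pass.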
After computing the model gradients, the pruned model can be obtained with the following command:
```sh
bash run_gblm_prune.sh
```
Overview of the arguments in the bash file:
- `--model`: The identifier or path of the LLaMA model.
- `--gradient_path`: Path to the pre-computed gradients.
- `--prune_method`: Pruning method to use.
- `--nsamples`: Number of calibration samples.
- `--seed`: Random seed.
- `--sparsity_ratio`: Fraction of the weights to be pruned.
- `--sparsity_type`: Sparsity type to use.
- `--save`: Path to store results.
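Besides fully unstructured sparsity, hardware-friendly semi-structured patterns such as 2:4 (two weights kept out of every four consecutive weights) are a common choice for the sparsity type. A minimal sketch of building a 2:4 mask from pruning scores, assuming the row length is divisible by 4:

```python
import numpy as np

def semi_structured_24_mask(score):
    # Build a 2:4 mask: in every group of 4 consecutive weights along the
    # last axis, keep the 2 with the highest scores. A sketch only; the
    # accepted values of --sparsity_type are defined by the repo's scripts.
    groups = score.reshape(-1, 4)
    order = np.argsort(groups, axis=1)  # per-group indices, ascending score
    mask = np.ones_like(groups, dtype=bool)
    rows = np.arange(groups.shape[0])[:, None]
    mask[rows, order[:, :2]] = False    # drop the 2 lowest-scoring weights
    return mask.reshape(score.shape)
```

Applying the mask element-wise to the weight matrix yields exactly 50% sparsity with a regular pattern that sparse kernels can exploit.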
We use the EleutherAI LM Evaluation Harness for zero-shot evaluation, following the instructions provided here. We used the following command to reproduce our results:
```sh
python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=/path/to/model \
    --tasks task_name \
    --device cuda:0 \
    --no_cache
```
This codebase is built upon SparseGPT and Wanda.
Please don't hesitate to contact us if you encounter any code-related issues or wish to discuss the paper. You can reach out to us via the GitHub issues or through email at [email protected].
This project is released under the MIT license. Please see the LICENSE file for more information.
If you find this work useful, please consider citing:
```bibtex
@misc{das2023size,
  title={Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models},
  author={Rocktim Jyoti Das and Liqun Ma and Zhiqiang Shen},
  year={2023},
  eprint={2311.04902},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```