Awesome LLMs Pruning


This repository integrates useful resources for large language model pruning papers, including a one-sentence take-away summary, explanatory notes (such as each paper's challenges, blogs, or videos), paper tags, source-code links, and the publication venue.

Please feel free to open a pull request or an issue to add papers.

🔥 Continuously updating... Please star this repository if you find it helpful :)

Table of Contents

Tags of Pruning
2024
2023

Tags of Pruning

Clicking on a badge, such as Budget, will direct you to the corresponding explanation file.

| Type | Criteria | Budget | Data | Retraining | Weights |
|---|---|---|---|---|---|
| Unstructured | Magnitude | Sparsity (e.g. layer-wise or global) | Data-free | Without | Frozen |
| Structured | Taylor (e.g. Hessian) | FLOPs | Calibration | Efficient (e.g. LoRA) | Update |
| Semi-structured | Fisher | Latency | Small | Extensive | - |
| Other | Trainable | Energy | Medium | Scratch | - |
| - | Other | Other | Large | Other | - |

2024

Each entry below lists the paper title, a one-sentence take-away, note links (e.g. Challenge, Blog, Reviews), and the code framework.
Compact Language Models via Pruning and Knowledge Distillation
Prunes LLMs structurally along different axes such as layer, neuron, head, and embedding channel, similar to NAS searching over different dimensions. The difference lies in the search space: pruning uses the pre-trained large model itself as a (simpler) search space, whereas NAS searches a manually pre-defined (more complex) space from scratch. A separate proxy importance score is estimated per axis for pruning. Retraining with knowledge distillation requires up to 40x fewer training tokens.
Note: Challenge · Blog | Code: PyTorch (see the importance-scoring sketch below)
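Below is a minimal sketch of one plausible activation-based proxy importance score along the neuron axis, in the spirit of the take-away above; the calibration activations, the mean-absolute-activation criterion, and the 75% keep ratio are illustrative assumptions, not the paper's exact recipe.

```python
import torch

torch.manual_seed(0)
acts = torch.randn(4096, 1024).abs()       # hypothetical calibration activations: (tokens, neurons)

# Proxy importance per neuron: mean absolute activation over calibration tokens (assumption).
neuron_importance = acts.mean(dim=0)

# Keep the top 75% of neurons along this axis; the remainder would be pruned structurally.
keep = neuron_importance.topk(int(0.75 * neuron_importance.numel())).indices
print(keep.shape)                           # torch.Size([768])
```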
Keyformer: KV Cache reduction through attention sparsification for Efficient Generative Inference
Keyformer, a successor to H2O (see below), uses a Gumbel-softmax-based score function instead of the raw attention scores used in H2O to dynamically identify and retain the top-k key tokens, reducing the KV cache size. A sliding window drawn from Sparse Transformer retains (does not prune) the w most recent tokens, yielding a mixture of recent and key tokens.
Note: Challenge · Blog · Summary | Code: PyTorch (see the KV-cache retention sketch below)
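A minimal sketch of retaining a mixture of recent and key tokens in the KV cache, as summarized above; the Gumbel-noise perturbation is a simplified stand-in for Keyformer's score function, and the sequence length, cache budget k, and window size w are illustrative.

```python
import torch

torch.manual_seed(0)
seq_len, k, w = 256, 32, 8           # cache budget k, recent window w (illustrative)
attn_scores = torch.rand(seq_len)    # hypothetical accumulated attention per cached token

# Perturb scores with Gumbel noise (simplified stand-in for Keyformer's score function).
gumbel = -torch.log(-torch.log(torch.rand(seq_len)))
scores = attn_scores + 0.1 * gumbel

recent = torch.arange(seq_len - w, seq_len)                 # always keep the w most recent tokens
candidates = torch.arange(seq_len - w)                      # older tokens compete for the rest
key = candidates[scores[candidates].topk(k - w).indices]    # top-(k - w) key tokens

keep = torch.cat([key, recent])      # indices of KV-cache entries to retain
print(keep.shape)                    # torch.Size([32])
```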
A Simple and Effective Pruning Approach for Large Language Models
A pruning metric termed Wanda that considers both weight magnitudes and input activation norms to prune weights on a per-output basis instead of layer-wise, requiring no retraining or weight update. Can be viewed as a simplified version of SparseGPT.
Note: Challenge · Blog · Reviews | Code: PyTorch (see the metric sketch below)
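A minimal sketch of the Wanda-style metric (weight magnitude times input-activation norm) with per-output-row ranking, as described in the take-away; the calibration inputs and the 50% sparsity target are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
W = torch.randn(512, 1024)           # layer weights: (out_features, in_features)
X = torch.randn(4096, 1024)          # hypothetical calibration inputs to this layer

# Wanda-style score: weight magnitude times the L2 norm of the matching input feature.
score = W.abs() * X.norm(p=2, dim=0)           # broadcasts to (out_features, in_features)

# Prune 50% of weights within each output row (per-output comparison group), no weight update.
k = int(0.5 * W.shape[1])
prune_idx = score.argsort(dim=1)[:, :k]        # lowest-scoring half of each row
mask = torch.ones_like(W).scatter_(1, prune_idx, 0.0).bool()
W_pruned = W * mask
print(mask.float().mean())                     # ~0.5 of weights kept
```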
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
Uses 1.58 bits to quantize every single parameter of the LLM into a ternary value -1, 0, or +1. This can be viewed as 1-bit binarization (-1 or +1) combined with unstructured pruning (0). With 1.58-bit weights and 8-bit activations, it matches a full-precision Transformer LLM with the same model size and training tokens when trained from scratch.
Note: Challenge · Discussion | Code: PyTorch (see the ternary-quantization sketch below)
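A minimal sketch of ternary weight quantization via absmean scaling, matching the ternary {-1, 0, +1} view described above; treat the exact scaling and rounding details here as assumptions rather than the paper's implementation.

```python
import torch

torch.manual_seed(0)
W = torch.randn(512, 1024)                     # full-precision weights

# Absmean scaling, then round and clamp to the ternary set {-1, 0, +1}.
scale = W.abs().mean().clamp(min=1e-8)
W_ternary = (W / scale).round().clamp(-1, 1)

# Dequantized view used in the matmul; zeros act like unstructured pruning, ±1 like binarization.
W_hat = W_ternary * scale
print(W_ternary.unique())                      # tensor([-1., 0., 1.])
```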
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
Adaptively allocates an optimal sparsity ratio to each layer within a transformer block by minimizing the block-wise reconstruction error. To do so, a parameter-efficient algorithm optimizes only a few learnable coefficients (e.g., ~100). Pre-trained weights are frozen.
Note: Challenge · Reviews | Code: PyTorch
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
In the first stage, training-aware pruning learns masks satisfying a specified target architecture by imposing regularization on ~0.4B tokens; the pruned model is then retrained on another ~50B tokens from the RedPajama dataset. A dynamic batch loading method updates the composition of sampled data per mini-batch across different domains.
Note: Challenge · Blog · Reviews | Code: PyTorch
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Allocates non-uniform sparsity ratios across layers, guided by the principle that a layer with a higher proportion of outliers should receive a lower sparsity; the tailored layer-wise sparsity is then plugged directly into Wanda and SparseGPT.
Note: Challenge · Reviews | Code: PyTorch (see the sparsity-allocation sketch below)
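A minimal sketch of outlier-guided layer-wise sparsity allocation in the spirit of OWL; the outlier definition (scores exceeding M times the layer mean), the synthetic scores, and the way ratios are re-centered around the global target are illustrative simplifications.

```python
import torch

torch.manual_seed(0)
target_sparsity, M = 0.7, 5.0      # global sparsity target and outlier threshold (assumptions)

def fake_scores(outlier_frac):
    """Hypothetical Wanda-style layer scores |W|*||X||_2 with a controllable outlier fraction."""
    s = torch.rand(512, 1024)
    idx = torch.randperm(s.numel())[: int(outlier_frac * s.numel())]
    s.view(-1)[idx] += 20.0        # inject a few large outlier scores
    return s

layer_scores = [fake_scores(f) for f in (0.001, 0.01, 0.05)]

# Outlier ratio per layer: fraction of entries exceeding M times that layer's mean score.
outlier_ratio = torch.tensor([(s > M * s.mean()).float().mean() for s in layer_scores])

# More outliers -> lower sparsity; shift ratios around their mean so the average stays on target.
layer_sparsity = (target_sparsity - (outlier_ratio - outlier_ratio.mean())).clamp(0, 1)
print(outlier_ratio, layer_sparsity)
```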
Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models
Proposes a new pruning criterion named RIA (Relative Importance and Activations) for LLMs. For N:M sparsity structures, a column (channel) permutation of the score matrix is introduced to maximize the total retained weight importance. No retraining.
Note: Challenge · Reviews | Code: PyTorch (see the N:M selection sketch below)
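A minimal sketch of enforcing an N:M (here 2:4) pattern from an importance-score matrix, the setting the take-away refers to; the score is left as a random placeholder rather than the RIA formula, and the paper's channel-permutation step is omitted.

```python
import torch

torch.manual_seed(0)
N, M = 2, 4
W = torch.randn(512, 1024)
score = torch.rand_like(W)                 # placeholder importance scores (not the RIA formula)

# Group every M consecutive input channels and keep the N highest-scoring weights per group.
groups = score.view(W.shape[0], -1, M)                      # (out, num_groups, M)
keep = groups.topk(N, dim=-1).indices
mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0).bool()
W_nm = W * mask.view_as(W)

# Every group of 4 inputs now holds exactly 2 nonzeros: hardware-friendly 2:4 sparsity.
print(mask.view(-1, M).sum(dim=-1).unique())                # tensor([2])
```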
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
Retrains the LLM's weights with lightweight LoRA while optimizing structured-pruning masks with a small number of trainable parameters in a differentiable way on the instruction-tuning Alpaca dataset. A collaborative prompt is used to help the pruning task.
Note: Challenge · Reviews | Code: PyTorch
Scaling Laws for Sparsely-Connected Foundation Models
Discovers a scaling law for weight sparsity, formulating the relationship between weight sparsity, the number of non-zero parameters, and the amount of training data. It reveals that the optimal sparsity increases with more training data and offers insights for improved computational efficiency.
Note: Challenge · Reviews | Code: -
The LLM Surgeon
This paper introduces LLM Surgeon, which improves the efficiency of second-order Hessian-based pruning techniques, such as Optimal Brain Surgeon, by employing Kronecker-factored approximations of the Fisher information matrix and establishing closed-form solutions. Pruning OPT models and Llama-v2-7B by 20%-30% incurs a negligible loss in performance.
Note: Challenge · Reviews | Code: PyTorch
Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models
Retraining-free pruning for encoder-based language models such as BERT that preserves the knowledge of the pre-trained model through sublayer-wise iterative pruning, from the bottom to the top sublayer.
Note: Challenge · Reviews | Code: PyTorch
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
Dynamic Sparse No Training (DSNT) performs iterative pruning-and-growing steps that update only the sparse mask, adapting it by minimizing reconstruction error (a proxy of perplexity). It enables higher sparsity rates such as 60% or 70% and is training-free.
Note: Challenge · Reviews | Code: PyTorch (see the prune-and-grow sketch below)
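A minimal sketch of one training-free prune-and-grow loop that only edits the sparse mask to reduce layer reconstruction error; the greedy grow/prune criteria below are illustrative simplifications, not the paper's exact update rule.

```python
import torch

torch.manual_seed(0)
n, d, sparsity = 2048, 256, 0.6
X = torch.randn(n, d)                       # hypothetical calibration inputs
w = torch.randn(d)                          # one output row of a dense layer
y = X @ w                                   # dense output to be reconstructed

# Initial mask: magnitude pruning to the target sparsity.
k = int(sparsity * d)
mask = torch.ones(d, dtype=torch.bool)
mask[w.abs().argsort()[:k]] = False

def recon_err(m):
    return (X @ (w * m) - y).pow(2).mean()

for _ in range(50):                         # prune-and-grow iterations; weights never change
    r = y - X @ (w * mask)                  # current reconstruction residual
    pruned, kept = (~mask).nonzero().squeeze(1), mask.nonzero().squeeze(1)
    # Grow: revive the pruned weight whose contribution best aligns with the residual.
    gain = (X[:, pruned] * w[pruned]).T @ r
    grow = pruned[gain.abs().argmax()]
    # Prune: drop the kept weight with the smallest Wanda-like score to keep sparsity fixed.
    score = w[kept].abs() * X[:, kept].norm(dim=0)
    drop = kept[score.argmin()]
    mask[grow], mask[drop] = True, False

print(recon_err(torch.ones(d, dtype=torch.bool)), recon_err(mask))  # dense error is 0
```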
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
An effective software framework for unstructured SpMM on tensor cores (which do not natively allow skipping arbitrary element-level computations), leveraging on-chip resources for efficient sparse data extraction and computation/memory-access overlapping, thereby improving GPU memory-bandwidth utilization.
Note: - | Code: Python/C++
Shortened LLaMA: A Simple Depth Pruning for Large Language Models
First identifies unimportant Transformer blocks (bigger, coarser units than width pruning), then performs one-shot pruning with perplexity (PPL) as the pruning criterion, followed by light LoRA retraining. Shows fast inference and good zero-shot capabilities.
Note: Challenge | Code: PyTorch (see the depth-pruning sketch below)
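A minimal sketch of scoring blocks for depth pruning by ablating one block at a time and measuring a proxy loss; the toy residual blocks and MSE proxy stand in for the paper's PPL-on-calibration-data criterion.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(8)])   # toy stand-in for Transformer blocks
x, target = torch.randn(64, 16), torch.randn(64, 16)            # hypothetical calibration batch

def proxy_loss(skip_idx=None):
    h = x
    for i, blk in enumerate(blocks):
        if i == skip_idx:
            continue                 # ablate this block
        h = h + blk(h)               # residual connection
    return nn.functional.mse_loss(h, target).item()

# Blocks whose removal changes the proxy loss the least are the best depth-pruning candidates.
scores = torch.tensor([proxy_loss(i) for i in range(len(blocks))])
print(scores.argsort())              # candidate removal order (least harmful first)
```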

2023

H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
During the prompt and generation phases, dynamically prunes unimportant tokens based on accumulated attention scores, maintaining a constant, small Key-Value cache (KV cache) of k tokens.
Note: Challenge | Code: PyTorch (see the heavy-hitter eviction sketch below)
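A minimal sketch of heavy-hitter KV-cache eviction driven by accumulated attention scores, as described above; the cache budget, random attention weights, and one-eviction-per-step policy are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
budget = 16                               # KV-cache budget (illustrative)
acc_score = torch.zeros(0)                # accumulated attention score per cached token
cache_positions = []                      # original positions of cached tokens

for t in range(64):                       # simulated decoding steps
    # Append the new token's KV entry with zero accumulated score.
    acc_score = torch.cat([acc_score, torch.zeros(1)])
    cache_positions.append(t)
    # Hypothetical attention weights of the current query over the cached tokens.
    attn = torch.softmax(torch.randn(len(cache_positions)), dim=0)
    acc_score += attn                     # accumulate attention mass per cached token
    if len(cache_positions) > budget:     # evict the weakest "heavy hitter"
        evict = acc_score.argmin().item()
        keep = [i for i in range(len(cache_positions)) if i != evict]
        acc_score = acc_score[keep]
        cache_positions = [cache_positions[i] for i in keep]

print(len(cache_positions), cache_positions[:5])   # constant cache size of 16 tokens
```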
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
A post-training method for pruning LLMs in one shot without any retraining, updating the remaining weights by solving a layer-wise weight-reconstruction problem.
Note: Challenge · Blog | Code: PyTorch (see the saliency sketch below)
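A minimal sketch of the OBS-style saliency w^2 / diag(H^-1) that underlies SparseGPT-like mask selection, with H approximated from calibration inputs plus damping; the sequential weight-update step of the full algorithm is omitted, and the damping constant and shapes are assumptions.

```python
import torch

torch.manual_seed(0)
n, d = 4096, 256
X = torch.randn(n, d)                          # hypothetical calibration inputs to a linear layer
W = torch.randn(128, d)                        # layer weights: (out_features, in_features)

# Layer-wise Hessian of the reconstruction objective: H = X^T X (plus damping for stability).
H = X.T @ X
H += 0.01 * H.diagonal().mean() * torch.eye(d)
Hinv_diag = torch.linalg.inv(H).diagonal()

# OBS-style saliency: removing weight w_ij costs roughly w_ij^2 / [H^-1]_jj.
saliency = W.pow(2) / Hinv_diag                # broadcasts over output rows

# Select a 50% mask per row from the saliency (the full method also updates remaining weights).
prune_idx = saliency.argsort(dim=1)[:, : d // 2]
mask = torch.ones_like(W).scatter_(1, prune_idx, 0.0).bool()
print(mask.float().mean())                     # ~0.5
```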
LLM-Pruner: On the Structural Pruning of Large Language Models
First discovers all coupled structures following DepGraph, then estimates the grouped importance of each coupled structure on calibration data, prunes the less important groups, and finally fine-tunes with efficient LoRA on the Alpaca dataset, which consists of 50K instruction-response pairs.
Note: Challenge | Code: PyTorch (see the Taylor-importance sketch below)
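A minimal sketch of first-order Taylor (|w * grad|) importance aggregated over a group, one common way such grouped scores are estimated; the toy layer, loss, and treating each output neuron as a group are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(32, 64)                        # toy layer; each output neuron is one "group"
x, y = torch.randn(128, 32), torch.randn(128, 64)

loss = nn.functional.mse_loss(layer(x), y)       # calibration loss
loss.backward()

# First-order Taylor importance of removing a group: sum of |w * grad| over its parameters.
W, G = layer.weight, layer.weight.grad
group_importance = (W * G).abs().sum(dim=1) + (layer.bias * layer.bias.grad).abs()

# Lowest-importance output neurons (and their coupled weights) would be pruned first.
print(group_importance.argsort()[:8])
```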
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
Revisits magnitude pruning and reports several interesting findings on pruning large-scale pre-trained models. Most results are reported after fine-tuning on downstream tasks, except those on modern-scale LLMs, where no retraining is performed.
Note: Challenge | Code: PyTorch (see the magnitude-pruning sketch below)
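A minimal sketch of one-shot global magnitude pruning, the baseline the paper revisits; the toy model and the 50% sparsity level are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))  # toy stand-in
sparsity = 0.5

# Global magnitude pruning: one threshold across all weight matrices.
all_weights = torch.cat([m.weight.detach().abs().flatten()
                         for m in model.modules() if isinstance(m, nn.Linear)])
threshold = all_weights.kthvalue(int(sparsity * all_weights.numel())).values

with torch.no_grad():
    for m in model.modules():
        if isinstance(m, nn.Linear):
            m.weight.mul_((m.weight.abs() > threshold).float())  # zero out small-magnitude weights

kept = sum((m.weight != 0).sum() for m in model.modules() if isinstance(m, nn.Linear))
print(kept.item() / all_weights.numel())          # ~0.5 of weights remain
```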
