This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.
- Awesome Resource-Efficient LLM Papers
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Mixed Precision Training | FP8-LM: Training FP8 Large Language Models | arXiv |
2022 | Mixed Precision Training | BLOOM: A 176B-Parameter Open-Access Multilingual Language Model | arXiv |
2018 | Mixed Precision Training | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | NAACL |
2017 | Mixed Precision Training | Mixed Precision Training | ICLR |
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Data augmentation | LLMRec: Large Language Models with Graph Augmentation for Recommendation | WSDM |
2024 | Data augmentation | LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition | arXiv |
2023 | Data augmentation | MixGen: A New Multi-Modal Data Augmentation | WACV |
2023 | Data augmentation | Augmentation-Aware Self-Supervision for Data-Efficient GAN Training | NeurIPS |
2023 | Data augmentation | Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis | EMNLP |
2023 | Data augmentation | FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization | EMNLP |
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Training objective | Challenges and Applications of Large Language Models | arXiv |
2023 | Training objective | Efficient Data Learning for Open Information Extraction with Pre-trained Language Models | EMNLP |
2023 | Masked language-image modeling | Scaling Language-Image Pre-training via Masking | CVPR |
2022 | Masked image modeling | Masked Autoencoders Are Scalable Vision Learners | CVPR |
2019 | Masked language modeling | MASS: Masked Sequence to Sequence Pre-training for Language Generation | ICML |
Date | Keywords | Paper | Venue |
---|---|---|---|
2024 | Hardware optimization | LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System | arXiv |
2024 | Hardware optimization | LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration | arXiv |
2023 | Hardware offloading | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | PMLR |
2023 | Hardware offloading | Fast Distributed Inference Serving for Large Language Models | arXiv |
2022 | Collaborative inference | Petals: Collaborative Inference and Fine-tuning of Large Models | arXiv |
2022 | Hardware offloading | DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | IEEE SC22 |
Date | Keywords | Paper | Venue |
---|---|---|---|
2023 | Other Systems | Tabi: An Efficient Multi-Level Inference System for Large Language Models | EuroSys |
2023 | Other Systems | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation | PACMMOD |
Metric | Description | Example Usage |
---|---|---|
FLOPs (Floating-point operations) | the number of arithmetic operations on floating-point numbers | [FLOPs] |
Training Time | the total duration required for training, typically measured in wall-clock minutes, hours, or days | [minutes, days] [hours] |
Inference Time/Latency | the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time, in milliseconds or seconds | [end-to-end latency in seconds] [next token generation latency in milliseconds] |
Throughput | the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) | [tokens/s] [queries/s] |
Speed-Up Ratio | the improvement in inference speed compared to a baseline model | [inference time speed-up] [throughput speed-up] |
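
As a rough illustration of how the throughput and speed-up metrics above relate, here is a minimal Python sketch. The `RunStats` container and all function names are hypothetical helpers introduced for this example, not part of any library or of the survey, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Summary of one benchmark run (hypothetical container for this sketch)."""
    total_seconds: float  # wall-clock duration of the whole run
    output_tokens: int    # tokens generated across all requests
    num_requests: int     # requests (queries) completed


def tokens_per_second(run: RunStats) -> float:
    """Throughput in tokens/s (TPS)."""
    return run.output_tokens / run.total_seconds


def queries_per_second(run: RunStats) -> float:
    """Throughput in queries/s (QPS)."""
    return run.num_requests / run.total_seconds


def latency_speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Speed-up ratio from inference time: values above 1 mean the new system is faster."""
    return baseline_seconds / new_seconds


def throughput_speedup(baseline: RunStats, new: RunStats) -> float:
    """Speed-up ratio from throughput: new TPS divided by baseline TPS."""
    return tokens_per_second(new) / tokens_per_second(baseline)


# Illustrative numbers only: same workload, baseline vs. an optimized system.
baseline = RunStats(total_seconds=100.0, output_tokens=5000, num_requests=50)
optimized = RunStats(total_seconds=40.0, output_tokens=5000, num_requests=50)
print(tokens_per_second(baseline), tokens_per_second(optimized))
print(throughput_speedup(baseline, optimized))
```
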
Metric | Description | Example Usage |
---|---|---|
Number of Parameters | the number of adjustable variables in the LLM’s neural network | [number of parameters] |
Model Size | the storage space required for storing the entire model | [peak memory usage in GB] |
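
A back-of-the-envelope link between the two metrics above is that weight storage is roughly the parameter count times the bytes per parameter. The sketch below assumes this simple model; the `BYTES_PER_PARAM` table and `weight_size_gb` are hypothetical names for this example. It covers weights only and ignores activations, optimizer states, and KV cache, so it will generally underestimate peak memory usage.

```python
# Bytes used to store one parameter at common precisions
# (assumption: dense weights, no quantization metadata or padding).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Illustrative: a 7-billion-parameter model at different precisions.
for dtype in ("fp32", "fp16", "int4"):
    print(dtype, weight_size_gb(7_000_000_000, dtype))
```
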
Metric | Description | Example Usage |
---|---|---|
Energy Consumption | the electrical energy consumed during the LLM’s lifecycle | [kWh] |
Carbon Emission | the greenhouse gas emissions associated with the model’s energy usage | [kgCO2eq] |
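
To make the units in the table above concrete, the following Python sketch converts an average power draw into kWh and then into kgCO2eq. It is a simplified estimate under stated assumptions: the function names are hypothetical, the PUE (power usage effectiveness) multiplier and the grid carbon intensity are inputs you must supply, and the example values are invented rather than measured.

```python
def energy_kwh(avg_power_watts: float, num_devices: int, hours: float, pue: float = 1.0) -> float:
    """Energy in kWh: average per-device power x device count x hours, scaled by data-center PUE."""
    return avg_power_watts * num_devices * hours / 1000.0 * pue


def carbon_kgco2eq(energy_in_kwh: float, intensity_kg_per_kwh: float) -> float:
    """Emissions in kgCO2eq: energy multiplied by the grid's carbon intensity."""
    return energy_in_kwh * intensity_kg_per_kwh


# Illustrative values only: 8 accelerators at 300 W for 24 hours,
# on a grid assumed to emit 0.4 kgCO2eq per kWh.
energy = energy_kwh(avg_power_watts=300.0, num_devices=8, hours=24.0)
print(energy, carbon_kgco2eq(energy, intensity_kg_per_kwh=0.4))
```
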
The following are available software packages designed for real-time tracking of energy consumption and carbon emission.
You might also find the following helpful for predicting the energy usage and carbon footprint before actual training or
Metric | Description | Example Usage |
---|---|---|
Dollars per parameter | the total cost of training (or running) the LLM divided by the number of parameters | |
Metric | Description | Example Usage |
---|---|---|
Communication Volume | the total amount of data transmitted across the network during a specific LLM execution or training run | [communication volume in TB] |
Metric | Description | Example Usage |
---|---|---|
Compression Ratio | the reduction in size of the compressed model compared to the original model | [compress rate] [percentage of weights remaining] |
Loyalty/Fidelity | the resemblance between the teacher and student models, in terms of both the consistency of their predictions and the alignment of their predicted probability distributions | [loyalty] [fidelity] |
Robustness | the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output | [after-attack accuracy, query number] |
Pareto Optimality | the optimal trade-offs between various competing factors | [Pareto frontier (cost and accuracy)] [Pareto frontier (performance and FLOPs)] |
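
Two of the metrics above lend themselves to a short worked example. The Python sketch below shows one way to compute a compression ratio and a cost/accuracy Pareto frontier. All names are hypothetical, the data points are invented, and the convention used here (ratio = original size / compressed size) is only one of several in use; some papers report the fraction of weights remaining instead.

```python
from typing import List, Tuple


def compression_ratio(original_size: float, compressed_size: float) -> float:
    """Original size divided by compressed size, e.g. 4.0 means 4x smaller."""
    return original_size / compressed_size


def fraction_remaining(original_size: float, compressed_size: float) -> float:
    """Share of the original size (or weights) kept after compression."""
    return compressed_size / original_size


def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (cost, accuracy) points not dominated by any other point.

    Lower cost and higher accuracy are better. After sorting by cost,
    a point is kept only if it beats the best accuracy seen so far.
    """
    frontier = []
    best_accuracy = float("-inf")
    for cost, accuracy in sorted(points, key=lambda p: (p[0], -p[1])):
        if accuracy > best_accuracy:
            frontier.append((cost, accuracy))
            best_accuracy = accuracy
    return frontier


# Illustrative (cost, accuracy) pairs for five hypothetical model variants.
candidates = [(1.0, 0.70), (2.0, 0.80), (2.5, 0.78), (4.0, 0.85), (4.0, 0.83)]
print(pareto_frontier(candidates))
print(compression_ratio(14.0, 3.5), fraction_remaining(14.0, 3.5))
```
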
Benchmark | Description | Paper |
---|---|---|
General NLP Benchmarks | an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD | A Comprehensive Overview of Large Language Models |
Dynaboard | an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with customizable Dynascore | Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking |
EfficientQA | an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems | NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned |
SustaiNLP 2020 Shared Task | a challenge for the development of energy-efficient NLP models that assesses their performance across eight NLU tasks using SuperGLUE metrics and evaluates their energy consumption during inference | Overview of the SustaiNLP 2020 Shared Task |
ELUE (Efficient Language Understanding Evaluation) | a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission | Towards Efficient NLP: A Standard Evaluation and A Strong Baseline |
VLUE (Vision-Language Understanding Evaluation) | a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models |
Long Range Arena (LRA) | a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency | Long Range Arena: A Benchmark for Efficient Transformers |
Efficiency-aware MS MARCO | an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems | Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking |
If you find this paper list useful in your research, please consider citing:
@article{bai2024beyond,
title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
journal={arXiv preprint arXiv:2401.00625},
year={2024}
}