- Cost Model (a back-of-the-envelope worked example follows this group)
  - Estimating GPU Memory Consumption of Deep Learning Models by Yanjie Gao et al., ESEC/FSE 2020
  - Cost Model for NAS/Cloud
    - Daydream: Accurately Estimating the Efficacy of Optimizations for DNN Training by Hongyu Zhu et al., USENIX ATC 2020
    - Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training by Geoffrey X. Yu et al., USENIX ATC 2021
    - To Bridge Neural Network Design and Real-World Performance: A Behaviour Study for Neural Networks by Xiaohu Tang et al., MLSys 2021
    - perf4sight: A Toolflow to Model CNN Training Performance on Edge GPUs by Aditya Rajagopal et al., arXiv 2021
    - nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices by Li Lyna Zhang et al., MobiSys 2021
    - Empirical Analysis and Modeling of Compute Times of CNN Operations on AWS Cloud by Ubaid Ullah Hafeez et al., IISWC 2020
    - Paleo: A Performance Model for Deep Neural Networks by Hang Qi et al., ICLR 2017
    - Augur: Modeling the Resource Requirements of Convolutional Neural Networks on Mobile Devices by Zongqing Lu et al., ACM Multimedia 2017
    - Performance Modelling of Deep Learning on Intel Many Integrated Core Architectures by Andre Viebke et al., arXiv 2019
  - Cost Model for Kernel Compilation
    - A Learned Performance Model for Tensor Processing Units by Samuel J. Kaufman et al., MLSys 2021
    - Iteration Time Prediction for CNN in Multi-GPU Platform: Modeling and Analysis by Ziqian Pei et al., IEEE Access 2019
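As a flavor of what the analytical models above (e.g., Paleo) compute, here is a back-of-the-envelope roofline-style estimate for a single convolution layer. The layer shape, peak throughput, and efficiency assumptions are illustrative only and do not come from any of the listed papers.

```python
# Roofline-style sketch of a per-layer cost model (illustrative numbers only).

def conv2d_flops(n, c_in, c_out, h_out, w_out, k):
    """Multiply-adds counted as 2 FLOPs each."""
    return 2 * n * c_out * h_out * w_out * c_in * k * k

def conv2d_bytes(n, c_in, c_out, h_in, w_in, h_out, w_out, k, dtype_bytes=4):
    """Input + weight + output traffic, ignoring cache reuse."""
    return dtype_bytes * (n * c_in * h_in * w_in
                          + c_out * c_in * k * k
                          + n * c_out * h_out * w_out)

# Assumed hardware: ~15 TFLOP/s sustained fp32, ~900 GB/s memory bandwidth.
PEAK_FLOPS, PEAK_BW = 15e12, 900e9

flops = conv2d_flops(n=32, c_in=256, c_out=256, h_out=14, w_out=14, k=3)
bytes_moved = conv2d_bytes(n=32, c_in=256, c_out=256,
                           h_in=14, w_in=14, h_out=14, w_out=14, k=3)

# The layer is bound by whichever resource takes longer.
t_compute = flops / PEAK_FLOPS
t_memory = bytes_moved / PEAK_BW
print(f"estimated time: {max(t_compute, t_memory) * 1e3:.3f} ms")
```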
- Distributed Training (a tensor-parallel sketch follows this group)
  - Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng et al., arXiv 2022
  - Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism by Mohammad Shoeybi et al., arXiv 2019
  - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning by Samyam Rajbhandari et al., SC 2021
  - Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc by Zhihao Jia et al., MLSys 2020
  - A Distributed Multi-GPU System for Fast Graph Processing by Zhihao Jia et al., VLDB 2017
  - Device Placement Optimization with Reinforcement Learning by Azalia Mirhoseini et al., ICML 2017
  - DUET: A Compiler-Runtime Subgraph Scheduling Approach for Tensor Programs on a Coupled CPU-GPU Architecture by Minjia Zhang et al., IEEE IPDPS 2021
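The parallelism strategies surveyed above are easiest to see on a toy example. The sketch below illustrates Megatron-style column parallelism for one linear layer, simulated on a single process for readability (an assumption; the real systems shard across GPUs and communicate via torch.distributed).

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 512)        # (batch, hidden)
w = torch.randn(512, 2048)     # full weight of one linear layer

# Column parallelism: shard the output dimension across two "workers".
w0, w1 = w.chunk(2, dim=1)
y0 = x @ w0                    # each worker multiplies against its own shard
y1 = x @ w1
y = torch.cat([y0, y1], dim=1) # all-gather along the sharded dimension

assert torch.allclose(y, x @ w, atol=1e-3)
```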
- Gradient Checkpointing (a minimal PyTorch sketch follows this group)
  - Training Deep Nets with Sublinear Memory Cost by Tianqi Chen et al., arXiv 2016
  - Efficient Rematerialization for Deep Networks by Ravi Kumar et al., NeurIPS 2019
  - Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization by Paras Jain et al., MLSys 2020
  - Dynamic Tensor Rematerialization by Marisa Kirisame et al., ICLR 2021
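A minimal activation-checkpointing sketch with PyTorch's built-in utility, which implements the recompute-in-backward idea from Chen et al. The model and shapes are placeholders, and `use_reentrant=False` assumes a reasonably recent PyTorch release.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
)
x = torch.randn(32, 1024, requires_grad=True)

# The block's intermediate activations are not kept; they are recomputed
# during backward, trading extra FLOPs for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```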
- Gradient Checkpointing + Distributed Training
  - Reducing Activation Recomputation in Large Transformer Models by Vijay Korthikanti et al., arXiv 2022
- Kernel Fusion (a fusion sketch follows this group)
  - Data Movement Is All You Need: A Case Study on Optimizing Transformers by Andrei Ivanov et al., MLSys 2021
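Elementwise chains such as bias-add followed by GeLU are memory-bandwidth bound, which is the kind of data movement the MLSys 2021 paper above targets. A small sketch of letting TorchScript fuse such a chain; fusion only actually happens on GPU, and the constants are the usual tanh-GeLU approximation.

```python
import torch

@torch.jit.script
def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # One fused kernel avoids writing the intermediate (x + bias) to DRAM.
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.7978845608 * (y + 0.044715 * y * y * y)))

out = bias_gelu(torch.randn(1024, 1024), torch.randn(1024))
```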
- Compression/Quantization (a toy activation-compression sketch follows this group)
  - Gist: Efficient Data Encoding for Deep Neural Network Training by Animesh Jain et al., ISCA 2018
  - Gradient Compression Supercharged High-Performance Data Parallel DNN Training by Youhui Bai et al., SOSP 2021
  - GACT: Activation Compressed Training for Generic Network Architectures by Xiaoxuan Liu et al., ICML 2022
  - On the Utility of Gradient Compression in Distributed Training Systems by Saurabh Agarwal et al., MLSys 2022
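A toy illustration of activation-compressed training using PyTorch's saved-tensor hooks: tensors saved for backward are kept in fp16 and decompressed on use. This conveys only the general idea, not the encoding schemes of Gist or GACT, and it introduces a small gradient approximation.

```python
import torch

def pack(t: torch.Tensor):          # called when autograd saves a tensor
    return t.to(torch.float16)

def unpack(t16: torch.Tensor):      # called when backward needs it again
    return t16.to(torch.float32)

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU(),
                            torch.nn.Linear(512, 512))
x = torch.randn(64, 512)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    loss = model(x).pow(2).mean()
loss.backward()
```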
- Swapping (an offloading sketch follows this group)
  - Optimal GPU-CPU Offloading Strategies for Deep Neural Network Training by Olivier Beaumont et al., European Conference on Parallel Processing 2020
  - SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping by Chien-Chin Huang et al., ASPLOS 2020
  - Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers by Youjie Li et al., VLDB 2022
  - STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training by Xiaoyang Sun et al., SC 2022
  - ZeRO-Offload: Democratizing Billion-Scale Model Training by Jie Ren et al., USENIX ATC 2021
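The systems above decide which tensors to move between GPU and host memory and when. PyTorch exposes a simple built-in form of this for activations; a minimal sketch, where pinned memory (which enables asynchronous copies) is only requested when CUDA is available:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.ReLU(),
                            torch.nn.Linear(4096, 1024))
x = torch.randn(16, 1024)

# Tensors saved for backward are swapped to host memory during forward
# and copied back to the compute device when backward needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=torch.cuda.is_available()):
    loss = model(x).sum()
loss.backward()
```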
- Swapping + Pipeline Parallelism
  - Harmony: Overcoming the Hurdles of GPU Memory Capacity to Train Massive DNN Models on Commodity Servers by Youjie Li et al., VLDB 2022
- Swapping + Gradient Checkpointing
  - Capuchin: Tensor-Based GPU Memory Management for Deep Learning by Xuan Peng et al., ASPLOS 2020
  - Efficient Combination of Rematerialization and Offloading for Training DNNs by Olivier Beaumont et al., NeurIPS 2021
  - POET: Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging by Shishir G. Patil et al., ICML 2022
- Memory Allocator (a toy memory-planning sketch follows this group)
  - OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks by Benoit Steiner et al., arXiv 2022
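To make the memory-planning problem concrete, here is a toy greedy planner that assigns address offsets to tensors from their lifetimes, letting tensors whose lifetimes never overlap reuse the same addresses. OLLA formulates this jointly with scheduling as an optimization problem; the greedy heuristic and the example tensor list below are ours.

```python
def plan(tensors):
    """Greedy static memory planning.

    tensors: list of (name, first_use, last_use, size_bytes)
    returns: ({name: (offset, size)}, peak_bytes)
    """
    placements = {}
    for name, start, end, size in sorted(tensors, key=lambda t: -t[3]):
        # Address ranges already taken by tensors with overlapping lifetimes.
        busy = sorted(
            (placements[n][0], placements[n][0] + placements[n][1])
            for n, s, e, _ in tensors
            if n in placements and not (e < start or end < s)
        )
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break          # fits into the gap before this busy range
            offset = max(offset, hi)
        placements[name] = (offset, size)
    peak = max(off + sz for off, sz in placements.values())
    return placements, peak

# Three tensors totalling 10 bytes fit into a peak of 8 bytes because
# "act2" and "grad" are never live at the same time and share addresses.
tensors = [("act1", 0, 3, 4), ("act2", 1, 2, 2), ("grad", 3, 5, 4)]
print(plan(tensors))
```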
- Efficient Optimizer (a memory-accounting sketch follows this group)
  - ZeRO: Memory Optimizations Toward Training Trillion Parameter Models by Samyam Rajbhandari et al., SC 2020
  - 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed by Hanlin Tang et al., ICML 2021
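The motivation for ZeRO is easy to reproduce with the memory accounting from the paper: mixed-precision Adam keeps fp16 parameters and gradients plus fp32 master weights, momentum, and variance, i.e. roughly 16 bytes per parameter before any sharding. The helper below (function name and stage handling are ours) shows how the per-GPU footprint shrinks as each ZeRO stage partitions more of that state.

```python
def model_state_gb(n_params, zero_stage=0, n_gpus=1):
    # Bytes per parameter: fp16 params, fp16 grads, fp32 Adam state (master
    # weights + momentum + variance).
    params, grads, optim = 2.0, 2.0, 12.0
    if zero_stage >= 1:
        optim /= n_gpus                     # stage 1: shard optimizer states
    if zero_stage >= 2:
        grads /= n_gpus                     # stage 2: also shard gradients
    if zero_stage >= 3:
        params /= n_gpus                    # stage 3: also shard parameters
    return n_params * (params + grads + optim) / 1e9

print(model_state_gb(7.5e9))                           # ~120 GB: too big for one GPU
print(model_state_gb(7.5e9, zero_stage=1, n_gpus=64))  # ~31 GB per GPU with sharded Adam state
```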
- Hardware Related (an attention-kernel usage sketch follows this group)
  - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Tri Dao et al., NeurIPS 2022
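Naive attention materializes the full (sequence × sequence) score matrix in GPU memory; FlashAttention computes the same result in tiles that stay in on-chip SRAM. In PyTorch 2.x a fused kernel is reachable through `scaled_dot_product_attention`; whether it actually dispatches to FlashAttention depends on your GPU, dtype, and build (an assumption about the environment), otherwise a fallback implementation is used.

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence length, head dim)
q = torch.randn(1, 8, 1024, 64)
k = torch.randn(1, 8, 1024, 64)
v = torch.randn(1, 8, 1024, 64)

# Fused, memory-efficient attention when the backend supports it; in that
# case the (1024 x 1024) score matrix is never materialized in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```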
- PyTorch Internals
- Profiler Trace File (a trace-export sketch follows this group)
  - Characterizing Deep Learning Training Workloads on Alibaba-PAI by Mengdi Wang et al., IISWC 2019
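Workload characterization studies like the Alibaba-PAI paper start from profiler traces. A minimal sketch of producing such a trace with PyTorch's profiler (the toy model and step count are placeholders); the resulting JSON can be opened in chrome://tracing or Perfetto.

```python
import torch
from torch.profiler import ProfilerActivity, profile

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(64, 512)

# Add ProfilerActivity.CUDA to the activities list when profiling on GPU.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(3):
        loss = model(x).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()

prof.export_chrome_trace("trace.json")
```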