Loghub maintains a collection of system logs that are freely accessible for research purposes. Some of the logs are production data released from previous studies, while others were collected from real systems in our lab environment. Wherever possible, the logs are NOT sanitized, anonymized, or modified in any way. Together, these logs amount to over 77GB.
🔭 If you use the loghub datasets in your research for publication, please kindly cite the following paper.
- Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. Loghub: A Large Collection of System Log Datasets towards Automated Log Analytics. arXiv, 2020.
| Software System | Description | Labeled | Time Span | #Messages | Data Size | 
|---|---|---|---|---|---|
| Distributed systems | |||||
| HDFS_1 | Hadoop distributed file system log | ✔️ | 38.7 hours | 11,175,629 | 1.47GB | 
| HDFS_2 | Hadoop distributed file system log | | N.A. | 71,118,073 | 16.06GB |
| Hadoop | Hadoop mapreduce job log | ✔️ | N.A. | 394,308 | 48.61MB | 
| Spark | Spark job log | | N.A. | 33,236,604 | 2.75GB |
| Zookeeper | ZooKeeper service log | | 26.7 days | 74,380 | 9.95MB |
| OpenStack | OpenStack infrastructure log | ✔️ | N.A. | 207,820 | 58.61MB | 
| Supercomputers | |||||
| BGL | Blue Gene/L supercomputer log | ✔️ | 214.7 days | 4,747,963 | 708.76MB | 
| HPC | High performance cluster log | | N.A. | 433,489 | 32.00MB |
| Thunderbird | Thunderbird supercomputer log | ✔️ | 244 days | 211,212,192 | 29.60GB | 
| Operating systems | |||||
| Windows | Windows event log | | 226.7 days | 114,608,388 | 26.09GB |
| Linux | Linux system log | | 263.9 days | 25,567 | 2.25MB |
| Mac | Mac OS log | | 7.0 days | 117,283 | 16.09MB |
| Mobile systems | |||||
| Android | Android framework log | | N.A. | 1,555,005 | 183.37MB |
| HealthApp | Health app log | | 10.5 days | 253,395 | 22.44MB |
| Server applications | |||||
| Apache | Apache web server error log | | 263.9 days | 56,481 | 4.90MB |
| OpenSSH | OpenSSH server log | | 28.4 days | 655,146 | 70.02MB |
| Standalone software | |||||
| Proxifier | Proxifier software log | | N.A. | 21,329 | 2.42MB |
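
As a quick sanity check on the "over 77GB" figure, the sizes in the Data Size column can be summed with a few lines of Python. This is only an illustrative sketch: the values are copied by hand from the table, the names `DATA_SIZES` and `size_in_gb` are made up for this example, and it assumes decimal units (1 GB = 1000 MB), which the table itself does not specify.

```python
import re

# Sizes copied by hand from the "Data Size" column of the table above.
DATA_SIZES = {
    "HDFS_1": "1.47GB", "HDFS_2": "16.06GB", "Hadoop": "48.61MB",
    "Spark": "2.75GB", "Zookeeper": "9.95MB", "OpenStack": "58.61MB",
    "BGL": "708.76MB", "HPC": "32.00MB", "Thunderbird": "29.60GB",
    "Windows": "26.09GB", "Linux": "2.25MB", "Mac": "16.09MB",
    "Android": "183.37MB", "HealthApp": "22.44MB",
    "Apache": "4.90MB", "OpenSSH": "70.02MB", "Proxifier": "2.42MB",
}

# Assumption: decimal units. The README does not state which convention it uses.
MB_PER_GB = 1000.0


def size_in_gb(size: str) -> float:
    """Convert a string such as '48.61MB' or '1.47GB' to gigabytes."""
    match = re.fullmatch(r"([\d.]+)\s*(MB|GB)", size.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {size!r}")
    value, unit = float(match.group(1)), match.group(2)
    return value if unit == "GB" else value / MB_PER_GB


total_gb = sum(size_in_gb(s) for s in DATA_SIZES.values())
print(f"{len(DATA_SIZES)} datasets, {total_gb:.2f} GB in total")
```

Under the decimal-unit assumption the total comes to about 77.1 GB, which is consistent with the figure quoted in the introduction.
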
We host only a small sample (2k lines) of each dataset on GitHub. If you are interested in these datasets, please download the raw logs from Zenodo.
🔭 We are proud to announce that the loghub datasets have been downloaded more than 48,000 times by over 380 organizations (incomplete list) from both industry and academia.
- [ASE'19] Jinyang Liu, Jieming Zhu, Shilin He, Pinjia He, Zibin Zheng, Michael R. Lyu. Logzip: Extracting Hidden Structures via Iterative Clustering for Log Compression. IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019.
- [ICSE'19] Jieming Zhu, Shilin He, Jinyang Liu, Pinjia He, Qi Xie, Zibin Zheng, Michael R. Lyu. Tools and Benchmarks for Automated Log Parsing. International Conference on Software Engineering (ICSE), 2019.
- [TKDE'18] Min Du, Feifei Li. Spell: Online Streaming Parsing of Large Unstructured System Logs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2018.
- [TDSC'18] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. Towards Automated Log Parsing for Large-Scale Log Data Analysis. IEEE Transactions on Dependable and Secure Computing (TDSC), 2018.
- [CCS'17] Min Du, Feifei Li, Guineng Zheng, Vivek Srikumar. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. ACM Conference on Computer and Communications Security (CCS), 2017.
- [ICWS'17] Pinjia He, Jieming Zhu, Zibin Zheng, Michael R. Lyu. Drain: An Online Log Parsing Approach with Fixed Depth Tree. IEEE International Conference on Web Services (ICWS), 2017.
- [ICSE'16] Qingwei Lin, Hongyu Zhang, Jian-Guang Lou, Yu Zhang, Xuewei Chen. Log Clustering Based Problem Identification for Online Service Systems. International Conference on Software Engineering (ICSE), 2016.
- [DSN'16] Pinjia He, Jieming Zhu, Shilin He, Jian Li, Michael R. Lyu. An Evaluation Study on Log Parsing and Its Use in Log Mining. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2016.
- [ISSRE'16] Shilin He, Jieming Zhu, Pinjia He, Michael R. Lyu. Experience Report: System Log Analysis for Anomaly Detection. IEEE International Symposium on Software Reliability Engineering (ISSRE), 2016.
- [KDD'09] Adetokunbo Makanju, A. Nur Zincir-Heywood, Evangelos E. Milios. Clustering Event Logs Using Iterative Partitioning. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2009.
- [SOSP'09] Wei Xu, Ling Huang, Armando Fox, David A. Patterson, Michael I. Jordan. Detecting Large-Scale System Problems by Mining Console Logs. ACM Symposium on Operating Systems Principles (SOSP), 2009.
- [DSN'07] Adam J. Oliner, Jon Stearley. What Supercomputers Say: A Study of Five System Logs. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2007.
The following links point to additional log datasets related to security research.
- VizSec Datasets: https://vizsec.org/data
- Security Repo: http://www.secrepo.com
- Public Security Log Sharing Site: http://log-sharing.dreamhosters.com
- The Computer Failure Data Repository: https://www.usenix.org/cfdr
- EDGAR Log File Data Set: https://www.sec.gov/dera/data/edgar-log-file-data-set.html
For any questions or feedback, please raise an issue here.
The log datasets are freely available for research purposes.
