Root cause analysis (RCA) is a task of identifying the underlying causes of system faults/failures by analyzing the system monitoring data. LEMMA-RCA is a collection of multi-modal datasets with various real system faults to facilitate future research in RCA. It is also a multi-domain dataset, encompassing real-world applications such as microservice and water treatment/distribution systems. The datasets are released under the CC BY-NC 4.0 license and hosted on Huggingface, the codes are available on Github.
Each dataset contains various system faults simulated from real-world scenarios. For details, please check our website.
LEMMA-RCA covers two domains and we provide both the raw data and preprocessed data. The detailed data statistics can be found in Lemma-RCA Webpage.
- For the raw data version, we provide all json files where the microservice system stores both the metric data, the log data and even trace data. Users are expected to extract these two modalities by themself. The goal of the raw data is to provide the users more choice of preprocessing the raw data.
- For the preprocessed data, we have extracted the metric data and unstructured log data for each pod. The users may use their own methods to preprocessed log data or use the provided code to preprocess the log data and convert it to time-series data. For instance, the code to preprocess the data in the IT domain is stored in the following directory.
If you want to directly test the performance of these baseline methods, you may choose to download the preprocessed data.
cd ./IT/data_preprocessing
- IT Operations (Product Review and Cloud Computing)
- OT Operations (Water Treatment/Distribution)
LEMMA-RCA datasets are evaluated with eight causal learning baselines in four settings: online/offline with single/multiple modality data.
Example: Using FastPC to evalute the Performance of Case 20211203 in Product Review
You need to download both log and metric data if you would like to test the performance of FastPC on multi-modal data.
Notice: If you want to use metric data only, you can skip step 2 to step 5 and move directly to step 6 to detect root cause with metric data.
cd ./IT/data_preprocessing
Step 3: Extract useful log information (such as pod/node names, log messages, etc.) from original elasticsearch log (json format)
python json2message.py
Notice: Some of the arguments may need to change
--path, the input directory of the json format log data
--output_dir, the output directory of all log messages
--output_dir2, the output directory of pod-level log messages for each pod
--output_dir3, the output directory of node-level log messages for each node
python drain3_parse.py ./output/log_prep_node/ -o "./drain3_result/node"
python drain3_parse.py ./output/log_prep_pod/ -o "./drain3_result/pod"
--input_dir, default="./output/log_prep_node/" or "./output/log_prep_pod/"
--output_dir, default="./drain3_result/node" or "./drain3_result/pod"
python log_frequency_extraction.py --log_dir ./input_path/ --output_dir ./output_path
python log_golden_frequency.py --root_path ./input_path/ --output_dir ./output_path --save_dir ./output_path
- Notice that you need to change the path of data, dataset name and output directory.
python test_FastPC_pod_metric.py --dataset 20211203 --path_dir CHANGE_PATH_TO_DATASET_DIRECTORY --output_dir CHANGE_PATH_TO_OUTPUT_DIRECTORY
You may also test the performance of FastPC with log data or two modalities with the following command:
python test_FastPC_pod_log.py ## for log data only
python test_FastPC_pod_combine.py ## for both metric and log data
If you encounter the error regarding "name 'LIBSPOT' is not defined", please double-check if you are running the code in the directory of FastPC. We observe such an error if the command is 'python FastPC/test_FastPC_pod_log.py' running in the directory of './rca_baselines/Baseline/offline/'.
The results will be stored in the csv file as follows:
./Baseline/offline/output/Pod_level_combine_ranking.csv
The root cause for 20211203 (MongoDB-v1) can be found in the readme.pptx file in the folder of downloaded preprocessed data.
Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 International License