This repo provides related information for XES3G5M: A Knowledge Tracing Benchmark Dataset with Auxiliary Information.
You can download XES3G5M from https://drive.google.com/file/d/1eFiIYyh5O2V90RA0brammGH6EpHvPDQe/view. By downloading the data, you agree to the license.
After preprocessing, we provide the files generated by pyKT and the metadata files that involve auxiliary information. Specifically, the published data is stored in a data directory named XES3G5M. In the directory, there are the following files and folders:
The kc_level folder for the models which need KC level data to train and test:
- train_valid_sequences.csv: this is the main data file for the KT task, used for offline model training and validation. Each student interaction sequence at the question level is first expanded to the KC level when a question is associated with multiple KCs. After that, each sequence is truncated into sub-sequences of length 200. The data has been randomly split into 5 folds. The columns of this file are described as follows:
- fold: the unique id of each fold, ranging from 0 to 4;
- uid: the unique internal id of the user;
- questions: the sequence of internal question ids;
- concepts: the corresponding sequence of internal KC ids; in this file, we use only the leaf nodes of the KCs in the KC routes;
- responses: the correctness of the students' answers, "1" means right, "0" means wrong;
- timestamps: the regular start timestamps of the interactions, at millisecond level;
- selectmasks: if the current sequence is shorter than 200, we pad it to 200; "-1" indicates that the corresponding interaction is padded or only used as history, and must be ignored when calculating the loss and evaluating the models' performance;
- is_repeat: "1" indicates that the current KC and its previous KC belong to the same question, "0" means this is a new question.
- test_question_window_sequences: the test data file for question level prediction. Each student's sequence is truncated to length 200. This data file is used to predict each KC for each question and to aggregate the KC predictions in 4 different ways, as described in pyKT. The columns in this file are consistent with those in "train_valid_sequences.csv"; the additional columns are:
- qidxs: the interaction index sequence in the test dataset, each question at a particular timestamp has only one index;
- rest: the number of remaining KCs of the interaction in the original sequences, which is used when fusing the predictions;
- orirow: the corresponding row numbers of the interactions in the test dataset.
- test.csv: this is the test data file used to evaluate the KC level models' performance in the multi-step ahead prediction scenario. Each row represents a test student interaction sequence. There are 3,613 student interaction sequences in this file. The columns in this file are the same as those in "test_question_window_sequences" except for "cidxs":
- cidxs: the interaction index sequence in the test dataset, the KCs from the same question at a particular timestamp have different indexes.
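A minimal sketch of how a row of "train_valid_sequences.csv" might be parsed and masked, using only the Python standard library. It assumes that each sequence column is stored as a comma-separated string inside one CSV cell and that padded positions hold "-1"; neither detail is stated above, so check them against the actual file. The helper names (`parse_row`, `scored_positions`, `read_sequences`) are hypothetical, and the in-memory sample uses length 4 instead of 200 for brevity.

```python
import csv
import io

# Column names come from the description above; the cell encoding is an assumption.
SEQUENCE_COLUMNS = (
    "questions", "concepts", "responses",
    "timestamps", "selectmasks", "is_repeat",
)


def parse_row(row, sep=","):
    """Convert one CSV row (a dict of strings) into a dict of integer lists."""
    parsed = {"fold": int(row["fold"]), "uid": row["uid"]}
    for name in SEQUENCE_COLUMNS:
        parsed[name] = [int(value) for value in row[name].split(sep)]
    return parsed


def scored_positions(parsed):
    """Return the indexes that count toward the loss and the evaluation.

    Positions whose selectmask is -1 are padding or history-only and are skipped.
    """
    return [i for i, mask in enumerate(parsed["selectmasks"]) if mask != -1]


def read_sequences(handle):
    """Parse every row of an open CSV file (or any file-like object)."""
    return [parse_row(row) for row in csv.DictReader(handle)]


# Tiny made-up stand-in for the real file.
sample = io.StringIO(
    "fold,uid,questions,concepts,responses,timestamps,selectmasks,is_repeat\n"
    '0,42,"10,10,11,-1","3,4,5,-1","1,1,0,-1","1000,1000,2000,-1","1,1,1,-1","0,1,0,-1"\n'
)
rows = read_sequences(sample)
keep = scored_positions(rows[0])
labels = [rows[0]["responses"][i] for i in keep]
```

With the real data, `read_sequences(open(path, newline=""))` would replace the in-memory sample, and rows could be grouped by `fold` to build the training and validation splits.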
The question_level folder for the models which need question level data to train and test:
- train_valid_sequences_quelevel.csv: this is the data file for KT models which need question level data to train. The columns are the same as in "train_valid_sequences.csv". Note that in the concepts column, "_" is used to separate the KCs of questions that have multiple KCs;
- test_window_sequences_quelevel.csv: the question level test data file for evaluating the KT models' prediction performance. It has the same columns as "train_valid_sequences_quelevel.csv";
- test_quelevel.csv: this is the original test data file used to evaluate the question level models' performance in the multi-step ahead prediction scenario. The columns are consistent with those in "train_valid_sequences_quelevel.csv".
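The "_" convention in the concepts column of the question level files can be unpacked as in the sketch below. It assumes the sequence itself is a comma-separated string in which each entry is either a single KC id or several ids joined by "_"; the function name `split_question_concepts` is hypothetical.

```python
def split_question_concepts(concepts_cell, seq_sep=",", kc_sep="_"):
    """Split a question level concepts cell into one list of KC ids per question.

    seq_sep separates the interactions in the sequence (assumed to be a comma);
    kc_sep separates the KCs of a single question ("_" per the description above).
    """
    return [
        [int(kc) for kc in entry.split(kc_sep)]
        for entry in concepts_cell.split(seq_sep)
    ]


# Made-up example: the first question has two KCs, the second one, the third three.
kcs = split_question_concepts("3_4,5,7_8_9")
```

Keeping one list per question preserves the alignment with the questions and responses columns, which have exactly one entry per interaction.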
The metadata folder of the metadata for questions:
- questions.json: this file contains the detailed information of the questions. The data is a dictionary in which each question is a key-value item: the key is the question id, and the value is itself a dictionary with the following items:
- content: the question textual content;
- kc_routes: the text of the KC routes;
- answer: the right answer of the question;
- options: the options of the question if it is a multi-choice question; for a fill-in-the-blank question, the value of this key is null;
- analysis: the textual content of the detailed analysis for resolving the question;
- type: the question type. 0: fill-in-the-blank question; 1: multi-choice question.
- kc_routes_map.json: this is the mapping file between the indexes of all KCs and their corresponding text;
- embeddings: we additionally provide the embedding files of our questions, analysis and KCs. Specifically, we further pretrain the large-scale language model RoBERTa on the exercise data from a K-12 online learning platform in China. Then we obtain the semantic representations by averaging all the word-level representations of each question, analysis and KC. The embeddings folder includes "qid2content_emb.json", "qid2analysis_emb.json", and "cid2content_emb.json", which are the embeddings of questions, analysis and KCs respectively. In each JSON file, each key is the question ID or KC ID, and the value is the corresponding embedding;
- images: this is the folder of images for all questions. Each image file name corresponds to a question, and each question may have 0 or more images. The image names follow the format "question_qid-image_index" in question contents and options, and "analysis_qid-image_index" in question analysis. The image links in the contents have also been replaced with these names.
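As a rough illustration of how the metadata files might be used together, the sketch below reads a questions.json style dictionary and compares two question embeddings by cosine similarity. The sample values are made up: the exact value types of `kc_routes`, `answer` and `options`, and the embedding dimension, are assumptions not given above. The helpers `question_type` and `cosine` are hypothetical names.

```python
import json
import math

# Type codes as documented above.
QUESTION_TYPES = {0: "fill-in-the-blank", 1: "multi-choice"}


def question_type(question):
    """Map the numeric type of one question entry to a readable label."""
    return QUESTION_TYPES[int(question["type"])]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up stand-in for questions.json; real entries may use different value types.
questions = json.loads("""
{
  "101": {"content": "1 + 1 = ?", "kc_routes": ["Arithmetic----Addition"],
          "answer": ["2"], "options": null,
          "analysis": "Add the two numbers.", "type": 0},
  "102": {"content": "2 + 2 = ?", "kc_routes": ["Arithmetic----Addition"],
          "answer": ["B"], "options": {"A": "3", "B": "4"},
          "analysis": "Add the two numbers.", "type": 1}
}
""")

# Made-up stand-in for qid2content_emb.json (2 dimensions only, for readability).
embeddings = json.loads('{"101": [1.0, 0.0], "102": [1.0, 1.0]}')

labels = {qid: question_type(q) for qid, q in questions.items()}
sim = cosine(embeddings["101"], embeddings["102"])
```

With the real files, `json.load(open(path, encoding="utf-8"))` would replace the inline strings; the same lookup pattern would apply to "qid2analysis_emb.json" and "cid2content_emb.json", keyed by question ID and KC ID respectively.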