Hippocorpus was constructed to investigate how narrative flow differs between recounting life experiences and telling imaginative stories. We construct NIR by pruning the imaginative stories from Hippocorpus and retaining the stories about real-life events that crowdworkers wrote at two different times, as pre-retold and post-retold stories. From the story pairs in the dataset, we identify five event types: Consistent, Inconsistent, Additional, Forgotten, and Unforgotten.
Each object in the JSON file consists of event_id (i.e., the object key), pair_id, story_type, subject, predicate, object, time, event_type, and the supporting evidence for the event.
{
"59": {
"pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
"story_type": "post-retold",
"explicitness": "explicit",
"subject_token_ids": [
24,
25,
26,
27,
28,
29,
30
],
"predicate": null,
"predicate_token_ids": [
31,
32,
33
],
"object_token_ids": [
35
],
"time_token_ids": [],
"event_type": "additional",
"supports": []
},
...
}
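As a minimal sketch of how these objects can be read, the snippet below counts events per event_type. It assumes the top level of the JSON file is a single object keyed by event_id, as in the excerpt above; the helper name `count_event_types` is hypothetical and not part of the released scripts.

```python
import json
from collections import Counter


def count_event_types(events):
    """Count events per event_type in a dict keyed by event_id."""
    return Counter(event["event_type"] for event in events.values())


# A one-event sample with the same shape as the excerpt above (fields trimmed).
sample = json.loads("""
{
  "59": {
    "pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
    "story_type": "post-retold",
    "event_type": "additional",
    "supports": []
  }
}
""")

counts = count_event_types(sample)
print(counts)
```

To run it on the full dataset, replace `sample` with the result of `json.load` on the downloaded NIR.json (the path depends on where you saved it).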
Since we construct our dataset, NIR, by extending Hippocorpus, you need to download Hippocorpus first.
- Go to http://aka.ms/hippocorpus
- Log in with your Microsoft account.
- Download hippoCorpusV2.csv and save it to the parent directory (i.e., data/).
gdown "https://drive.google.com/uc?id=13F_9A8Z1jL9Eg4IwtRospfec7HQnubOC" -O ../NIR.json
gdown "https://drive.google.com/uc?id=1kaViqs9FDzArV_e8F7i7TZfeoEKkpRnc" -O ../errors.csv
We use spaCy to tokenize the stories, so you need to download the spaCy model.
python -m spacy download en_core_web_sm
Since we release only annotations that refer to the tokenized Hippocorpus stories, we provide the tokenization and preprocessing script so that your tokens match ours.
python tokenize_hippocorpus.py
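The annotation fields such as subject_token_ids and predicate_token_ids refer to positions in the tokenized story, which is why identical tokenization matters. The sketch below shows how a list of token ids could be turned back into text. It assumes the ids are zero-based indices into the story's token list, which is not confirmed here; the helper `span_text` and the example sentence are purely illustrative.

```python
def span_text(tokens, token_ids):
    """Join the tokens at the given positions into a readable span.

    Assumption: token_ids are zero-based indices into `tokens`.
    """
    return " ".join(tokens[i] for i in token_ids)


# Illustrative token list, not taken from the dataset.
tokens = ["We", "went", "to", "the", "lake", "last", "summer", "."]

predicate = span_text(tokens, [1, 2])
obj = span_text(tokens, [3, 4])
print(predicate, "|", obj)
```

An empty id list, such as the time_token_ids in the excerpt, yields an empty string.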
For convenience, we also provide a script that merges the parsed Hippocorpus with NIR.
python merge.py
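Independently of what merge.py produces, the annotation objects can be grouped by pair_id and story_type to put the events of a pre-retold and a post-retold story side by side. The following is a rough sketch under stated assumptions: the helper `group_by_pair` is hypothetical, the pair ids are placeholders, and the value "pre-retold" is assumed by analogy with the "post-retold" value shown in the excerpt.

```python
from collections import defaultdict


def group_by_pair(events):
    """Map pair_id -> story_type -> list of event_ids."""
    grouped = defaultdict(lambda: defaultdict(list))
    for event_id, event in events.items():
        grouped[event["pair_id"]][event["story_type"]].append(event_id)
    return grouped


# Placeholder events with only the fields this sketch needs.
events = {
    "1": {"pair_id": "PAIR_A", "story_type": "pre-retold", "event_type": "forgotten"},
    "2": {"pair_id": "PAIR_A", "story_type": "post-retold", "event_type": "additional"},
    "3": {"pair_id": "PAIR_B", "story_type": "post-retold", "event_type": "consistent"},
}

grouped = group_by_pair(events)
for pair_id, by_type in grouped.items():
    print(pair_id, dict(by_type))
```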