scripts

fix: package installation and data construction

Aug 21, 2022

a3a3e7d · Aug 21, 2022

Name	Last commit message	Last commit date
README.md	feat: data construction	Aug 16, 2022
build.sh	feat: data construction	Aug 16, 2022
merge.py	fix: package installation and data construction	Aug 21, 2022
tokenize_hippocorpus.py	fix: package installation and data construction	Aug 21, 2022

README.md

NIR

Introduction

Hippocorpus was constructed to investigate how narrative flow differs between relating life experiences and telling imaginative stories. We construct NIR by pruning the imaginative stories from Hippocorpus and retaining the stories about real-life events that crowdworkers wrote at two different times, as pre-retold and post-retold stories. From the story pairs in the dataset we identify five event types: Consistent, Inconsistent, Additional, Forgotten, and Unforgotten.

Format

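Each object in the JSON files consists of an event_id (the object key), pair_id, story_type, subject, predicate, object, time, event_type, and the supporting evidence for the event. The example below also carries an explicitness field and gives the subject, predicate, object, and time as lists of token IDs.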

Example

{
  "59": {
    "pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
    "story_type": "post-retold",
    "explicitness": "explicit",
    "subject_token_ids": [
      24,
      25,
      26,
      27,
      28,
      29,
      30
    ],
    "predicate": null,
    "predicate_token_ids": [
      31,
      32,
      33
    ],
    "object_token_ids": [
      35
    ],
    "time_token_ids": [],
    "event_type": "additional",
    "supports": []
  },
  ...
}
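To get a feel for the format, the file can be loaded with the standard `json` module and summarized by event type. This is only a minimal sketch: the helper names `load_events`, `count_event_types`, and `events_for_pair` are hypothetical and not part of this repository, and the path you pass in is whatever location you saved `NIR.json` to.

```python
import json
from collections import Counter


def load_events(path):
    """Read an NIR-style JSON file into a dict keyed by event_id."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_event_types(events):
    """Count how many events carry each event_type value."""
    return Counter(event["event_type"] for event in events.values())


def events_for_pair(events, pair_id):
    """Return the events (still keyed by event_id) that belong to one story pair."""
    return {eid: e for eid, e in events.items() if e["pair_id"] == pair_id}
```

For example, `count_event_types(load_events("../NIR.json"))` would give a per-type tally once the dataset has been downloaded in step 2.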

Steps

1. Download the Hippocorpus corpus

Because NIR is constructed by extending Hippocorpus, Hippocorpus must be downloaded first.

Go to http://aka.ms/hippocorpus
Log in with your Microsoft account.
Download hippoCorpusV2.csv and save it to the parent directory (i.e., data/).
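Since this download is manual, it can help to confirm that the file landed where the later steps expect it. The sketch below is an optional sanity check, not part of the repository; `corpus_path` and `read_header` are hypothetical helpers, and the check assumes you run it from the scripts directory so that `..` is the data/ directory.

```python
import csv
from pathlib import Path

CORPUS_NAME = "hippoCorpusV2.csv"  # file name given in the step above


def corpus_path(data_dir=".."):
    """Build the expected location of the corpus relative to the scripts directory."""
    return Path(data_dir) / CORPUS_NAME


def read_header(lines):
    """Return the column names from the first CSV row of an open file (or any iterable of lines)."""
    return next(csv.reader(lines))


if __name__ == "__main__":
    path = corpus_path()
    if not path.exists():
        raise SystemExit(f"{path} not found; download it from http://aka.ms/hippocorpus first")
    with open(path, newline="", encoding="utf-8") as f:
        print(read_header(f))
```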

2. Download the NIR dataset and the Hippocorpus correction file

gdown https://drive.google.com/uc?id=13F_9A8Z1jL9Eg4IwtRospfec7HQnubOC -O ../NIR.json
gdown https://drive.google.com/uc?id=1kaViqs9FDzArV_e8F7i7TZfeoEKkpRnc -O ../errors.csv
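The same two downloads can be scripted from Python with the `gdown` package, which is convenient if you want to skip files that are already present. This is a sketch under that assumption, not the repository's own tooling; `drive_url` and `download_missing` are hypothetical names, while the file IDs and output names are the ones from the commands above.

```python
from pathlib import Path

# Output file name -> Google Drive file ID, copied from the gdown commands above.
FILES = {
    "NIR.json": "13F_9A8Z1jL9Eg4IwtRospfec7HQnubOC",
    "errors.csv": "1kaViqs9FDzArV_e8F7i7TZfeoEKkpRnc",
}


def drive_url(file_id):
    """Build the download URL in the same form used by the commands above."""
    return f"https://drive.google.com/uc?id={file_id}"


def download_missing(dest_dir=".."):
    """Download each file that is not already in dest_dir; return the names fetched."""
    import gdown  # imported lazily so the helpers above work without gdown installed

    fetched = []
    for name, file_id in FILES.items():
        target = Path(dest_dir) / name
        if target.exists():
            continue  # keep an existing copy untouched
        gdown.download(drive_url(file_id), str(target), quiet=False)
        fetched.append(name)
    return fetched
```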

3. Download the spaCy model

We use spaCy to tokenize the stories, so the spaCy model must be downloaded.

python -m spacy download en_core_web_sm
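spaCy models fetched with `spacy download` are installed as regular Python packages, so one way to confirm the install worked is to ask the import system whether the package can be found. The helper name `model_installed` is hypothetical; this is a quick check, not something the repository ships.

```python
import importlib.util


def model_installed(name="en_core_web_sm"):
    """Return True if a top-level package with this name is importable."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if model_installed():
        print("en_core_web_sm is available")
    else:
        print("Model missing; run: python -m spacy download en_core_web_sm")
```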

4. Tokenize the Stories in Hippocorpus

The released annotations refer to the tokenized form of Hippocorpus, so we provide the tokenization and preprocessing script to ensure that your tokens match ours.

python tokenize_hippocorpus.py
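For orientation, the core of spaCy tokenization looks roughly like the sketch below. It is not a substitute for `tokenize_hippocorpus.py`: that script also does preprocessing whose details are not described here, so only its output is guaranteed to line up with the released token IDs. The helper names `tokenize` and `load_pipeline` are hypothetical.

```python
def tokenize(nlp, text):
    """Return the token strings that a spaCy-style pipeline produces for text."""
    return [token.text for token in nlp(text)]


def load_pipeline(name="en_core_web_sm"):
    """Load the spaCy model downloaded in step 3."""
    import spacy  # imported lazily so tokenize() can be exercised with any callable

    return spacy.load(name)


if __name__ == "__main__":
    nlp = load_pipeline()
    print(tokenize(nlp, "We went to the lake last summer."))
```

Passing the pipeline in as an argument keeps the model load in one place, so it happens once rather than once per story.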

5. Merge NIR and Hippocorpus

After tokenizing Hippocorpus, you can use the provided script to merge Hippocorpus and NIR for convenience.

python merge.py
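Conceptually, merging means turning the token ID lists in each NIR event back into text from the tokenized story. The sketch below illustrates that idea only; it assumes that token IDs are 0-based indices into the token list of the matching story, which is not confirmed by this README, and the function names `resolve_span` and `merge_event` are hypothetical. The actual output layout is whatever `merge.py` produces.

```python
ROLES = ("subject", "predicate", "object", "time")


def resolve_span(tokens, token_ids):
    """Join the tokens selected by token_ids (assumed 0-based) into one string."""
    return " ".join(tokens[i] for i in token_ids)


def merge_event(event, tokens):
    """Return a copy of event with each role filled in as text from the story tokens."""
    merged = dict(event)  # shallow copy; the input event is left unchanged
    for role in ROLES:
        ids = event.get(f"{role}_token_ids", [])
        merged[role] = resolve_span(tokens, ids)
    return merged
```

An event with an empty `time_token_ids` list simply gets an empty string for `time`.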
