A dataset for evaluation of noisy label methods
NoisyNER is a dataset for the evaluation of methods to handle noisy labels when training machine learning models. It is from the NLP/Information Extraction domain and was created through a realistic distant supervision technique. Some highlights and interesting aspects of the data are:
- Seven sets of labels with differing noise patterns to evaluate different noise levels on the same instances
- Full parallel clean labels available to compute upper performance bounds or study scenarios where a small amount of gold-standard data can be leveraged
- Skewed label distribution (typical for Named Entity Recognition tasks)
- For some label sets: noise level higher than the true label probability
- Sequential dependencies between the labels
For more details on the dataset and its creation process, please refer to our publication https://ojs.aaai.org/index.php/AAAI/article/view/16938 (published at AAAI'21).
- Clone this repository
- Download the original Estonian NER dataset from https://doi.org/10.15155/1-00-0000-0000-0000-00073L . The official dataset seems to be unavailable at the moment (temporarily, according to the authors); alternatively, you can download the file estner.cnll.zip from https://code.google.com/archive/p/patnlp/downloads . Thanks to @WinterShiver for pointing this out.
- Extract the downloaded .zip file and save the "estner.cnll" file in the "data" subdirectory
- Run
python prepare_data.py
with Python3
You will then find all the dataset files in the "data" directory.
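The extraction step above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name extract_estner and its parameters are hypothetical, and since the internal folder layout of the archive is not documented here, the file is located by its base name.

```python
import zipfile
from pathlib import Path


def extract_estner(zip_path, data_dir="data", target_name="estner.cnll"):
    """Copy the archive member named `target_name` into `data_dir`.

    Hypothetical helper: the member is searched for by base name because
    the folder layout inside the archive is an unknown here.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        matches = [name for name in archive.namelist()
                   if Path(name).name == target_name]
        if not matches:
            raise FileNotFoundError(f"{target_name} not found in {zip_path}")
        destination = data_dir / target_name
        # Write the raw bytes so the file content is left unchanged.
        destination.write_bytes(archive.read(matches[0]))
    return destination
```

Calling extract_estner("estner.cnll.zip") from the repository root should place the file where prepare_data.py is described as expecting it, assuming the downloaded archive carries that name.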
For each of the 7 noisy label sets, we provide the full dataset (with the file ending *_all.tsv) as well as an 80/10/10 train/dev/test split (with the file ending *_train.tsv/*_dev.tsv/*_test.tsv). The splits for the original, clean dataset are estner_clean_{train,dev,test}.tsv.
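Because the noisy label sets are parallel to the clean labels, the noise level of a label set can be estimated by comparing the two label sequences token by token. The sketch below is an illustration under assumptions, not the method from the publication: the function names are hypothetical, the toy labels are invented and their tag names (B-PER, B-LOC, O) are placeholders, and it assumes the two sequences are aligned one-to-one.

```python
from collections import Counter


def noise_rate(clean_labels, noisy_labels):
    """Fraction of tokens whose noisy label differs from the clean label."""
    if len(clean_labels) != len(noisy_labels):
        raise ValueError("label sequences must be parallel (same length)")
    if not clean_labels:
        return 0.0
    mismatches = sum(c != n for c, n in zip(clean_labels, noisy_labels))
    return mismatches / len(clean_labels)


def noise_transitions(clean_labels, noisy_labels):
    """Estimate p(noisy label | clean label) from co-occurrence counts."""
    if len(clean_labels) != len(noisy_labels):
        raise ValueError("label sequences must be parallel (same length)")
    pair_counts = Counter(zip(clean_labels, noisy_labels))
    clean_counts = Counter(clean_labels)
    return {(c, n): count / clean_counts[c]
            for (c, n), count in pair_counts.items()}


# Toy labels for illustration only; they are not taken from the dataset.
clean = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]
noisy = ["O", "B-PER", "O", "O", "O", "O"]
print(noise_rate(clean, noisy))
print(noise_transitions(clean, noisy))
```

In the toy example, every B-LOC token is relabeled as O, which is the kind of situation where the noise for a label can exceed the probability of keeping the true label.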
All files are TSV files with the same structure, which follows the CoNLL standard for NER datasets. Each line corresponds to one word or token. The first column gives the token itself and the last column gives its label. The two middle columns give additional grammatical features, which older NLP methods leveraged but modern neural methods often ignore.
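The column layout described above can be read with a few lines of Python. This is a minimal sketch under two assumptions that are not confirmed here: columns are separated by tabs, and sentences are separated by blank lines, as is common in CoNLL-style files. The function name read_conll_tsv and the sample lines (including the placeholder feature values) are made up for illustration.

```python
def read_conll_tsv(lines):
    """Group token lines into sentences of (token, label) pairs.

    Assumes tab-separated columns and a blank line between sentences.
    Only the first column (token) and last column (label) are kept.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Invented sample lines; the middle columns are placeholders.
sample = [
    "Tallinn\tfeat1\tfeat2\tB-LOC\n",
    "on\tfeat1\tfeat2\tO\n",
    "\n",
    "Tere\tfeat1\tfeat2\tO\n",
]
print(read_conll_tsv(sample))
```

The function accepts any iterable of lines, so an open file handle (for example for data/estner_clean_train.tsv, opened with encoding="utf-8") can be passed directly.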
For more details, please refer to our publication. If you have any questions or if you run into any issues, feel free to contact us.
When you work with this dataset, please consider citing us as
Hedderich, Zhu and Klakow.
Analysing the Noise Model Error for Realistic Noisy Label Data.
AAAI 2021
@inproceedings{hedderich2021analysing,
title={Analysing the Noise Model Error for Realistic Noisy Label Data},
author={Hedderich, Michael A and Zhu, Dawei and Klakow, Dietrich},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={35},
number={9},
pages={7675--7684},
year={2021}
}
This noisy label dataset is based on an existing NER dataset for Estonian. Please cite this work as well.
The original dataset and the clean labels are from
Laur, S. (2013).
Nimeüksuste korpus. Center of Estonian Language Resources.
@inproceedings{tkachenko-etal-2013-named,
title = "Named Entity Recognition in {E}stonian",
author = "Tkachenko, Alexander and Petmanson, Timo and Laur, Sven",
booktitle = "Proceedings of the 4th Biennial International Workshop on {B}alto-{S}lavic Natural Language Processing",
year = "2013",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W13-2412",
}
The original dataset is licensed under CC-BY-NC. We provide our noisy labels under CC-BY 4.0.