$\color{red}{CoR}\color{yellow}{oS}\color{blue}{eOf}$ : An annotated Corpus of Romanian Sexist and Offensive Language
A collection of approximately 40k Romanian tweets annotated for sexist and offensive language, of which ≈10% are sexist and ≈11% are offensive.
This project is organized into the following folders:
- corpus (contains tweet id, sampling technique, annotator id and gender, non-aggregated annotations, and majority vote labels)
- docs (annotation guidelines and keywords used to query the data)
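The relationship between the non-aggregated annotations and the majority vote labels can be sketched as below. This is only an illustration under stated assumptions: the row layout, the label names, and the tie handling are hypothetical and are not taken from the corpus files, so check the actual files in the corpus folder before adapting it.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (tweet_id, annotator_id, label).
# The real column names, label names, and file layout may differ.
rows = [
    ("1", "a1", "sexist"),
    ("1", "a2", "sexist"),
    ("1", "a3", "non-offensive"),
    ("2", "a1", "offensive"),
    ("2", "a2", "non-offensive"),
    ("2", "a3", "offensive"),
]


def majority_vote(rows):
    """Return {tweet_id: most frequent label}, or None when the top labels tie.

    Returning None on a tie is an assumption made for this sketch; it is not
    necessarily how ties were resolved in the released corpus.
    """
    by_tweet = defaultdict(list)
    for tweet_id, _annotator_id, label in rows:
        by_tweet[tweet_id].append(label)

    result = {}
    for tweet_id, labels in by_tweet.items():
        ranked = Counter(labels).most_common()
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        result[tweet_id] = None if tied else ranked[0][0]
    return result


labels = majority_vote(rows)
```

Comparing such recomputed labels with the majority vote labels shipped in the corpus is a quick sanity check after loading the data.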
Contributors' names and contact info:
Diana Constantina Höfels: [email protected]
Dr. Irina Diana Mădroane: [email protected]
The corpus can be used under the terms of CC-BY-SA.
Accepted at LREC2022
CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets
Please cite the following paper when using or referring to this work:
@InProceedings{hoefels-ltekin-mdroane:2022:LREC,
author = {Hoefels, Diana Constantina and Çöltekin, Çağrı and Mădroane, Irina Diana},
title = {CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {2269--2281},
abstract = {This paper introduces CoRoSeOf, a large corpus of Romanian social media manually annotated for sexist and offensive language. We describe the annotation process of the corpus, provide initial analyses, and baseline classification results for sexism detection on this data set. The resulting corpus contains 39 245 tweets, annotated by multiple annotators (with an agreement rate of Fleiss’κ= 0.45), following the sexist label set of a recent study. The automatic sexism detection yields scores similar to some of the earlier studies (macro averaged F1 score of 83.07\% on binary classification task). We release the corpus with a permissive license.},
url = {https://aclanthology.org/2022.lrec-1.243}
}
The annotation team (in alphabetical order): Anamaria Andrei, Raluca Ardeaun, Edward Bojboi, Octavia Cojocaru, Cristiana Giurcă, Costel Olaru, Roberta Recalo, Diana Stanciu, Tiberiu Tomescu and Carmen Tuns, from the Interdisciplinary Center of Gender Studies, West University of Timișoara.
This study used Twitter data sets. The content provided remains subject to Twitter's Developer Agreement & Policy, and users of the corpus must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy.