CoRoSeOf: An annotated Corpus of Romanian Sexist and Offensive Language


A collection of approximately 40k Romanian samples annotated for sexist and offensive language, of which ≈10% are sexist and ≈11% are offensive.

New: more research on sexism employing CoRoSeOf will be announced soon.

Folder Structure

This project is organized into the following folders:

  • corpus (contains tweet id, sampling technique, annotator id and gender, non-aggregated annotations, and majority vote labels)
  • docs (annotation guidelines and keywords used to query the data)
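
As an illustration of how the non-aggregated annotations relate to the majority vote labels, here is a minimal sketch in Python. The tuple layout, the label strings, and the tie handling are assumptions made for this example only; they are not taken from the corpus files, so check the corpus folder and the annotation guidelines for the actual field names and label set.

```python
from collections import Counter, defaultdict

# Hypothetical non-aggregated annotations: (tweet_id, annotator_id, label).
# Field order and label strings are placeholders, not the corpus schema.
annotations = [
    ("t1", "a1", "sexist"),
    ("t1", "a2", "sexist"),
    ("t1", "a3", "neither"),
    ("t2", "a1", "offensive"),
    ("t2", "a2", "neither"),
    ("t2", "a3", "offensive"),
    ("t3", "a1", "sexist"),
    ("t3", "a2", "offensive"),
]


def majority_vote(rows, tie_label=None):
    """Return {tweet_id: most frequent label}; tied tweets get tie_label."""
    by_tweet = defaultdict(list)
    for tweet_id, _annotator_id, label in rows:
        by_tweet[tweet_id].append(label)

    result = {}
    for tweet_id, labels in by_tweet.items():
        ranked = Counter(labels).most_common()
        # A tie exists when the two most frequent labels have equal counts.
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        result[tweet_id] = tie_label if tied else ranked[0][0]
    return result


labels = majority_vote(annotations)
for tweet_id, label in labels.items():
    print(tweet_id, label)
```

Returning a separate value for ties (here `None`) keeps undecided tweets visible instead of silently picking one label. How ties were resolved in the released majority vote labels is not stated here, so treat this choice as a placeholder.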

Authors

Contributors' names and contact information:

Diana Constantina Höfels: [email protected]

Dr. Çağrı Çöltekin

Dr. Irina Diana Mădroane: [email protected]

License

The corpus can be used under the terms of CC-BY-SA.

Journal Paper

Accepted at LREC2022

CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets

If you use or mention this work, please cite it as follows:


@InProceedings{hoefels-ltekin-mdroane:2022:LREC,
  author    = {Hoefels, Diana Constantina and Çöltekin, Çağrı and Mădroane, Irina Diana},
  title     = {CoRoSeOf - An Annotated Corpus of Romanian Sexist and Offensive Tweets},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {2269--2281},
  abstract  = {This paper introduces CoRoSeOf, a large corpus of Romanian social media manually annotated for sexist and offensive language. We describe the annotation process of the corpus, provide initial analyses, and baseline classification results for sexism detection on this data set. The resulting corpus contains 39 245 tweets, annotated by multiple annotators (with an agreement rate of Fleiss’κ= 0.45), following the sexist label set of a recent study. The automatic sexism detection yields scores similar to some of the earlier studies (macro averaged F1 score of 83.07\% on binary classification task). We release the corpus with a permissive license.},
  url       = {https://aclanthology.org/2022.lrec-1.243}
}

Acknowledgements

The annotation team (in alphabetical order): Anamaria Andrei, Raluca Ardeaun, Edward Bojboi, Octavia Cojocaru, Cristiana Giurcă, Costel Olaru, Roberta Recalo, Diana Stanciu, Tiberiu Tomescu and Carmen Tuns, from the Interdisciplinary Center of Gender Studies, West University of Timișoara.

This study used Twitter data sets. The content provided remains subject to Twitter's Developer Agreement & Policy, and anyone using the corpus must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy.
