ctomoiaga/ro-datasets

🇷🇴 RO LLMs Datasets and Tools

🤗 Hugging Face

A list of datasets, tools, and procedures to help anyone training LLMs and other types of models for the Romanian language.


Links

| Links | Comments | Description |
| --- | --- | --- |
| OpenLLM-RO / GitHub / arXiv | Most datasets are translated, which leads to lower-quality models | Romanian community that builds open Romanian models and tries to collect these models in a single place. |
| FulG / arXiv | Used CCNet for processing | CommonCrawl filtered and processed for the Romanian language |
| RoMedQa | Needs verification | RoQLlama: A Lightweight Romanian Adapted Language Model |
| FineWeb / Blog | Good documentation | CommonCrawl filtering procedure and dataset for English; it can be adapted for Romanian by using datatrove |
| OSCAR | Good CommonCrawl processing | CommonCrawl filtered and processed dataset available |
| Readerbench | Collection of Romanian datasets and models | Training classifiers and using already trained ones |
| CommonCrawl Statistics | | Statistics of CommonCrawl |
| OPUS | | Open Parallel Corpora |

Datasets

Some datasets are very large and unprocessed (they may contain multiple languages). Detecting Romanian text is usually easy. Most, if not all, datasets are not instruct-ready; additional steps are needed to add instructions.
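
As a rough illustration of why Romanian is easy to spot in mixed-language data, here is a minimal stdlib-only sketch that scores a text by the share of common Romanian function words. The word list, the threshold, and the names `normalize`, `romanian_score`, and `looks_romanian` are assumptions made for this example; a real pipeline would more likely rely on a trained language identifier.

```python
import re
import unicodedata

# Map legacy cedilla letters (s/t with cedilla) to the comma-below forms
# used in standard Romanian, so both spellings of a word compare equal.
_CEDILLA_TO_COMMA = str.maketrans(
    "\u015f\u0163\u015e\u0162", "\u0219\u021b\u0218\u021a"
)


def normalize(text: str) -> str:
    """Compose combining marks, unify s/t diacritics, and lowercase."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(_CEDILLA_TO_COMMA).lower()


# Hypothetical, hand-picked list of frequent Romanian function words.
# It is an assumption for this sketch, not an exhaustive stopword list.
RO_FUNCTION_WORDS = {normalize(w) for w in (
    "și", "în", "sunt", "pentru", "că", "sau", "prin", "această", "acest",
    "fost", "către", "după", "între", "fără", "unui", "unei",
)}


def romanian_score(text: str) -> float:
    """Return the fraction of words that are in RO_FUNCTION_WORDS."""
    # [^\W\d_]+ matches runs of Unicode letters only.
    words = re.findall(r"[^\W\d_]+", normalize(text))
    if not words:
        return 0.0
    return sum(w in RO_FUNCTION_WORDS for w in words) / len(words)


def looks_romanian(text: str, threshold: float = 0.1) -> bool:
    """Heuristic check; the default threshold is an untuned guess."""
    return romanian_score(text) >= threshold
```

A heuristic like this is only meant for a quick first pass over a sample; short texts and closely related languages can easily fool it.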

| Dataset | Notes |
| --- | --- |
| OpenLLM-RO | Most datasets are translations. Check the datasets and trained models; they feel somewhat lower quality due to poor translations. |
| RoMedQa | A dataset of single-choice questions regarding the medical field in the Romanian language. It consists of advanced biology questions used in entrance examinations in medical schools in Romania. Each question has five possible answer choices, numbered from 1 to 5, with only one correct answer. |
| mOSCAR | Multilingual OSCAR dataset, needs further processing. See additional datasets on their HF space. |
| faur-ai/fulg | CommonCrawl filtered and processed for Romanian. Quality suffers due to the extraction methods used :( |
| mC4 RO | CommonCrawl, the C4 version. Docs |
| PleIAs/common_corpus | Common Corpus is the largest open and permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. |
| readerbench/ro-human-machine-60k | Needs further processing for instruct fine-tuning an LLM |
| CC-100 | Attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages. |
| Wikipedia RO | Wikipedia, the Romanian part |
| cosmadrian/romath | I have not had time to check its value. A Mathematical Reasoning Benchmarking Suite from Descriptions in 🇷🇴 Romanian 🇷🇴 |
| cosmadrian/rocode | Small, but useful. RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian |
| BlackKakapo | Multiple datasets, including instruct ones. Ex: qaworld-ro |
| RoITD | Romanian IT Dataset (RoITD) resembling SQuAD 1.1. RoITD consists of 9575 Romanian QA pairs formulated by crowd workers. QA pairs are based on 5043 Romanian Wikipedia articles describing IT and household products. |
| 30K-Romanian-Captions | This dataset is a Romanian translation of the Flickr 30k captions dataset |
| Ro-STS & Others | Some datasets are translated, including RO-STS. RO-STS is the Semantic Textual Similarity dataset for the Romanian language. |
| Romanian Emotion Dataset | The second version of the Romanian Emotions Dataset (RED), containing 5449 tweets annotated in a multi-label fashion with the following 7 emotions: Anger (Furie), Fear (Frică), Joy (Bucurie), Sadness (Tristețe), Surprise (Surpriză), Trust (Încredere) and Neutral (Neutru). |
| lavi13/wiki_qa_instructions_ro | A randomly selected set of ~10k entries from the Wikipedia dataset, with Mixtral 8x7B used to extract Q&A pairs |
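
Since most of the datasets above are not instruct-ready, the extra step of adding instructions can be sketched as follows for data that already comes as question/answer pairs. The `instruction` / `input` / `output` field names follow a commonly used instruct layout and are an assumption here, not a format prescribed by any dataset in the table; `to_instruct_record` and `to_jsonl` are hypothetical helper names, and the sample pair is made up for illustration.

```python
import json
from typing import Iterable, Tuple


def to_instruct_record(question: str, answer: str, context: str = "") -> dict:
    """Wrap one Q&A pair in a generic instruction/input/output record."""
    return {
        "instruction": question.strip(),
        "input": context.strip(),  # optional supporting passage, may be empty
        "output": answer.strip(),
    }


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (question, answer) pairs as JSON Lines, one record per line."""
    # ensure_ascii=False keeps Romanian diacritics readable in the output.
    return "\n".join(
        json.dumps(to_instruct_record(q, a), ensure_ascii=False)
        for q, a in pairs
    )


if __name__ == "__main__":
    # Illustrative pair only, not taken from any of the listed datasets.
    print(to_jsonl([("Care este capitala României?", "București")]))
```

Plain text corpora such as the CommonCrawl-derived ones need a different approach, because they contain no question/answer structure to wrap.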

Tools

| Link | Comments |
| --- | --- |
| Trafilatura | Good for extracting text from the web |
| Datatrove | Datatrove and examples with the FineWeb pipeline; can be adapted for Romanian |
| LLM Inference Benchmark | Really useful benchmark of LLM inference on multiple hardware platforms, including Macs |
