🇷🇴 RO LLMs Datasets and Tools

🤗 Hugging Face

A list of datasets, tools, and procedures for anyone training LLMs and other types of models for the Romanian language.

Links

Links	Comments	Description
OpenLLM-RO / GitHub / arXiv	Most datasets are translated, which leads to lower-quality models	Romanian community that builds open Romanian models and tries to collect them in a single place.
FulG / arXiv	Used CCNet for processing	CommonCrawl filtered and processed for the Romanian language
RoMedQa	Needs verification	RoQLlama: A Lightweight Romanian Adapted Language Model
FineWeb / Blog	Good documentation	CommonCrawl filtering procedure and dataset for English; it can be adapted for Romanian by using datatrove
OSCAR	Good CommonCrawl processing	CommonCrawl filtered and processed dataset available
Readerbench	Collection of Romanian datasets and models	Training classifiers and using already trained ones
CommonCrawl Statistics		Statistics of CommonCrawl
OPUS		Open Parallel Corpora

Datasets

Some datasets are very large and unprocessed (they may contain multiple languages). Detecting Romanian text is usually easy. Most, if not all, datasets are not instruct-ready; additional steps are needed to add instructions.
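As a rough illustration of why detecting Romanian is usually easy, here is a minimal sketch that scores a text by the share of very common Romanian function words. The word list, the threshold, and the function names are assumptions made for this example, not part of any dataset or tool listed here; real pipelines normally rely on a trained language identifier instead.

```python
import re
import unicodedata

# Hypothetical, hand-picked set of frequent Romanian function words.
RO_STOPWORDS = {
    "și", "în", "este", "pentru", "care", "să", "sunt", "din", "că",
    "mai", "sau", "dar", "nu", "cu", "pe", "ale", "lui", "prin",
}


def romanian_score(text):
    """Return the fraction of words in `text` that are in RO_STOPWORDS."""
    # Normalize to NFC so that letters with diacritics are single code points.
    normalized = unicodedata.normalize("NFC", text).lower()
    # Match runs of letters only (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", normalized)
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in RO_STOPWORDS)
    return hits / len(words)


def looks_romanian(text, threshold=0.15):
    """Heuristic check; `threshold` is an assumed value, tune it on real data."""
    return romanian_score(text) >= threshold
```

A heuristic like this is only useful as a cheap first pass on long documents; short or mixed-language texts need a proper classifier.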

Dataset	Notes
OpenLLM-RO	Most datasets are translations. Check the datasets and trained models; they feel somewhat lower quality because of poor translations
RoMedQa	A dataset of single-choice questions regarding the medical field in the Romanian language. It consists of advanced biology questions used in entrance examinations in medical schools in Romania. Each question has five possible answer choices, numbered from 1 to 5, with only one correct answer.
mOSCAR	Multilingual OSCAR dataset, needs further processing. See additional datasets on their HF space.
faur-ai/fulg	CommonCrawl filtered and processed for Romanian. Quality suffers due to extraction methods used :(
mC4 RO	The C4 version of CommonCrawl. Docs
PleIAs/common_corpus	Common Corpus is the largest open and permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more.
readerbench/ro-human-machine-60k	Needs further processing for instruct fine-tuning an LLM
CC-100	Attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages
Wikipedia RO	Wikipedia, the Romanian part
cosmadrian/romath	I have not had time to check its value. A Mathematical Reasoning Benchmarking Suite from Descriptions in 🇷🇴 Romanian 🇷🇴
cosmadrian/rocode	Small, but useful. RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian
BlackKakapo	Multiple datasets, including instruct ones, e.g. qaworld-ro
RoITD	Romanian IT Dataset (RoITD) resembling SQuAD 1.1. RoITD consists of 9575 Romanian QA pairs formulated by crowd workers. QA pairs are based on 5043 Romanian Wikipedia articles describing IT and household products.
30K-Romanian-Captions	This dataset is a Romanian translation of the Flickr 30k captions dataset
Ro-STS & Others	Some datasets are translated, including RO-STS. RO-STS - the Semantic Textual Similarity dataset for the Romanian language
Romanian Emotion Dataset	The second version of the Romanian Emotions Dataset (RED) containing 5449 tweets annotated in a multi-label fashion with the following 7 emotions: Anger (Furie), Fear (Frică), Joy (Bucurie), Sadness (Tristețe), Surprise (Surpriză), Trust (Încredere) and Neutral (Neutru).
lavi13/wiki_qa_instructions_ro	Q&A pairs extracted with Mixtral 8x7B from a randomly selected set of ~10k entries from the Wikipedia dataset
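Since most of the datasets above are not instruct-ready, the following minimal sketch shows one possible way to turn a single-choice question (five numbered answers, one correct, as in the RoMedQa description) into an instruction/response pair. The function name, the output keys, the Romanian prompt wording, and the sample question are all assumptions made up for this illustration; they do not reflect the actual schema of any listed dataset.

```python
import json


def to_instruct_example(question, choices, correct_index):
    """Build one instruction/response pair from a single-choice question.

    `choices` is a list of answer strings and `correct_index` is 1-based,
    matching answers numbered from 1 to 5.
    """
    if not 1 <= correct_index <= len(choices):
        raise ValueError("correct_index must point at one of the choices")
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(choices, start=1))
    instruction = (
        "Răspunde la următoarea întrebare alegând varianta corectă.\n\n"
        f"{question}\n{numbered}"
    )
    response = f"{correct_index}. {choices[correct_index - 1]}"
    return {"instruction": instruction, "response": response}


# Made-up sample record, used only to show the resulting structure.
example = to_instruct_example(
    question="Care organ pompează sângele în corp?",
    choices=["Ficatul", "Inima", "Plămânul", "Rinichiul", "Splina"],
    correct_index=2,
)
print(json.dumps(example, ensure_ascii=False, indent=2))
```

Writing one such JSON object per line gives a JSONL file, a format many fine-tuning scripts accept; the exact keys depend on the training framework used.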

Tools

Link	Comments
Trafilatura	Good for extracting text from web
Datatrove	Datatrove and examples with the FineWeb pipeline, can be adapted for Romanian
LLM Inference Benchmark	Really useful benchmark on LLM inference with multiple hardware, including Macs
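As a small illustration of the text-extraction step, the sketch below wraps Trafilatura's `fetch_url` and `extract` functions. It assumes the third-party `trafilatura` package is installed; the wrapper name `fetch_main_text` is hypothetical and exists only for this example.

```python
def fetch_main_text(url):
    """Download a page and return its main text, or None if that fails."""
    # Imported here so the module loads even where trafilatura is missing.
    import trafilatura  # third-party: pip install trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download failed
    if downloaded is None:
        return None
    # extract() returns the main content as a string, or None if nothing was found.
    return trafilatura.extract(downloaded)
```

Calling `fetch_main_text` with the URL of a Romanian news article or blog post should return the article body without navigation menus and other page chrome, which can then be passed to language filtering.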
