🤗 Hugging Face
List of datasets, tools and procedures to help anyone training LLMs and other types of models related to the Romanian language
Links | Comments | Description |
---|---|---|
OpenLLM-RO / GitHub / arXiv | Most datasets are translated, which leads to lower-quality models | Romanian community that builds open Romanian models and tries to collect them in a single place. |
FulG / arXiv | Used CCNet for processing | CommonCrawl filtered and processed for Romanian language |
RoMedQa | Needs verification | RoQLlama: A Lightweight Romanian Adapted Language Model |
FineWeb / Blog | Good documentation | CommonCrawl filtering procedure and dataset for English; it can be adapted for Romanian using datatrove |
OSCAR | Good CommonCrawl processing | CommonCrawl filtered and processed dataset available |
Readerbench | Collection of Romanian datasets and models | Training classifiers and using already trained ones |
CommonCrawl Statistics | Statistics of CommonCrawl | |
OPUS | Open Parallel Corpora | |
Some datasets are very large and unprocessed (they may contain multiple languages). Detecting Romanian is usually easy. Most, if not all, datasets are not instruct-ready; additional steps are needed to add instructions.
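As a rough illustration of why picking out Romanian text is usually easy, here is a minimal sketch that scores a text by the share of words that are common Romanian function words or contain Romanian diacritics. The word list, the threshold and the function names are assumptions made for this example. For real pipelines a trained language-identification model is the safer choice, since text written without diacritics or mixed with French (which also uses `â` and `î`) can fool a heuristic like this one.

```python
import re
import unicodedata

# Hypothetical, hand-picked list of frequent Romanian function words.
RO_STOPWORDS = {
    "și", "în", "este", "pentru", "că", "sunt", "sau", "fost",
    "această", "acest", "prin", "după", "foarte", "între", "fără", "până",
}
# Romanian diacritics, including the legacy cedilla variants (ş, ţ).
RO_DIACRITICS = set("ăâîșțşţ")


def romanian_score(text: str) -> float:
    """Return the fraction of words that look Romanian."""
    # Normalize so that composed and decomposed diacritics compare equal.
    text = unicodedata.normalize("NFC", text).lower()
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in RO_STOPWORDS or RO_DIACRITICS & set(w))
    return hits / len(words)


def looks_romanian(text: str, threshold: float = 0.2) -> bool:
    """Heuristic check; the 0.2 threshold is an untuned assumption."""
    return romanian_score(text) >= threshold
```

A filter like this can be run per document or per paragraph before more expensive processing; the threshold should be tuned on a labeled sample of the actual data.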
Dataset | Notes |
---|---|
OpenLLM-RO | Most datasets are translations. Check the datasets and the trained models; they feel somewhat lower quality due to poor translations |
RoMedQa | A dataset of single-choice questions regarding the medical field in the Romanian language. It consists of advanced biology questions used in entrance examinations in medical schools in Romania. Each question has five possible answer choices, numbered from 1 to 5, with only one correct answer. |
mOSCAR | Multilingual OSCAR dataset, needs further processing. See additional datasets on their HF space. |
faur-ai/fulg | CommonCrawl filtered and processed for Romanian. Quality suffers due to extraction methods used :( |
mC4 RO | CommonCrawl, the C4 version. Docs |
PleIAs/common_corpus | Common Corpus is the largest open and permissively licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens). It is a diverse dataset, consisting of books, newspapers, scientific articles, government and legal documents, code, and more. |
readerbench/ro-human-machine-60k | Needs further processing for instruct fine-tuning an LLM |
CC-100 | Attempt to recreate the dataset used for training XLM-R. This corpus comprises monolingual data for 100+ languages |
Wikipedia RO | Wikipedia, the Romanian part |
cosmadrian/romath | I have not had time to check its value. A Mathematical Reasoning Benchmarking Suite from Descriptions in 🇷🇴 Romanian 🇷🇴 |
cosmadrian/rocode | Small, but useful. RoCode: A Dataset for Measuring Code Intelligence from Problem Definitions in Romanian |
BlackKakapo | Multiple datasets, including instruct ones, e.g. qaworld-ro |
RoITD | Romanian IT Dataset (RoITD) resembling SQuAD 1.1. RoITD consists of 9575 Romanian QA pairs formulated by crowd workers. QA pairs are based on 5043 Romanian Wikipedia articles describing IT and household products. |
30K-Romanian-Captions | This dataset is a Romanian translation of the Flickr30k captions dataset |
Ro-STS & Others | Some datasets are translated, including RO-STS. RO-STS - the Semantic Textual Similarity dataset for the Romanian language |
Romanian Emotion Dataset | The second version of the Romanian Emotions Dataset (RED) containing 5449 tweets annotated in a multi-label fashion with the following 7 emotions: Anger (Furie), Fear (Frică), Joy (Bucurie), Sadness (Tristețe), Surprise (Surpriză), Trust (Încredere) and Neutral (Neutru). |
lavi13/wiki_qa_instructions_ro | A randomly selected set of ~10k entries from the Wikipedia dataset, with Mixtral 8x7B used to extract Q&A pairs |
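Since most of the datasets above are not instruct-ready, here is a minimal sketch of the kind of extra step involved, using a single-choice question with numbered answers (the shape described for RoMedQa) as the example. The field names `question`, `choices` and `answer`, the prompt wording and the sample record are all assumptions made for illustration; check the real dataset schema before adapting it.

```python
def to_instruct(record: dict) -> dict:
    """Turn one single-choice record into an instruction/response pair.

    Assumes hypothetical fields: "question" (str), "choices" (list of str)
    and "answer" (1-based index of the correct choice).
    """
    choices = record["choices"]
    answer = int(record["answer"])
    if not 1 <= answer <= len(choices):
        raise ValueError(f"answer {answer} out of range for {len(choices)} choices")
    # Number the options starting from 1, matching the dataset description.
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(choices, start=1))
    instruction = (
        "Răspunde la următoarea întrebare alegând o singură variantă corectă.\n\n"
        f"{record['question']}\n{options}"
    )
    response = f"{answer}. {choices[answer - 1]}"
    return {"instruction": instruction, "response": response}


# Made-up sample record, only to show the expected input shape.
sample = {
    "question": "Care organ pompează sângele în corp?",
    "choices": ["Ficatul", "Inima", "Plămânul", "Rinichiul", "Stomacul"],
    "answer": 2,
}
pair = to_instruct(sample)
```

The resulting `instruction` and `response` strings can then be mapped onto whatever chat template the target model expects.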
Link | Comments |
---|---|
Trafilatura | Good for extracting text from web pages |
Datatrove | Datatrove and examples with the FineWeb pipeline; can be adapted for Romanian |
LLM Inference Benchmark | Really useful benchmark of LLM inference on multiple types of hardware, including Macs |