We currently support 59+ commonly used datasets for evaluating LLMs.
Each dataset is either a multiple-choice dataset or a generation dataset. You can find the difference between them here.
Some datasets have multiple subsets. For example, the Massive Multitask Language Understanding (`mmlu`) dataset contains 57 different subsets categorized into four categories: `stem`, `social_sciences`, `humanities`, and `other`.
Some datasets are themselves subsets of other datasets. For example, Choice Of Plausible Alternatives (`copa`) is a subset of `super_glue`.
See how to load datasets with subsets.
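As a rough sketch of the underlying mechanics, subsets correspond to dataset configs on the Hugging Face Hub and can be selected with the `datasets` library directly (this framework's own loader may expose a different interface, so treat the repository names below as illustrative):

```python
from datasets import load_dataset

# `copa` ships as a subset (config) of `super_glue` on the Hugging Face Hub.
# Recent versions of `datasets` may additionally require trust_remote_code=True
# for script-based datasets such as super_glue.
copa = load_dataset("super_glue", "copa", split="validation")

# Subsets of mmlu are selected the same way, e.g. its astronomy subset.
mmlu_astronomy = load_dataset("cais/mmlu", "astronomy", split="test")

print(copa[0]["premise"])
print(mmlu_astronomy[0]["question"])
```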
Some datasets support Chain-of-Thought reasoning. For example, Grade School Math 8K (`gsm8k`) supports three types of CoT: `base`, `least_to_most`, and `pal`.
- 🔥 Recently supported datasets: `imbue_code`, `imbue_public`, and `imbue_private`.
- The evaluation code can be found here: benchmarking llama3.
| Dataset | Subsets / Collections | Evaluation Type | CoT | Notes |
| --- | --- | --- | --- | --- |
| AGIEval (`agieval`, alias of `agieval_single_choice` and `agieval_cot`) | English: `sat-en`, `sat-math`, `lsat-ar`, `lsat-lr`, `lsat-rc`, `logiqa-en`, `aqua-rat`, `sat-en-without-passage` | MultipleChoice | | |
| | Chinese: `gaokao-chinese`, `gaokao-geography`, `gaokao-history`, `gaokao-biology`, `gaokao-chemistry`, `gaokao-english`, `logiqa-zh` | | | |
| | `jec-qa-kd`, `jec-qa-ca`, `math`, `gaokao-physics`, `gaokao-mathcloze`, `gaokao-mathqa` | Generation | ✅ | |
| Alpaca Eval (`alpaca_eval`) | / | Generation | | Single GPTEval |
| Adversarial Natural Language Inference (`anli`) | `Round2` (default) | MultipleChoice | | |
| AI2's Reasoning Challenge (`arc`) | `ARC-Easy`, `ARC-Challenge` | MultipleChoice | | Normalization |
| BIG-Bench Hard (`bbh`) | `boolean_expressions`, ... | Generation | ✅ | |
| Boolean Questions (`boolq`) | `super_glue` | MultipleChoice | | |
| CommitmentBank (`cb`) | `super_glue` | MultipleChoice | | |
| C-Eval (`ceval`) | stem: `advanced_mathematics`, `college_chemistry`, ... | MultipleChoice | | |
| | social science: `business_administration`, `college_economics`, ... | | | |
| | humanities: `art_studies`, `chinese_language_and_literature`, ... | | | |
| | other: `accountant`, `basic_medicine`, ... | | | |
| Massive Multitask Language Understanding in Chinese (`cmmlu`) | stem: `anatomy`, `astronomy`, ... | MultipleChoice | | |
| | social science: `ancient_chinese`, `business_ethics`, ... | | | |
| | humanities: `arts`, `chinese_history`, ... | | | |
| | other: `agronomy`, `chinese_driving_rule`, ... | | | |
| CNN Dailymail (`cnn_dailymail`) | `3.0.0` (default), ... | Generation | | |
| Reasoning About Colored Objects (`color_objects`) | `bigbench` (`reasoning_about_colored_objects`) | Generation | | |
| Commonsense QA (`commonsenseqa`) | / | MultipleChoice | | |
| Choice Of Plausible Alternatives (`copa`) | `super_glue` | MultipleChoice | | |
| Conversational Question Answering (`coqa`) | / | Generation | | Download: train, dev |
| CrowS-Pairs (`crows_pairs`) | / | MultipleChoice | | |
| Discrete Reasoning Over the content of Paragraphs (`drop`) | / | Generation | | |
| GAOKAO (`gaokao`) | Chinese: `2010-2022_Chinese_Modern_Lit`, `2010-2022_Chinese_Lang_and_Usage_MCQs` | Generation | | Metric: Exam scoring |
| | English: `2010-2022_English_Reading_Comp`, `2010-2022_English_Fill_in_Blanks`, ... | | | |
| | `2010-2022_Math_II_MCQs`, `2010-2022_Math_I_MCQs`, ... | | | |
| Google-Proof Q&A (`GPQA`) | `gpqa_main` (default), `gpqa_extended`, ... | MultipleChoice | ✅ | Tutorial |
| Grade School Math 8K (`gsm8k`) | `main` (default), `socratic` | Generation | ✅ | Code exec |
| HaluEval (`halueval`) | `dialogue_samples`, `qa_samples`, `summarization_samples` | Generation | | |
| HellaSWAG (`hellaswag`) | / | MultipleChoice | | |
| HumanEval (`humaneval`) | / | Generation | | Pass@K |
| Instruction-Following Evaluation (`ifeval`) | / | Generation | | |
| Imbue Code Comprehension (`imbue_code`) | / | MultipleChoice | | |
| Imbue High Quality Private Evaluations (`imbue_private`) | / | MultipleChoice | | |
| Imbue High Quality Public Evaluations (`imbue_public`) | / | MultipleChoice | | |
| LAnguage Modeling Broadened to Account for Discourse Aspects (`lambada`) | `default` (default), `de`, ... (source: `EleutherAI/lambada_openai`) | Generation | | |
| Mathematics Aptitude Test of Heuristics (`math`) | / | Generation | ✅ | |
| Mostly Basic Python Problems (`mbpp`) | `full` (default), `sanitized` | Generation | | Pass@K |
| Massive Multitask Language Understanding (`mmlu`) | stem: `abstract_algebra`, `astronomy`, ... | MultipleChoice | | |
| | social_sciences: `econometrics`, `high_school_geography`, ... | | | |
| | humanities: `formal_logic`, `high_school_european_history`, ... | | | |
| | other: `anatomy`, `business_ethics`, ... | | | |
| Multi-turn Benchmark (`mt_bench`) | / | Generation | | Multi-turn GPTEval |
| Natural Questions (`nq`) | / | Generation | | |
| OpenBookQA (`openbookqa`) | `main` (default), `additional` | MultipleChoice | | Normalization |
| Penguins In A Table (`penguins_in_a_table`) | `bigbench` | MultipleChoice | | |
| Physical Interaction: Question Answering (`piqa`) | / | MultipleChoice | | |
| Question Answering in Context (`quac`) | / | Generation | | |
| ReAding Comprehension (`race`) | `high`, `middle` | MultipleChoice | | Normalization |
| Real Toxicity Prompts (`real_toxicity_prompts`) | / | Generation | | Perspective Toxicity |
| Recognizing Textual Entailment (`rte`) | `super_glue` | MultipleChoice | | |
| Social Interaction QA (`siqa`) | / | MultipleChoice | | |
| Stanford Question Answering Dataset (`squad`, `squad_v2`) | / | Generation | | |
| Story Cloze Test (`story_cloze`) | `2016` (default), `2018` | MultipleChoice | | Manually download |
| TL;DR (`tldr`) | / | Generation | | |
| TriviaQA (`triviaqa`) | `rc.wikipedia.nocontext` (default), `rc`, `rc.nocontext`, ... | Generation | | |
| TruthfulQA (`truthfulqa_mc`) | `multiple_choice` (default), `generation` (not supported) | MultipleChoice | | |
| Vicuna Bench (`vicuna_bench`) | / | Generation | | GPTEval |
| WebQuestions (`webq`) | / | Generation | | |
| Words in Context (`wic`) | `super_glue` | MultipleChoice | | |
| Winogender Schemas (`winogender`) | `main`, `gotcha` | MultipleChoice | | Group by gender |
| WSC273 (`winograd`) | `wsc273` (default), `wsc285` | MultipleChoice | | |
| WinoGrande (`winogrande`) | `winogrande_debiased` (default), ... | MultipleChoice | | |
| Conference on Machine Translation (`wmt21`, `wmt19`, ...) | `en-ro`, `ro-en`, ... | Generation | | |
| Winograd Schema Challenge (`wsc`) | `super_glue` | MultipleChoice | | |
| A Multilingual Dataset for Causal Commonsense Reasoning (`xcopa`) | `et`, `zh`, ... | MultipleChoice | | |
| Large-Scale Multilingual Abstractive Summarization (`xlsum`) | `english`, `french`, ... | Generation | | |
| Cross-lingual Natural Language Inference (`xnli`) | `en`, `es`, ... | MultipleChoice | | |
| Extreme Summarization (`xsum`) | / | Generation | | |
| Crosslingual Winograd (`xwinograd`) | `en`, `fr`, ... | MultipleChoice | | |