
# Supported Datasets of LLMBox

We currently support 59+ commonly used datasets for LLMs.

## Understanding Evaluation Type

Each dataset is either a multiple-choice dataset or a generation dataset. You can find the difference between them here.
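The distinction can be illustrated with a toy sketch (this is not LLMBox's actual implementation): a multiple-choice dataset is scored by ranking the model's score for each candidate option, while a generation dataset compares free-form output against a reference with a text metric.

```python
# Toy illustration of the two evaluation types (not LLMBox's real code).

def pick_option(option_scores):
    """Multiple-choice: return the index of the highest-scoring option,
    e.g. the option with the highest model log-likelihood."""
    return max(range(len(option_scores)), key=lambda i: option_scores[i])

def exact_match(generated, reference):
    """Generation: a simple exact-match metric after light normalization."""
    return generated.strip().lower() == reference.strip().lower()

# Multiple-choice: option at index 1 has the highest score.
print(pick_option([-4.2, -1.3, -6.8, -5.0]))  # → 1
# Generation: output matches the reference after normalization.
print(exact_match(" Paris ", "paris"))  # → True
```

Real multiple-choice evaluation scores each option with the model's log-likelihood; the function names above are illustrative only.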

## Understanding Subsets

Some datasets have multiple subsets. For example, the Massive Multitask Language Understanding (mmlu) dataset contains 57 subsets categorized into four groups: stem, social_sciences, humanities, and other.

Other datasets are themselves subsets of a larger collection. For example, Choice Of Plausible Alternatives (copa) is a subset of super_glue.

See how to load datasets with subsets.
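As a rough sketch, subsets are selected when launching an evaluation run. Note that the `inference.py` entry point and the `dataset:subset` syntax below are assumptions for illustration; consult the usage docs linked above for the exact invocation.

```shell
# Hypothetical invocation (entry point and subset syntax assumed):
# evaluate only two MMLU subsets with a given model.
python inference.py -m <model_name> -d mmlu:abstract_algebra,anatomy
```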

## Understanding CoT

Some datasets support Chain-of-Thought reasoning. For example, Grade School Math 8K (gsm8k) supports three types of CoT: base, least_to_most, and pal.
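The CoT variant is likewise chosen per run. A hypothetical command line follows; the `--cot` flag name is an assumption for illustration, so check the repository's usage docs for the real flag.

```shell
# Hypothetical invocation (flag name assumed): evaluate gsm8k
# with least-to-most prompting instead of the base CoT.
python inference.py -m <model_name> -d gsm8k --cot least_to_most
```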

## Supported Datasets

- 🔥 Recently supported datasets: imbue_code, imbue_public, and imbue_private.
- The evaluation code can be found here: benchmarking llama3.
| Dataset | Subsets / Collections | Evaluation Type | CoT | Notes |
|---|---|---|---|---|
| AGIEval (`agieval`, alias of `agieval_single_choice` and `agieval_cot`) | English: sat-en, sat-math, lsat-ar, lsat-lr, lsat-rc, logiqa-en, aqua-rat, sat-en-without-passage<br>Chinese: gaokao-chinese, gaokao-geography, gaokao-history, gaokao-biology, gaokao-chemistry, gaokao-english, logiqa-zh | MultipleChoice | ✅ | |
| | jec-qa-kd, jec-qa-ca, math, gaokao-physics, gaokao-mathcloze, gaokao-mathqa | Generation | ✅ | |
| Alpaca Eval (`alpaca_eval`) | / | Generation | | Single GPTEval |
| Adversarial Natural Language Inference (`anli`) | Round2 (default) | MultipleChoice | | |
| AI2's Reasoning Challenge (`arc`) | ARC-Easy, ARC-Challenge | MultipleChoice | | Normalization |
| BIG-Bench Hard (`bbh`) | boolean_expressions, ... | Generation | | |
| Boolean Questions (`boolq`) | super_glue | MultipleChoice | | |
| CommitmentBank (`cb`) | super_glue | MultipleChoice | | |
| C-Eval (`ceval`) | stem: advanced_mathematics, college_chemistry, ...<br>social science: business_administration, college_economics, ...<br>humanities: art_studies, chinese_language_and_literature, ...<br>other: accountant, basic_medicine, ... | MultipleChoice | | |
| Massive Multitask Language Understanding in Chinese (`cmmlu`) | stem: anatomy, astronomy, ...<br>social science: ancient_chinese, business_ethics, ...<br>humanities: arts, chinese_history, ...<br>other: agronomy, chinese_driving_rule, ... | MultipleChoice | | |
| CNN Dailymail (`cnn_dailymail`) | 3.0.0 (default), ... | Generation | | |
| Reasoning About Colored Objects (`color_objects`) | bigbench (reasoning_about_colored_objects) | Generation | | |
| Commonsense QA (`commonsenseqa`) | / | MultipleChoice | | |
| Choice Of Plausible Alternatives (`copa`) | super_glue | MultipleChoice | | |
| Conversational Question Answering (`coqa`) | / | Generation | | Download: train, dev |
| CrowS-Pairs (`crows_pairs`) | / | MultipleChoice | | |
| Discrete Reasoning Over the content of Paragraphs (`drop`) | / | Generation | | |
| GAOKAO (`gaokao`) | Chinese: 2010-2022_Chinese_Modern_Lit, 2010-2022_Chinese_Lang_and_Usage_MCQs<br>English: 2010-2022_English_Reading_Comp, 2010-2022_English_Fill_in_Blanks, ...<br>2010-2022_Math_II_MCQs, 2010-2022_Math_I_MCQs, ... | Generation | | Metric: Exam scoring |
| Google-Proof Q&A (`gpqa`) | gpqa_main (default), gpqa_extended, ... | MultipleChoice | | Tutorial |
| Grade School Math 8K (`gsm8k`) | main (default), socratic | Generation | ✅ | Code exec |
| HaluEval (`halueval`) | dialogue_samples, qa_samples, summarization_samples | Generation | | |
| HellaSWAG (`hellaswag`) | / | MultipleChoice | | |
| HumanEval (`humaneval`) | / | Generation | | Pass@K |
| Instruction-Following Evaluation (`ifeval`) | / | Generation | | |
| Imbue Code Comprehension (`imbue_code`) | / | MultipleChoice | | |
| Imbue High Quality Private Evaluations (`imbue_private`) | / | MultipleChoice | | |
| Imbue High Quality Public Evaluations (`imbue_public`) | / | MultipleChoice | | |
| LAnguage Modeling Broadened to Account for Discourse Aspects (`lambada`) | default (default), de, ... (source: EleutherAI/lambada_openai) | Generation | | |
| Mathematics Aptitude Test of Heuristics (`math`) | / | Generation | | |
| Mostly Basic Python Problems (`mbpp`) | full (default), sanitized | Generation | | Pass@K |
| Massive Multitask Language Understanding (`mmlu`) | stem: abstract_algebra, astronomy, ...<br>social_sciences: econometrics, high_school_geography, ...<br>humanities: formal_logic, high_school_european_history, ...<br>other: anatomy, business_ethics, ... | MultipleChoice | | |
| Multi-turn Benchmark (`mt_bench`) | / | Generation | | Multi-turn GPTEval |
| Natural Questions (`nq`) | / | Generation | | |
| OpenBookQA (`openbookqa`) | main (default), additional | MultipleChoice | | Normalization |
| Penguins In A Table (`penguins_in_a_table`) | bigbench | MultipleChoice | | |
| Physical Interaction: Question Answering (`piqa`) | / | MultipleChoice | | |
| Question Answering in Context (`quac`) | / | Generation | | |
| ReAding Comprehension (`race`) | high, middle | MultipleChoice | | Normalization |
| Real Toxicity Prompts (`real_toxicity_prompts`) | / | Generation | | Perspective Toxicity |
| Recognizing Textual Entailment (`rte`) | super_glue | MultipleChoice | | |
| Social Interaction QA (`siqa`) | / | MultipleChoice | | |
| Stanford Question Answering Dataset (`squad`, `squad_v2`) | / | Generation | | |
| Story Cloze Test (`story_cloze`) | 2016 (default), 2018 | MultipleChoice | | Manually download |
| TL;DR (`tldr`) | / | Generation | | |
| TriviaQA (`triviaqa`) | rc.wikipedia.nocontext (default), rc, rc.nocontext, ... | Generation | | |
| TruthfulQA (`truthfulqa_mc`) | multiple_choice (default), generation (not supported) | MultipleChoice | | |
| Vicuna Bench (`vicuna_bench`) | / | Generation | | GPTEval |
| WebQuestions (`webq`) | / | Generation | | |
| Words in Context (`wic`) | super_glue | MultipleChoice | | |
| Winogender Schemas (`winogender`) | main, gotcha | MultipleChoice | | Group by gender |
| WSC273 (`winograd`) | wsc273 (default), wsc285 | MultipleChoice | | |
| WinoGrande (`winogrande`) | winogrande_debiased (default), ... | MultipleChoice | | |
| Conference on Machine Translation (`wmt21`, `wmt19`, ...) | en-ro, ro-en, ... | Generation | | |
| Winograd Schema Challenge (`wsc`) | super_glue | MultipleChoice | | |
| A Multilingual Dataset for Causal Commonsense Reasoning (`xcopa`) | et, zh, ... | MultipleChoice | | |
| Large-Scale Multilingual Abstractive Summarization (`xlsum`) | english, french, ... | Generation | | |
| Cross-lingual Natural Language Inference (`xnli`) | en, es, ... | MultipleChoice | | |
| Extreme Summarization (`xsum`) | / | Generation | | |
| Crosslingual Winograd (`xwinograd`) | en, fr, ... | MultipleChoice | | |
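The Pass@K metric noted for humaneval and mbpp is commonly computed with the unbiased estimator from the Codex paper; a minimal sketch follows (independent of LLMBox's own implementation): given n generated samples per problem, of which c pass the tests, pass@k = 1 − C(n−c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, passes.  pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k: some draw must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.3: with 3/10 correct, pass@1 is the raw rate
```

Averaging this estimator over all problems in the benchmark gives the reported Pass@K score.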