AGIEval (EleutherAI#1359)
* add agieval

* fix typo

* add cloze / math exactmatch agieval tasks, rename

* update exact-match agieval tasks, allow for multiple-correct answers

* add more detail to readme

* don't parse_math_answer twice

---------

Co-authored-by: Alex Bäuerle <[email protected]>
haileyschoelkopf and Sparkier authored Mar 11, 2024
1 parent 4ab0759 commit a3e56af
Showing 23 changed files with 578 additions and 0 deletions.
114 changes: 114 additions & 0 deletions lm_eval/tasks/agieval/README.md
@@ -0,0 +1,114 @@
# AGIEval

### Paper

Title: AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Abstract: https://arxiv.org/abs/2304.06364

AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
This benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams.

Homepage: https://github.com/ruixiangcui/AGIEval

### Citation

```
@misc{zhong2023agieval,
title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models},
author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},
year={2023},
eprint={2304.06364},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

Please make sure to cite all the individual datasets in your paper when you use them. We provide the relevant citation information below:

```
@inproceedings{ling-etal-2017-program,
title = "Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems",
author = "Ling, Wang and
Yogatama, Dani and
Dyer, Chris and
Blunsom, Phil",
booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2017",
address = "Vancouver, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/P17-1015",
doi = "10.18653/v1/P17-1015",
pages = "158--167",
abstract = "Solving algebraic word problems requires executing a series of arithmetic operations{---}a program{---}to obtain a final answer. However, since programs can be arbitrarily complicated, inducing them directly from question-answer pairs is a formidable challenge. To make this task more feasible, we solve these problems by generating answer rationales, sequences of natural language and human-readable mathematical expressions that derive the final answer through a series of small steps. Although rationales do not explicitly specify programs, they provide a scaffolding for their structure via intermediate milestones. To evaluate our approach, we have created a new 100,000-sample dataset of questions, answers and rationales. Experimental results show that indirect supervision of program learning via answer rationales is a promising strategy for inducing arithmetic programs.",
}
@inproceedings{hendrycksmath2021,
title={Measuring Mathematical Problem Solving With the MATH Dataset},
author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
journal={NeurIPS},
year={2021}
}
@inproceedings{Liu2020LogiQAAC,
title={LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning},
author={Jian Liu and Leyang Cui and Hanmeng Liu and Dandan Huang and Yile Wang and Yue Zhang},
booktitle={International Joint Conference on Artificial Intelligence},
year={2020}
}
@inproceedings{zhong2019jec,
title={JEC-QA: A Legal-Domain Question Answering Dataset},
author={Zhong, Haoxi and Xiao, Chaojun and Tu, Cunchao and Zhang, Tianyang and Liu, Zhiyuan and Sun, Maosong},
booktitle={Proceedings of AAAI},
year={2020},
}
@article{Wang2021FromLT,
title={From LSAT: The Progress and Challenges of Complex Reasoning},
author={Siyuan Wang and Zhongkun Liu and Wanjun Zhong and Ming Zhou and Zhongyu Wei and Zhumin Chen and Nan Duan},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2021},
volume={30},
pages={2201-2216}
}
```

### Groups and Tasks

#### Groups

- `agieval`: Evaluates all tasks listed below.

- `agieval_en`: Evaluates all English subtasks: `agieval_aqua_rat`, `agieval_gaokao_english`, `agieval_logiqa_en`, `agieval_lsat_*`, `agieval_sat_*`, `agieval_math`

- `agieval_cn`: Evaluates all Chinese subtasks:
`agieval_gaokao_biology`, `agieval_gaokao_chemistry`, `agieval_gaokao_chinese`, `agieval_gaokao_geography`,
`agieval_gaokao_history`, `agieval_gaokao_mathqa`, `agieval_gaokao_mathcloze`, `agieval_gaokao_physics`, `agieval_jec_qa_ca`, `agieval_jec_qa_kd`, `agieval_logiqa_zh`

- `agieval_nous`: Evaluates a specific subset of AGIEval tasks (multiple-choice and English-only), namely those in https://github.com/teknium1/LLM-Benchmark-Logs/blob/main/benchmark-logs/Mistral-7B-Base.md

#### Tasks

- `agieval_aqua_rat`
- `agieval_gaokao_biology`
- `agieval_gaokao_chemistry`
- `agieval_gaokao_chinese`
- `agieval_gaokao_english`
- `agieval_gaokao_geography`
- `agieval_gaokao_history`
- `agieval_gaokao_mathqa`
- `agieval_gaokao_mathcloze`
- `agieval_gaokao_physics`
- `agieval_jec_qa_ca`
- `agieval_jec_qa_kd`
- `agieval_logiqa_en`
- `agieval_logiqa_zh`
- `agieval_lsat_ar`
- `agieval_lsat_lr`
- `agieval_lsat_rc`
- `agieval_sat_en`
- `agieval_sat_en_without_passage`
- `agieval_sat_math`
- `agieval_math`
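
The `agieval_en` entry above uses `*` wildcards (`agieval_lsat_*`, `agieval_sat_*`). As a rough illustration of which concrete tasks those patterns cover, the sketch below expands them against the task list with Python's standard `fnmatch` module. The helper name `expand` is hypothetical and is not part of the harness; this only mirrors the lists in this README and says nothing about how the harness itself resolves groups.

```python
from fnmatch import fnmatchcase

# Task names copied from the "Tasks" list above.
TASKS = [
    "agieval_aqua_rat",
    "agieval_gaokao_biology",
    "agieval_gaokao_chemistry",
    "agieval_gaokao_chinese",
    "agieval_gaokao_english",
    "agieval_gaokao_geography",
    "agieval_gaokao_history",
    "agieval_gaokao_mathqa",
    "agieval_gaokao_mathcloze",
    "agieval_gaokao_physics",
    "agieval_jec_qa_ca",
    "agieval_jec_qa_kd",
    "agieval_logiqa_en",
    "agieval_logiqa_zh",
    "agieval_lsat_ar",
    "agieval_lsat_lr",
    "agieval_lsat_rc",
    "agieval_sat_en",
    "agieval_sat_en_without_passage",
    "agieval_sat_math",
    "agieval_math",
]

# Patterns copied from the `agieval_en` bullet above.
EN_PATTERNS = [
    "agieval_aqua_rat",
    "agieval_gaokao_english",
    "agieval_logiqa_en",
    "agieval_lsat_*",
    "agieval_sat_*",
    "agieval_math",
]


def expand(patterns, tasks):
    """Hypothetical helper: tasks matching any shell-style pattern, in list order."""
    return [t for t in tasks if any(fnmatchcase(t, p) for p in patterns)]


en_tasks = expand(EN_PATTERNS, TASKS)
# Every task that is not English falls into the Chinese group listed above.
cn_tasks = [t for t in TASKS if t not in en_tasks]

print(len(en_tasks), "English tasks,", len(cn_tasks), "Chinese tasks")
```

Under these patterns the 21 tasks split into 10 English and 11 Chinese tasks, which matches the `agieval_cn` list above.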
24 changes: 24 additions & 0 deletions lm_eval/tasks/agieval/aqua-rat.yaml
@@ -0,0 +1,24 @@
group:
- agieval
- agieval_en
- agieval_nous
task: agieval_aqua_rat
dataset_path: hails/agieval-aqua-rat
dataset_name: null
output_type: multiple_choice
training_split: null
validation_split: null
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{gold}}"
doc_to_choice: "{{choices}}"
process_results: !function utils.process_results_mcqa
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
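
The config above reports `acc` and `acc_norm` and routes scoring through `utils.process_results_mcqa`, which is not shown in this excerpt. The following is a minimal sketch of how such a scorer could work, assuming that `gold` is a list of correct choice indices (the commit message mentions allowing multiple correct answers) and that `acc_norm` normalizes each log-likelihood by the choice's length in characters. The function name `score_mcqa` and its signature are hypothetical, not the harness's API.

```python
def score_mcqa(loglikelihoods, choices, gold):
    """Hypothetical scorer for one multiple-choice document.

    loglikelihoods: one model log-likelihood per answer choice.
    choices: the answer-choice strings (assumed non-empty).
    gold: indices of every correct choice (assumed to be a list).
    """
    indices = range(len(choices))
    # acc: pick the choice with the highest raw log-likelihood.
    best = max(indices, key=lambda i: loglikelihoods[i])
    # acc_norm: divide by the choice length in characters, so that
    # longer answers are not penalized merely for having more tokens.
    best_norm = max(indices, key=lambda i: loglikelihoods[i] / len(choices[i]))
    return {
        "acc": float(best in gold),
        "acc_norm": float(best_norm in gold),
    }


# Example: the longer choice loses on raw score but wins after normalization.
example = score_mcqa(
    loglikelihoods=[-4.0, -6.0],
    choices=["(A)1", "(B)a much longer answer"],
    gold=[1],
)
print(example)
```

Checking membership in `gold` rather than equality is what lets a document with several correct options count any of them as a hit.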
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-biology.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_gaokao_biology
dataset_path: hails/agieval-gaokao-biology
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-chemistry.yaml
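
Most task files in this commit start with `include: aqua-rat.yaml` and then restate only `group`, `task`, and `dataset_path`. A plausible reading, sketched below, is that the included file supplies defaults and the including file overrides them key by key at the top level. This is an assumption about the merge semantics rather than the harness's actual loader, and `resolve_include` is a hypothetical helper operating on plain dictionaries instead of real YAML files.

```python
def resolve_include(config, configs):
    """Hypothetical helper: merge `config` over the file named by its `include` key.

    Top-level keys in the including file replace those of the included
    file wholesale (a shallow override), so a list such as `group` is
    replaced rather than appended to.
    """
    config = dict(config)  # avoid mutating the caller's dictionary
    base_name = config.pop("include", None)
    if base_name is None:
        return config
    merged = resolve_include(configs[base_name], configs)
    merged.update(config)
    return merged


# Abbreviated stand-ins for two of the YAML files in this commit.
configs = {
    "aqua-rat.yaml": {
        "group": ["agieval", "agieval_en", "agieval_nous"],
        "task": "agieval_aqua_rat",
        "dataset_path": "hails/agieval-aqua-rat",
        "output_type": "multiple_choice",
        "test_split": "test",
    },
    "gaokao-biology.yaml": {
        "include": "aqua-rat.yaml",
        "group": ["agieval", "agieval_cn"],
        "task": "agieval_gaokao_biology",
        "dataset_path": "hails/agieval-gaokao-biology",
    },
}

biology = resolve_include(configs["gaokao-biology.yaml"], configs)
print(biology["task"], biology["group"], biology["output_type"])
```

Replacing `group` wholesale is consistent with the README, which lists `agieval_gaokao_biology` under `agieval_cn` but not under `agieval_en` or `agieval_nous`.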
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_gaokao_chemistry
dataset_path: hails/agieval-gaokao-chemistry
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-chinese.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_gaokao_chinese
dataset_path: hails/agieval-gaokao-chinese
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-english.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_en # categorizing as EN because the AGIEval codebase lists this as in `english_qa_tasks`
task: agieval_gaokao_english
dataset_path: hails/agieval-gaokao-english
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-geography.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_gaokao_geography
dataset_path: hails/agieval-gaokao-geography
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-history.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_gaokao_history
dataset_path: hails/agieval-gaokao-history
25 changes: 25 additions & 0 deletions lm_eval/tasks/agieval/gaokao-mathcloze.yaml
@@ -0,0 +1,25 @@
group:
- agieval
- agieval_cn
task: agieval_gaokao_mathcloze
dataset_path: hails/agieval-gaokao-mathcloze
dataset_name: null
output_type: generate_until
training_split: null
validation_split: null
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{answer}}"
process_results: !function utils.process_results
generation_kwargs:
  max_gen_toks: 32
  do_sample: False
  temperature: 0.0
  until:
    - "Q:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
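
The `generation_kwargs` above stop generation at the string `"Q:"`, presumably so the model does not run on into a new question. The sketch below shows the general idea of truncating generated text at the earliest stop string. It is a simplified illustration only: the harness's real stopping logic lives in its model backends and is not part of this commit, and `truncate_at_stop` is a hypothetical name.

```python
def truncate_at_stop(text, until):
    """Hypothetical helper: cut `text` at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in until:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# A made-up generation in which the model starts a new question after answering.
raw = " 3/4\n\nQ: next question"
answer = truncate_at_stop(raw, ["Q:"]).strip()
print(repr(answer))
```

With `max_gen_toks: 32`, the answer has to appear within a short budget, so trimming at the stop string mainly protects the answer parser from trailing text.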
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-mathqa.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_gaokao_mathqa
dataset_path: hails/agieval-gaokao-mathqa
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/gaokao-physics.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_gaokao_physics
dataset_path: hails/agieval-gaokao-physics
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/jec-qa-ca.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_jec_qa_ca
dataset_path: hails/agieval-jec-qa-ca
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/jec-qa-kd.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_jec_qa_kd
dataset_path: hails/agieval-jec-qa-kd
7 changes: 7 additions & 0 deletions lm_eval/tasks/agieval/logiqa-en.yaml
@@ -0,0 +1,7 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_nous
- agieval_en
task: agieval_logiqa_en
dataset_path: hails/agieval-logiqa-en
6 changes: 6 additions & 0 deletions lm_eval/tasks/agieval/logiqa-zh.yaml
@@ -0,0 +1,6 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_cn
task: agieval_logiqa_zh
dataset_path: hails/agieval-logiqa-zh
7 changes: 7 additions & 0 deletions lm_eval/tasks/agieval/lsat-ar.yaml
@@ -0,0 +1,7 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_nous
- agieval_en
task: agieval_lsat_ar
dataset_path: hails/agieval-lsat-ar
7 changes: 7 additions & 0 deletions lm_eval/tasks/agieval/lsat-lr.yaml
@@ -0,0 +1,7 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_nous
- agieval_en
task: agieval_lsat_lr
dataset_path: hails/agieval-lsat-lr
7 changes: 7 additions & 0 deletions lm_eval/tasks/agieval/lsat-rc.yaml
@@ -0,0 +1,7 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_nous
- agieval_en
task: agieval_lsat_rc
dataset_path: hails/agieval-lsat-rc
25 changes: 25 additions & 0 deletions lm_eval/tasks/agieval/math.yaml
@@ -0,0 +1,25 @@
group:
- agieval
- agieval_en
task: agieval_math
dataset_path: hails/agieval-math
dataset_name: null
output_type: generate_until
training_split: null
validation_split: null
test_split: test
doc_to_text: "{{query}}"
doc_to_target: "{{answer}}"
process_results: !function utils.process_results
generation_kwargs:
  max_gen_toks: 32
  do_sample: False
  temperature: 0.0
  until:
    - "Q:"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
7 changes: 7 additions & 0 deletions lm_eval/tasks/agieval/sat-en-without-passage.yaml
@@ -0,0 +1,7 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_nous
- agieval_en
task: agieval_sat_en_without_passage
dataset_path: hails/agieval-sat-en-without-passage
7 changes: 7 additions & 0 deletions lm_eval/tasks/agieval/sat-en.yaml
@@ -0,0 +1,7 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_nous
- agieval_en
task: agieval_sat_en
dataset_path: hails/agieval-sat-en
7 changes: 7 additions & 0 deletions lm_eval/tasks/agieval/sat-math.yaml
@@ -0,0 +1,7 @@
include: aqua-rat.yaml
group:
- agieval
- agieval_nous
- agieval_en
task: agieval_sat_math
dataset_path: hails/agieval-sat-math