Update README of AutoGenBench MATH scenario #1891

Merged 1 commit on Mar 7, 2024
samples/tools/autogenbench/scenarios/MATH/README.md (18 changes: 17 additions & 1 deletion)
@@ -11,9 +11,25 @@ autogenbench tabulate Results/math_two_agents

By default, only a small subset (17 of the 5000) of the MATH problems is exposed. Edit `Scripts/init_tasks.py` to expose more tasks.
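The contents of `Scripts/init_tasks.py` are not reproduced here, so the following is only a rough, hypothetical sketch of what "exposing more tasks" involves: selecting additional problem files from the downloaded MATH test split and writing them out as tasks. All paths, variable names, and the output file name below are illustrative assumptions, not the script's actual code.

```python
# Hypothetical sketch only -- names and paths are illustrative, not the
# actual contents of Scripts/init_tasks.py.
import json
import pathlib
import random

MATH_TEST_DIR = pathlib.Path("MATH/test")  # assumed location of the downloaded dataset
N_PROBLEMS = 50                            # raise this number to expose more tasks


def select_problems(n: int, seed: int = 42) -> list[pathlib.Path]:
    """Pick a reproducible random sample of problem files from the test split."""
    all_problems = sorted(MATH_TEST_DIR.glob("*/*.json"))
    random.Random(seed).shuffle(all_problems)
    return all_problems[:n]


def write_tasks(problems: list[pathlib.Path], out_file: str = "math_tasks.jsonl") -> None:
    """Write one task per line, keeping the problem statement and reference solution."""
    with open(out_file, "w") as f:
        for path in problems:
            data = json.loads(path.read_text())
            f.write(json.dumps({"id": path.stem,
                                "problem": data["problem"],
                                "solution": data["solution"]}) + "\n")


if __name__ == "__main__":
    write_tasks(select_problems(N_PROBLEMS))
```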

*Note*: Scoring is done by prompting the LLM (ideally GPT-4) with both the proposed answer and the ground truth answer, and asking the LLM to grade itself.
## Note on automated evaluation
In this scenario, we adopted an automated evaluation pipeline (from the [AutoGen](https://arxiv.org/abs/2308.08155) evaluation) that uses an LLM to compare the results. Thus, the metric above is only an estimate of the agent's performance on math problems. A similar practice of using an LLM as a judge on the MATH dataset appears in the [Cumulative Reasoning](https://arxiv.org/abs/2308.04371) paper ([code](https://github.com/iiis-ai/cumulative-reasoning/blob/main/MATH/math-cr-4shot.py)).

The static checking in the MATH dataset requires an exact match (e.g., comparing `2.0` and `2` results in `False`). We haven't found an established way to compare answers accurately, so human involvement is still needed to confirm the results. In AutoGen, the conversation ends at "TERMINATE" by default. To enable automated answer extraction and evaluation, we prompt an LLM with (1) the given problem, (2) the ground-truth answer, and (3) the last response from the solver, asking it to extract the answer and compare it with the ground truth.
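The exact prompt and checker code used by the scenario are not shown here; the sketch below is a minimal illustration of the idea, assuming the OpenAI Python client. The prompt wording, the `grade_answer` helper, and the model name are assumptions for illustration, not AutoGenBench's actual implementation.

```python
# Minimal sketch of LLM-based answer checking -- not the exact prompt or code
# used by AutoGenBench; prompt wording, helper name, and model are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_answer(problem: str, ground_truth: str, solver_reply: str,
                 model: str = "gpt-4") -> bool:
    """Ask an LLM to extract the solver's final answer and compare it with the ground truth."""
    prompt = (
        "You are grading a math answer.\n"
        f"Problem:\n{problem}\n\n"
        f"Ground-truth answer:\n{ground_truth}\n\n"
        f"Solver's last message:\n{solver_reply}\n\n"
        "Extract the solver's final answer, decide whether it is mathematically "
        "equivalent to the ground truth (e.g., 2.0 and 2 are equivalent), and "
        "reply with a single word: CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```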

We evaluated the 17 problems 3 times each and went through the results manually to check the answers. Compared with the automated evaluation (using gpt-4-0613), the automated evaluation marked 1 correct answer as wrong (a False Negative) in 2 of the 3 trials, so 49/51 problems were evaluated correctly. We also went through 200 randomly sampled problems from the whole dataset to check the results: there was 1 False Negative and 2 False Positives.

We note that False Positives are also possible, due to LLM hallucination and the variety of the problems.
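To make the error terminology above concrete, here is a small sketch of how one might tally the automated judge against manual labels. The `(manual, automated)` pairs are made-up illustrative data, not the actual trial results.

```python
# Illustrative tally of automated-judge errors against manual grading.
# The (manual_correct, auto_correct) pairs below are made up for illustration.
pairs = [(True, True), (True, False), (False, False), (False, True)]

false_negatives = sum(1 for manual, auto in pairs if manual and not auto)  # correct answers marked wrong
false_positives = sum(1 for manual, auto in pairs if not manual and auto)  # wrong answers marked correct
agreement = sum(1 for manual, auto in pairs if manual == auto) / len(pairs)

print(f"FN={false_negatives}, FP={false_positives}, agreement={agreement:.0%}")
```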

## References
**Measuring Mathematical Problem Solving With the MATH Dataset**<br/>
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt<br/>
[https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874)

**AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation**<br/>
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang and Chi Wang<br/>
[https://arxiv.org/abs/2308.08155](https://arxiv.org/abs/2308.08155)

**Cumulative Reasoning with Large Language Models**<br/>
Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao<br/>
[https://arxiv.org/abs/2308.04371](https://arxiv.org/abs/2308.04371)