
Commit 4699fc9

Authored by afourney, qingyun-wu, and LeoLjl

Adds the GAIA benchmark to the Testbed. This PR depends on microsoft#792 (microsoft#810)

* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepared collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Added initial support for the GAIA benchmark.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Refined GAIA support, and broke scenarios down by difficulty.
* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.
* Added instructions for cloning GAIA.
* Updated README to fix some typos.
* Added a script to format GAIA results for the leaderboard.
* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py

Co-authored-by: Qingyun Wu <[email protected]>
Co-authored-by: LeoLjl <[email protected]>

1 parent 6cd238e commit 4699fc9

File tree

7 files changed: +502 −1 lines

samples/tools/testbed/README.md (+24 −1)
@@ -19,7 +19,8 @@ The Testbed also requires Docker (Desktop or Engine) AND the __python docker__ l
 To run the Testbed, simply execute
 ``python run_scenarios.py scenarios/Examples``

-The default is to run each scenario once time. To run each scenario 10 times, use:
+The default is to run each scenario once. To run each scenario 10 times, use:

 ``python run_scenarios.py --repeat 10 scenarios/Examples``

 The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options:
@@ -193,3 +194,25 @@ python ./run_scenarios.py scenarios/HumanEval/human_eval_two_agents_gpt35.jsonl
python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 | python utils/metrics_human_eval.py > human_eval_results_gpt35.csv
cat human_eval_results_gpt35.csv
```

## (Example) Running GAIA

The Testbed can also be used to run the recently released [GAIA benchmark](https://huggingface.co/gaia-benchmark). This integration is presently experimental, and needs further validation. In this scenario, agents are presented with a series of questions that may include file references, or multi-modal input. Agents must then provide a `FINAL ANSWER`, which is considered correct if it (nearly) exactly matches an unambiguously accepted answer.

Accessing this scenario type requires downloading and converting the GAIA dataset, running the Testbed, collating the results, and finally computing the metrics. The following commands accomplish this, running each test instance once with GPT-4:

```
# Clone the GAIA dataset repo (assuming a 'repos' folder in your home directory)
cd ~/repos
git clone https://huggingface.co/datasets/gaia-benchmark/GAIA

# Expand GAIA
cd ~/repos/autogen/samples/tools/testbed
python ./utils/expand_gaia.py ~/repos/GAIA

# Run GAIA
python ./run_scenarios.py ./scenarios/GAIA/gaia_validation_level_1__two_agents_gpt4.jsonl

# Compute Metrics
python utils/collate_gaia_csv.py ./results/gaia_validation_level_1__two_agents_gpt4 | python utils/metrics_gaia.py
```
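The "(nearly) exact" match described above is what the collation script implements: an answer is lower-cased, trimmed, runs of whitespace are collapsed, and trailing punctuation is stripped before comparison. A minimal sketch of that normalization:

```python
import re

def normalize_answer(a):
    # Lower-case and trim, collapse runs of whitespace, then strip trailing . ! ?
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip().lower()))

# Two answers match if their normalized forms are equal:
print(normalize_answer("  Paris. ") == normalize_answer("paris"))  # True
```

This is why small formatting differences (case, trailing periods, extra spaces) do not count against an otherwise correct answer, while any substantive difference still does.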
@@ -0,0 +1 @@
+__EXPECTED_ANSWER__
@@ -0,0 +1,66 @@
import os
import json
import autogen
from datetime import datetime
import testbed_utils

testbed_utils.init()
##############################

GAIA_SYSTEM_MESSAGE = (
    "You are a helpful AI assistant, and today's date is "
    + datetime.now().date().isoformat()
    + """.
I will ask you a question. Answer this question using your coding and language skills.
In the following cases, suggest python code (presented in a coding block beginning ```python) or shell script (presented in a coding block beginning ```sh) for the user to execute:
    1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
    2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
Answer the question step by step if you need to. If a plan is not provided, explain your plan first. Be clear which step uses code, and which step uses your language skill.
The user cannot provide any other feedback or perform any other action beyond executing the code appearing in the code block. The user can't modify your code, so do not suggest incomplete code which requires users to modify. Don't use a code block if it's not intended to be executed by the user. Don't include multiple code blocks in one response. Do not ask users to copy and paste code or results. Instead, use the 'print' function for the output when relevant. Check the execution result reported by the user.
If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed, or if the task is not solved even after the code is executed successfully, analyze the problem, revisit your assumptions, collect additional info you need, and think of a different approach to try.
When you find an answer, report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma-separated list of numbers and/or strings.
If you are asked for a number, don't use commas to write your number, and don't use units such as $ or percent signs unless specified otherwise.
If you are asked for a string, don't use articles or abbreviations (e.g. for cities), and write digits in plain text unless specified otherwise.
If you are asked for a comma-separated list, apply the above rules depending on whether the element to be put in the list is a number or a string.
""".strip()
)

config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["__MODEL__"]},
)

assistant = autogen.AssistantAgent(
    "assistant",
    system_message=GAIA_SYSTEM_MESSAGE,
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    llm_config=testbed_utils.default_llm_config(config_list, timeout=180),
)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
    default_auto_reply="",
)

filename = "__FILE_NAME__".strip()
question = """
__PROMPT__
""".strip()

if len(filename) > 0:
    question = f"Consider the file '{filename}', which can be read from the current working directory. {question}"

user_proxy.initiate_chat(assistant, message=question)

##############################
testbed_utils.finalize(agents=[assistant, user_proxy])
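Both agents above share the same termination predicate: the chat ends as soon as a message contains the `FINAL ANSWER` marker. Pulled out on its own, the check behaves like this (the example messages are made up for illustration):

```python
# The same lambda used for is_termination_msg in scenario.py:
# true when the message content contains the FINAL ANSWER marker.
is_termination_msg = lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0

print(is_termination_msg({"content": "FINAL ANSWER: 42"}))        # True
print(is_termination_msg({"content": "Let me check the file."}))  # False
print(is_termination_msg({}))                                     # False (missing content)
```

Note the `.get("content", "")` default, which keeps the predicate from raising on messages that carry no text content.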
@@ -0,0 +1,128 @@
import os
import json
import re
import sys
import argparse


def normalize_answer(a):
    # Lower case
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip().lower()))


def collate(results_dir):
    """
    Collate the results of running GAIA.

    Args:
        results_dir (path): The folder where the results were saved.
    """

    all_results = list()
    max_instances = 0

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        # Collect the results vector
        results = [test_id]

        instance = 0
        instance_dir = os.path.join(test_path, str(instance))
        while os.path.isdir(instance_dir):
            expected_answer_file = os.path.join(instance_dir, "expected_answer.txt")
            if not os.path.isfile(expected_answer_file):
                # Expected answer is missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            expected_answer = "!!!NULL ANSWER!!!"
            with open(expected_answer_file, "rt") as fh:
                expected_answer = fh.read().strip()

            console_log_file = os.path.join(instance_dir, "console_log.txt")
            if not os.path.isfile(console_log_file):
                # Console log file missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

            final_answer = ""
            m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
            if m:
                final_answer = m.group(1).strip()

            # print(f"Expected Answer: {expected_answer}\nAutogen Answer: {final_answer}\n")

            if normalize_answer(expected_answer) == normalize_answer(final_answer):
                results.append("1")
            else:
                results.append("-1")

            instance += 1
            instance_dir = os.path.join(test_path, str(instance))

        max_instances = max(max_instances, instance)

        # Buffer the results
        all_results.append(results)

    # Create a header
    header = "TestId"
    for i in range(0, max_instances):
        header += ",Trial" + str(i)
    print(header)

    # Print a fully-populated table of results
    for r in all_results:
        while len(r) < max_instances + 1:
            r.append("")
        print(",".join(r))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios and output them to a CSV. The CSV format is as follows:

TestId, Trial0, Trial1, ..., TrialN
uuid_1, x_10, x_11, ..., x_1N
uuid_2, x_20, x_21, ..., x_2N
...
uuid_M, x_M0, x_M1, ..., x_MN

Where uuid_i is the identifier of the ith test question, and x_ij is 1 or -1 depending on whether the test passed or failed, respectively. If data for a trial is missing (e.g., due to a runtime error), the value will be an empty string "".
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)
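The answer itself is pulled from the console log with a non-greedy match that stops at the first newline after the `FINAL ANSWER:` marker, so multi-line answers would be truncated to their first line. A small worked example (the log text here is invented):

```python
import re

# A hypothetical console log, as captured from a two-agent GAIA run
console_log = "assistant (to user_proxy):\n\nFINAL ANSWER: Lake Victoria\n\nTERMINATE\n"

final_answer = ""
# Non-greedy capture up to the first newline after the marker
m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
if m:
    final_answer = m.group(1).strip()

print(final_answer)  # Lake Victoria
```

This matches the prompt's instruction that the final answer be a single number, short phrase, or comma-separated list, all of which fit on one line.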
@@ -0,0 +1,76 @@
import os
import json
import re
import sys
import argparse


def normalize_answer(a):
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
    # Trim again
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip())).strip()


def collate(results_dir, instance=0):
    """
    Collate the results of running GAIA. Print the results in the format accepted by the leaderboard.

    Args:
        results_dir (path): The folder where the results were saved.
    """

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        instance_dir = os.path.join(test_path, str(instance))
        console_log_file = os.path.join(instance_dir, "console_log.txt")

        final_answer = ""
        if os.path.isfile(console_log_file):
            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

                final_answer = ""
                m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
                if m:
                    final_answer = normalize_answer(m.group(1))

                # Clean up the GAIA logs so they don't have the Docker setup preamble
                m = re.search(r"^.*?\r?\n(user_proxy \(to assistant\).*$)", console_log, re.DOTALL)
                if m:
                    console_log = m.group(1)

                print(json.dumps({"task_id": test_id, "model_answer": final_answer, "reasoning_trace": console_log}))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios into the JSONL format that can be submitted to the GAIA leaderboard.

NOTE: You will likely need to concatenate results for level 1, level 2 and level 3 to form a complete submission.
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)
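Each line the script prints is a self-contained JSON object in the layout the GAIA leaderboard expects: a task identifier, the model's answer, and the reasoning trace. A sketch of one such line (the field values here are invented for illustration):

```python
import json

# One JSONL record in the leaderboard submission format;
# the task_id and answer below are hypothetical examples.
line = json.dumps({
    "task_id": "c61d22de-0000-0000-0000-000000000000",
    "model_answer": "Lake Victoria",
    "reasoning_trace": "user_proxy (to assistant): ...",
})

# Each line round-trips as standalone JSON, which is what makes the file valid JSONL
record = json.loads(line)
print(sorted(record.keys()))  # ['model_answer', 'reasoning_trace', 'task_id']
```

Because every record is one line of standalone JSON, the per-level outputs can be concatenated with `cat` to form the complete submission the NOTE above describes.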
