From 8b11c39bf65644bf94606991ad2b0155f79f17ea Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Tue, 24 Oct 2023 14:31:27 -0700 Subject: [PATCH 01/13] Initial commit of the autogen testbed environment. --- samples/tools/testbed/README.md | 128 +++++++++ samples/tools/testbed/includes/ENV.example | 15 ++ .../tools/testbed/includes/testbed_utils.py | 33 +++ samples/tools/testbed/run_scenarios.py | 250 ++++++++++++++++++ .../scenarios/default_two_agents.jsonl | 6 + .../testbed/scenarios/default_two_agents.py | 28 ++ 6 files changed, 460 insertions(+) create mode 100644 samples/tools/testbed/README.md create mode 100644 samples/tools/testbed/includes/ENV.example create mode 100644 samples/tools/testbed/includes/testbed_utils.py create mode 100644 samples/tools/testbed/run_scenarios.py create mode 100644 samples/tools/testbed/scenarios/default_two_agents.jsonl create mode 100644 samples/tools/testbed/scenarios/default_two_agents.py diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md new file mode 100644 index 000000000000..26365057ac11 --- /dev/null +++ b/samples/tools/testbed/README.md @@ -0,0 +1,128 @@ +# Autogen Testbed Environment + +The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, can be ingested by analysis or metrics scripts. By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety. + +## Setup + +Before you begin, you must configure your API keys for use with the Testbed. These keys extend beyond those typically found in an OAI_CONFIG_LIST, and can include such things as key for the Bing Search API or other services used by the scenarios. There is an example ENV file in ``includes/ENV.example``. To get started: + +``cp includes/ENV.example includes/ENV`` + +Then edit ``includes/ENV`` as needed. + +The Testbed also required installation of the __python docker__ library: + +``pip install docker`` + +## Running the Testbed + +To run the Testbed, simply execute +``python run_scenarios.py`` + +The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options: + +``` +run_scenarios.py will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics. + +positional arguments: + scenario The JSONL scenario file to run. If a directory is specified, + then all JSONL scenarios in the directory are run. (default: + ./scenarios) + +options: + -h, --help show this help message and exit + + -r REPEAT, --repeat REPEAT + The number of repetitions to run for each scenario (default: 10). + + --native Run the scenarios natively rather than in docker. + NOTE: This is not advisable, and should be done with great caution. 
+```
+
+## Results
+
+By default, the Testbed stores results in a folder hierarchy with the following template:
+
+``./results/[scenario]/[instance_id]/[repetition]``
+
+For example, consider the following folders:
+
+``./results/default_two_agents/two_agent_stocks_gpt4/0``
+``./results/default_two_agents/two_agent_stocks_gpt4/1``
+...
+``./results/default_two_agents/two_agent_stocks_gpt4/9``
+
+This folder holds the results for the ``two_agent_stocks_gpt4`` instance of the ``default_two_agents`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on.
+
+Within each folder, you will find the following files:
+
+- *timestamp.txt*: records the date and time of the run, along with the version of the pyautogen library installed
+- *console_log.txt*: all console output produced by Docker when running autogen. Read this like you would a regular console.
+- *chat_completions.json*: a log of all OpenAI ChatCompletions, as logged by ``autogen.ChatCompletion.start_logging(compact=False)``
+- *[agent]_messages.json*: for each Agent, a log of their message dictionaries
+- *./coding*: a directory containing all code written by Autogen, and all artifacts produced by that code.
+
+## Scenario Templating
+
+All scenarios are stored in JSONL files in the ``./scenarios`` directory. Each line of a scenario file is a JSON object with the following schema:
+
+```
+{
+    "id": string,
+    "template": filename,
+    "values": {
+        "field_name1": string,
+        "field_name2": string,
+        ...
+        "field_nameN": string
+    }
+}
+```
+
+For example:
+
+```
+{
+    "id": "two_agent_stocks_gpt4",
+    "template": "default_two_agents.py",
+    "values": {
+        "\__MODEL\__": "gpt-4",
+        "\__PROMPT\__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
+    }
+}
+```
+
+Where ``id`` is the instance id used when saving results, ``template`` points to a python file that contains the scenario logic, and ``values`` contains a set of strings to find and replace when expanding the template.
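+
+Expanding a template is conceptually just a find-and-replace of each key in ``values`` over the template file. A minimal sketch of the expansion logic is shown below (the function and variable names here are illustrative; see ``expand_scenario`` in ``run_scenarios.py`` for the actual implementation):
+
+```
+# Sketch only: expand one scenario instance by substituting each "values" entry into the template.
+def expand_template(template_path, values, output_path):
+    with open(template_path, "rt") as fh_in, open(output_path, "wt") as fh_out:
+        for line in fh_in:
+            for key, replacement in values.items():
+                line = line.replace(key, replacement)
+            fh_out.write(line)
+```
+
+Because the replacement is a plain string substitution, field names such as ``\__MODEL\__`` use distinctive markers that are unlikely to collide with other text in the template.
+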
+ +An example templated python file is: + +``` +from autogen import AssistantAgent, UserProxyAgent, config_list_from_json +import os +import json +import testbed_utils + +testbed_utils.init() +############################## + +config_list = config_list_from_json( + "OAI_CONFIG_LIST", filter_dict={"model": ["\__MODEL\__"]}, +) + +assistant = AssistantAgent("assistant", llm_config={ + "request_timeout": 180, + "config_list": config_list} +) +user_proxy = UserProxyAgent("user_proxy", + human_input_mode="NEVER", + code_execution_config={ + "work_dir": "coding", + "use_docker": False, + }, + max_consecutive_auto_reply=10) +user_proxy.initiate_chat(assistant, message="\__PROMPT\__") + + +############################## +testbed_utils.finalize(assistant, user_proxy) +``` diff --git a/samples/tools/testbed/includes/ENV.example b/samples/tools/testbed/includes/ENV.example new file mode 100644 index 000000000000..13a3563b81c7 --- /dev/null +++ b/samples/tools/testbed/includes/ENV.example @@ -0,0 +1,15 @@ +export BING_API_KEY= +export OAI_CONFIG_LIST=' +[ + { + "model": "gpt-4", + "api_key": "", + "organization": "" + }, + { + "model": "gpt-3.5-turbo-16k", + "api_key": "", + "organization": "" + } +] +' diff --git a/samples/tools/testbed/includes/testbed_utils.py b/samples/tools/testbed/includes/testbed_utils.py new file mode 100644 index 000000000000..896c647dbeb8 --- /dev/null +++ b/samples/tools/testbed/includes/testbed_utils.py @@ -0,0 +1,33 @@ +from importlib.metadata import version as lib_version +from datetime import datetime +import os +import autogen +import json + + +def init(): + autogen.ChatCompletion.start_logging(compact=False) + + # Print some information about the run + with open("timestamp.txt", "wt") as f: + f.write("Timestamp: " + datetime.now().isoformat() + "\n") + f.write("pyautogen version: " + lib_version("pyautogen") + "\n") + + +def finalize(*args): + script_dir = os.path.dirname(os.path.realpath(__file__)) + + with open(os.path.join(script_dir, "chat_completions.json"), "wt") as fh: + fh.write(json.dumps(autogen.ChatCompletion.logged_history, indent=4)) + autogen.ChatCompletion.stop_logging() + + def messages_to_json(agent): + messages = dict() + for item in agent.chat_messages.items(): + messages[item[0].name] = item[1] + return json.dumps(messages, indent=4) + + for arg in args: + fname = arg.name + "_messages.json" + with open(os.path.join(script_dir, fname), "wt") as fh: + fh.write(messages_to_json(arg)) diff --git a/samples/tools/testbed/run_scenarios.py b/samples/tools/testbed/run_scenarios.py new file mode 100644 index 000000000000..78cadbbe70c1 --- /dev/null +++ b/samples/tools/testbed/run_scenarios.py @@ -0,0 +1,250 @@ +import os +import errno +import shutil +import subprocess +import json +import sys +import time +import pathlib +import argparse + +# Location of the environment directory +ENV_DIR = "./includes" + + +def run_scenarios(scenario, n_repeats, is_native, results_dir="results"): + files = [] + + # Figure out which files or folders we are working with + if os.path.isfile(scenario): + files.append(scenario) + elif os.path.isdir(scenario): + for f in os.listdir(scenario): + scenario_file = os.path.join(scenario, f) + + if not os.path.isfile(scenario_file): + continue + + if not scenario_file.lower().endswith(".jsonl"): + continue + + files.append(scenario_file) + else: + raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), scenario) + + # Run all the scenario files + for scenario_file in files: + scenario_name = 
os.path.basename(scenario_file).split(".") + scenario_name.pop() + scenario_name = ".".join(scenario_name) + + scenario_dir = os.path.dirname(os.path.realpath(scenario_file)) + + # Each line in the scenario file is an instance. Run it. + with open(scenario_file) as fh: + for line in fh: + instance = json.loads(line) + + scenario_name + "_" + instance["id"] + + # Create a folder to store the results + + # Results base + if not os.path.isdir(results_dir): + os.mkdir(results_dir) + + # Results for the scenario + results_scenario = os.path.join(results_dir, scenario_name) + if not os.path.isdir(results_scenario): + os.mkdir(results_scenario) + + # Results fot the instance + results_instance = os.path.join(results_scenario, instance["id"]) + if not os.path.isdir(results_instance): + os.mkdir(results_instance) + + # Results for the repeats + for i in range(0, n_repeats): + results_repetition = os.path.join(results_instance, str(i)) + + # Skip it if it already exists + if os.path.isdir(results_repetition): + print(f"Found folder {results_repetition} ... Skipping.") + continue + print(f"Running scenario {results_repetition}") + + # Create the folder, and copy the script to a standard name + os.mkdir(results_repetition) + expand_scenario(scenario_dir, instance, os.path.join(results_repetition, "scenario.py")) + + # Also copy the contents of ENV_DIR + for item in os.listdir(ENV_DIR): + if item.endswith(".example"): + continue + item_path = os.path.join(ENV_DIR, item) + if os.path.isfile(item_path): + shutil.copyfile(item_path, os.path.join(results_repetition, item)) + + # Run the scenario + if is_native: + run_scenario_natively(results_repetition) + else: + run_scenario_in_docker(results_repetition) + + +def expand_scenario(scenario_dir, scenario, output_file): + template_fh = open(os.path.join(scenario_dir, scenario["template"]), "rt") + output_fh = open(output_file, "wt") + + for line in template_fh: + if "values" in scenario: + for k, v in scenario["values"].items(): + line = line.replace(k, v) + output_fh.write(line) + + template_fh.close() + output_fh.close() + + +def run_scenario_natively(work_dir): + # Get the current working directory + cwd = os.getcwd() + + # Navigate to the scenario + os.chdir(work_dir) + print("\n\n" + os.getcwd() + "\n===================================================================") + + # Prepare the run script + with open(os.path.join("run.sh"), "wt") as f: + f.write( + """# +. ./ENV +python scenario.py +rm ENV +""" + ) + + # Run the script and log the output + with open("console_log.txt", "wb") as f: + process = subprocess.Popen(["sh", "run.sh"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT) + for c in iter(lambda: process.stdout.read(1), b""): + f.write(c) + os.write(sys.stdout.fileno(), c) # Write binary to stdout + + # Return where we started + os.chdir(cwd) + return + + +def run_scenario_in_docker(work_dir, timeout=600): + # Create a docker client + client = docker.from_env() + image_name = "python:3.11" + + # Pull a suitable image + try: + image = client.images.get(image_name) + except docker.errors.ImageNotFound: + # pull the image + print("Pulling image", image_name) + try: + image = client.images.pull(image_name) + except docker.errors.DockerException: + print("Failed to pull image", image_name) + + # Prepare the run script + with open(os.path.join(work_dir, "run.sh"), "wt") as f: + f.write( + """# +. 
./ENV +pip install pyautogen +python scenario.py +rm ENV +""" + ) + + print("\n\n" + work_dir + "\n===================================================================") + + # Create and run the container + abs_path = str(pathlib.Path(work_dir).absolute()) + container = client.containers.run( + image, + command=["sh", "run.sh"], + working_dir="/workspace", + detach=True, + # get absolute path to the working directory + volumes={abs_path: {"bind": "/workspace", "mode": "rw"}}, + ) + + # Poll until the container is done, or we've timed out + start_time = time.time() + while container.status != "exited" and time.time() - start_time < timeout: + # Reload the container object + container.reload() + + if container.status != "exited": + container.stop() + + logs = container.logs().decode("utf-8").rstrip() + "\nDocker timed out." + print(logs) + with open(os.path.join(work_dir, "console_log.txt"), "wt") as f: + f.write(logs) + + container.remove() + return + + # get the container logs + logs = container.logs().decode("utf-8").rstrip() + container.remove() + + print(logs) + with open(os.path.join(work_dir, "console_log.txt"), "wt") as f: + f.write(logs) + + +############################################################################### +if __name__ == "__main__": + script_name = os.path.basename(__file__) + parser = argparse.ArgumentParser( + description=f"{script_name} will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.".strip() + ) + + parser.add_argument( + "scenario", + nargs="?", + help="The JSONL scenario file to run. If a directory is specified, then all JSONL scenarios in the directory are run. (default: ./scenarios)", + default="scenarios", + ) + parser.add_argument( + "-r", "--repeat", type=int, help="The number of repetitions to run for each scenario (default: 10).", default=10 + ) + parser.add_argument( + "--native", + action="store_true", + help="Run the scenarios natively rather than in docker. NOTE: This is not advisable, and should be done with great caution.", + ) + + args = parser.parse_args() + + if args.native: + choice = input( + 'WARNING: Running natively, without Docker, not only poses the usual risks of executing arbitrary AI generated code on your machine, it also makes it impossible to ensure that each test starts from a known and consistent set of initial conditions. For example, if the agents spend time debugging and installing Python libraries to solve the task, then those libraries will be available to all other runs. In other words, earlier runs can influence later runs, leading to many confounds in testing.\n\nAre you absolutely sure you want to continue with native execution? Type "Yes" exactly, and in full, to proceed: ' + ) + + if choice.strip().lower() != "yes": + print("Received '" + choice + "'. Exiting.") + + # Import docker if needed + is_native = True if args.native else False + if not is_native: + import docker + + # Warn aboit a common error + env_file = os.path.join(ENV_DIR, "ENV") + example_file = os.path.join(ENV_DIR, "ENV.example") + if not os.path.isfile(env_file): + sys.exit( + f"The environment file '{env_file}' does not exist. 
If this is your first time setting up the testbed, you will want to rename '{example_file}' to '{env_file}' and edit it to include your API keys and configurations." + ) + + run_scenarios(args.scenario, args.repeat, is_native) diff --git a/samples/tools/testbed/scenarios/default_two_agents.jsonl b/samples/tools/testbed/scenarios/default_two_agents.jsonl new file mode 100644 index 000000000000..9826ac602ea8 --- /dev/null +++ b/samples/tools/testbed/scenarios/default_two_agents.jsonl @@ -0,0 +1,6 @@ +{ "id": "two_agent_stocks_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD." } } +{ "id": "two_agent_stocks_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD." } } +{ "id": "two_agent_arxiv_search_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Find 10 papers on explainable or interpretable AI that were submitted to arXiv within the last year. When printing results, include paper titles, authors, dates, and URLs, but not their abstracts." } } +{ "id": "two_agent_arxiv_search_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Find 10 papers on explainable or interpretable AI that were submitted to arXiv within the last year. When printing results, include paper titles, authors, dates, and URLs, but not their abstracts." } } +{ "id": "two_agent_logo_search_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Find the logo of the Autogen python library, and save it to disk. If searching the web, use Bing with API key stored in os.environ['BING_API_KEY']" } } +{ "id": "two_agent_logo_search_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Find the logo of the Autogen python library, and save it to disk. 
If searching the web, use Bing with the API key stored in os.environ['BING_API_KEY']" } } diff --git a/samples/tools/testbed/scenarios/default_two_agents.py b/samples/tools/testbed/scenarios/default_two_agents.py new file mode 100644 index 000000000000..2e1ec55e28cc --- /dev/null +++ b/samples/tools/testbed/scenarios/default_two_agents.py @@ -0,0 +1,28 @@ +from autogen import AssistantAgent, UserProxyAgent, config_list_from_json +import os +import json +import testbed_utils + +testbed_utils.init() +############################## + +config_list = config_list_from_json( + "OAI_CONFIG_LIST", + filter_dict={"model": ["__MODEL__"]}, +) + +assistant = AssistantAgent("assistant", llm_config={"request_timeout": 180, "config_list": config_list}) +user_proxy = UserProxyAgent( + "user_proxy", + human_input_mode="NEVER", + code_execution_config={ + "work_dir": "coding", + "use_docker": False, + }, + max_consecutive_auto_reply=10, +) +user_proxy.initiate_chat(assistant, message="__PROMPT__") + + +############################## +testbed_utils.finalize(assistant, user_proxy) From ff3c027549973eff86097bfd7cd4fd3073185ebc Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Tue, 24 Oct 2023 14:42:03 -0700 Subject: [PATCH 02/13] Fixed some typos in the Testbed README.md --- samples/tools/testbed/README.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md index 26365057ac11..e2f8cdc8e14c 100644 --- a/samples/tools/testbed/README.md +++ b/samples/tools/testbed/README.md @@ -1,16 +1,16 @@ # Autogen Testbed Environment -The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, can be ingested by analysis or metrics scripts. By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety. +The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts. By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety. ## Setup -Before you begin, you must configure your API keys for use with the Testbed. These keys extend beyond those typically found in an OAI_CONFIG_LIST, and can include such things as key for the Bing Search API or other services used by the scenarios. There is an example ENV file in ``includes/ENV.example``. To get started: +Before you begin, you must configure your API keys for use with the Testbed. These keys extend beyond those typically found in an OAI_CONFIG_LIST, and can include such things as keys for the Bing Search API or other services used by the scenarios. There is an example ENV file in ``includes/ENV.example``. To get started: ``cp includes/ENV.example includes/ENV`` Then edit ``includes/ENV`` as needed. 
-The Testbed also required installation of the __python docker__ library: +The Testbed also requires installation of the __python docker__ library: ``pip install docker`` @@ -49,10 +49,12 @@ For example, consider the following folders: ``./results/default_two_agents/two_agent_stocks_gpt4/0`` ``./results/default_two_agents/two_agent_stocks_gpt4/1`` + ... + ``./results/default_two_agents/two_agent_stocks_gpt4/9`` -This folder holds the results for the ``two_agent_stocks_gpt4`` instance of the ``default_two_agents`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on. +This folder holds the results for the ``two_agent_stocks_gpt4`` instance of the ``default_two_agents`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on. You can think of the _instance_ as mapping to a prompt, or a unique set of parameters, while the _scenario_ defines the template in which those parameters are input. Within each folder, you will find the following files: From df27d75ca6774b2f6eed592bcbd1570e86003d0c Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Tue, 24 Oct 2023 15:53:20 -0700 Subject: [PATCH 03/13] Added some stricter termination logic to the two_agent scenario, and swiched the logo task from finding Autogen's logo, to finding Microsoft's (it's easier) --- samples/tools/testbed/scenarios/default_two_agents.jsonl | 4 ++-- samples/tools/testbed/scenarios/default_two_agents.py | 8 +++++++- 2 files changed, 9 insertions(+), 3 deletions(-) diff --git a/samples/tools/testbed/scenarios/default_two_agents.jsonl b/samples/tools/testbed/scenarios/default_two_agents.jsonl index 9826ac602ea8..4da04167fa62 100644 --- a/samples/tools/testbed/scenarios/default_two_agents.jsonl +++ b/samples/tools/testbed/scenarios/default_two_agents.jsonl @@ -2,5 +2,5 @@ { "id": "two_agent_stocks_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD." } } { "id": "two_agent_arxiv_search_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Find 10 papers on explainable or interpretable AI that were submitted to arXiv within the last year. When printing results, include paper titles, authors, dates, and URLs, but not their abstracts." } } { "id": "two_agent_arxiv_search_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Find 10 papers on explainable or interpretable AI that were submitted to arXiv within the last year. When printing results, include paper titles, authors, dates, and URLs, but not their abstracts." } } -{ "id": "two_agent_logo_search_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Find the logo of the Autogen python library, and save it to disk. If searching the web, use Bing with API key stored in os.environ['BING_API_KEY']" } } -{ "id": "two_agent_logo_search_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Find the logo of the Autogen python library, and save it to disk. If searching the web, use Bing with the API key stored in os.environ['BING_API_KEY']" } } +{ "id": "two_agent_mslogo_search_gpt4", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-4", "__PROMPT__": "Find Microsoft's logo from 1983, and save it to disk. 
If searching the web, use Bing with API key stored in os.environ['BING_API_KEY']" } } +{ "id": "two_agent_mslogo_search_gpt35", "template": "default_two_agents.py", "values": { "__MODEL__": "gpt-3.5-turbo-16k", "__PROMPT__": "Find Microsoft's logo from 1983, and save it to disk. If searching the web, use Bing with the API key stored in os.environ['BING_API_KEY']" } } diff --git a/samples/tools/testbed/scenarios/default_two_agents.py b/samples/tools/testbed/scenarios/default_two_agents.py index 2e1ec55e28cc..bd3fc4ec466e 100644 --- a/samples/tools/testbed/scenarios/default_two_agents.py +++ b/samples/tools/testbed/scenarios/default_two_agents.py @@ -11,15 +11,21 @@ filter_dict={"model": ["__MODEL__"]}, ) -assistant = AssistantAgent("assistant", llm_config={"request_timeout": 180, "config_list": config_list}) +assistant = AssistantAgent( + "assistant", + is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0, + llm_config={"request_timeout": 180, "config_list": config_list}, +) user_proxy = UserProxyAgent( "user_proxy", human_input_mode="NEVER", + is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0, code_execution_config={ "work_dir": "coding", "use_docker": False, }, max_consecutive_auto_reply=10, + default_auto_reply="TERMINATE", ) user_proxy.initiate_chat(assistant, message="__PROMPT__") From 728d83104e3fdc73b4b6c709cea53a2d9e7dcf06 Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Fri, 27 Oct 2023 10:30:17 -0700 Subject: [PATCH 04/13] Added documentation to testbed code in preparation for PR --- samples/tools/testbed/README.md | 4 +++ .../tools/testbed/includes/testbed_utils.py | 31 ++++++++++++++++--- samples/tools/testbed/run_scenarios.py | 26 ++++++++++++++++ .../testbed/scenarios/default_two_agents.py | 7 +++-- 4 files changed, 62 insertions(+), 6 deletions(-) diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md index e2f8cdc8e14c..8d1a8f96a805 100644 --- a/samples/tools/testbed/README.md +++ b/samples/tools/testbed/README.md @@ -19,6 +19,10 @@ The Testbed also requires installation of the __python docker__ library: To run the Testbed, simply execute ``python run_scenarios.py`` +The default it to repeat this scenario 10 times. This can be costly. To run each scenario only once, use: +``python run_scenarios.py --repeat 1`` + + The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options: ``` diff --git a/samples/tools/testbed/includes/testbed_utils.py b/samples/tools/testbed/includes/testbed_utils.py index 896c647dbeb8..288151ab3803 100644 --- a/samples/tools/testbed/includes/testbed_utils.py +++ b/samples/tools/testbed/includes/testbed_utils.py @@ -6,6 +6,17 @@ def init(): + """Helper function to initialize logging in a testbed scenario. + Specifically, write timestamp and version information, then + initialize autogen logging. + + Args: + None + + Returns: + None + """ + autogen.ChatCompletion.start_logging(compact=False) # Print some information about the run @@ -14,7 +25,19 @@ def init(): f.write("pyautogen version: " + lib_version("pyautogen") + "\n") -def finalize(*args): +def finalize(agents): + """Helper function to finalize logging in a testbed scenario. + Calling this function will save all the chat completions logged + by Autogen to disk, and will save the messages dictionaries of + all agents passed via the agents argument. 
+ + Args: + agents (list): a list of the agents whose messages will be logged to disk. + + Returns: + None + """ + script_dir = os.path.dirname(os.path.realpath(__file__)) with open(os.path.join(script_dir, "chat_completions.json"), "wt") as fh: @@ -27,7 +50,7 @@ def messages_to_json(agent): messages[item[0].name] = item[1] return json.dumps(messages, indent=4) - for arg in args: - fname = arg.name + "_messages.json" + for agent in agents: + fname = agent.name + "_messages.json" with open(os.path.join(script_dir, fname), "wt") as fh: - fh.write(messages_to_json(arg)) + fh.write(messages_to_json(agent)) diff --git a/samples/tools/testbed/run_scenarios.py b/samples/tools/testbed/run_scenarios.py index 78cadbbe70c1..42fe2f4aa06d 100644 --- a/samples/tools/testbed/run_scenarios.py +++ b/samples/tools/testbed/run_scenarios.py @@ -13,6 +13,17 @@ def run_scenarios(scenario, n_repeats, is_native, results_dir="results"): + """ + Run a set testbed scenarios a given number of times. + + Args: + scenario (path): The file or folder containing the scenario JSONL instances. If given a folder, then + all JSONL files in the folder will be loaded and run. + n_repeats (int): The number of times each scenario instance will be repeated + is_native (bool): True if the scenario should be run locally rather than in Docker (proceed with caution!) + results_dir (path): The folder were results will be saved. + """ + files = [] # Figure out which files or folders we are working with @@ -107,6 +118,13 @@ def expand_scenario(scenario_dir, scenario, output_file): def run_scenario_natively(work_dir): + """ + Run a scenario in the native environment. + + Args: + work_dir (path): the path to the working directory previously created to house this sceario instance + """ + # Get the current working directory cwd = os.getcwd() @@ -137,6 +155,14 @@ def run_scenario_natively(work_dir): def run_scenario_in_docker(work_dir, timeout=600): + """ + Run a scenario in a Docker environment. + + Args: + work_dir (path): the path to the working directory previously created to house this sceario instance + timeout (Optional, int): the number of seconds to allow a Docker container to run before timing out + """ + # Create a docker client client = docker.from_env() image_name = "python:3.11" diff --git a/samples/tools/testbed/scenarios/default_two_agents.py b/samples/tools/testbed/scenarios/default_two_agents.py index bd3fc4ec466e..ed0c49547c0a 100644 --- a/samples/tools/testbed/scenarios/default_two_agents.py +++ b/samples/tools/testbed/scenarios/default_two_agents.py @@ -14,7 +14,10 @@ assistant = AssistantAgent( "assistant", is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0, - llm_config={"request_timeout": 180, "config_list": config_list}, + llm_config={ + "request_timeout": 180, # Remove for autogen version >= 0.2, and OpenAI version >= 1.0 + "config_list": config_list, + }, ) user_proxy = UserProxyAgent( "user_proxy", @@ -31,4 +34,4 @@ ############################## -testbed_utils.finalize(assistant, user_proxy) +testbed_utils.finalize(agents=[assistant, user_proxy]) From d0ce31b7bbccca93bdab9e0fc74e953bb031271d Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Thu, 2 Nov 2023 00:08:44 -0700 Subject: [PATCH 05/13] Added a variation of HumanEval to the Testbed. It is also a reasonable example of how to integrate other benchmarks. 
--- samples/tools/testbed/README.md | 13 +++ .../scenarios/human_eval_two_agents.py | 91 +++++++++++++++++++ .../tools/testbed/utils/collate_human_eval.py | 91 +++++++++++++++++++ .../tools/testbed/utils/download_humaneval.py | 67 ++++++++++++++ 4 files changed, 262 insertions(+) create mode 100644 samples/tools/testbed/scenarios/human_eval_two_agents.py create mode 100644 samples/tools/testbed/utils/collate_human_eval.py create mode 100644 samples/tools/testbed/utils/download_humaneval.py diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md index 8d1a8f96a805..46ae72f66e5e 100644 --- a/samples/tools/testbed/README.md +++ b/samples/tools/testbed/README.md @@ -132,3 +132,16 @@ user_proxy.initiate_chat(assistant, message="\__PROMPT\__") ############################## testbed_utils.finalize(assistant, user_proxy) ``` + + +## Running HumanEval + +One sample Testbed scenario type is a variation of the classic [HumanEval](https://github.com/openai/human-eval) benchmark. In this scenario, agents are given access to the unit test results, and are able to continue to debug their code until the problem is solved or they run out of tokens or turns. We can then count how many turns it took to solve the problem (returning -1 if the problem remains unsolved by the end of the conversation). + +Accessing this scenario-type requires downloading and converting the datasets, running the Testbed, and finally collating the results. The following commands will accomplish this, running each test instance 3 times with GPT-3.5-Turbo-16k: + +``` +python utils/download_humaneval.py +python ./run_scenarios.py --repeat 3 scenarios/human_eval_two_agents_gpt35.jsonl +python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 > human_eval_two_agents_gpt35_results.csv && cat human_eval_two_agents_gpt35_results.csv +``` diff --git a/samples/tools/testbed/scenarios/human_eval_two_agents.py b/samples/tools/testbed/scenarios/human_eval_two_agents.py new file mode 100644 index 000000000000..9516add49c5d --- /dev/null +++ b/samples/tools/testbed/scenarios/human_eval_two_agents.py @@ -0,0 +1,91 @@ +from autogen import AssistantAgent, UserProxyAgent, config_list_from_json +import os +import json +import base64 +import testbed_utils + +# NOTE: +# This scenario runs Human Eval in a slightly unconventional way: +# The agents have access to the unit tests, and can keep trying +# until they pass. + +testbed_utils.init() +############################## + +work_dir = "coding" + +# These come formatted as Base64 to avoid conflicting with the triple-quotes +TESTS = base64.b64decode("__TEST_BASE64__").decode("utf-8") +PROMPT = base64.b64decode("__PROMPT_BASE64__").decode("utf-8") + +# Write the tests to a file so that the agents can access them +if not os.path.isdir(work_dir): + os.mkdir(work_dir) +with open(os.path.join(work_dir, "my_tests.py"), "wt") as fh: + fh.write( + TESTS + + """ + + +def run_tests(candidate): + check(candidate) + # We can search for this string in the output + print("ALL TESTS PASSED !#!#") +""" + ) + + +# Ok, now get autogen to solve it. 
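+# The agents below execute code in `work_dir`, the same folder where my_tests.py
+# was written above, so the generated code can `from my_tests import run_tests`.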
+config_list = config_list_from_json(
+    "OAI_CONFIG_LIST",
+    filter_dict={"model": ["__MODEL__"]},
+)
+
+assistant = AssistantAgent(
+    "assistant",
+    is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
+    llm_config={
+        "request_timeout": 180,  # Remove for autogen version >= 0.2, and OpenAI version >= 1.0
+        "config_list": config_list,
+    },
+)
+user_proxy = UserProxyAgent(
+    "user_proxy",
+    human_input_mode="NEVER",
+    is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
+    code_execution_config={
+        "work_dir": work_dir,
+        "use_docker": False,
+    },
+    max_consecutive_auto_reply=10,
+    default_auto_reply="TERMINATE",
+)
+user_proxy.initiate_chat(
+    assistant,
+    message="""
+The following Python code imports the `run_tests(candidate)` function from my_tests.py, and runs
+it on the function `__ENTRY_POINT__`. This will run a set of automated unit tests to verify the
+correct implementation of `__ENTRY_POINT__`. However, `__ENTRY_POINT__` is only partially
+implemented in the code below. Complete the implementation of `__ENTRY_POINT__` and output
+a new stand-alone code block that contains everything needed to run the tests, including: importing
+`my_tests`, calling `run_tests(__ENTRY_POINT__)`, as well as __ENTRY_POINT__'s complete definition,
+such that this code block can be run directly in Python.
+
+```python
+from my_tests import run_tests
+
+
+"""
+    + PROMPT
+    + """
+
+# Run the unit tests
+run_tests(__ENTRY_POINT__)
+```
+""",
+)
+
+
+##############################
+testbed_utils.finalize(agents=[assistant, user_proxy])
diff --git a/samples/tools/testbed/utils/collate_human_eval.py b/samples/tools/testbed/utils/collate_human_eval.py
new file mode 100644
index 000000000000..8fe0eb05f1c8
--- /dev/null
+++ b/samples/tools/testbed/utils/collate_human_eval.py
@@ -0,0 +1,91 @@
+import os
+import errno
+import shutil
+import subprocess
+import json
+import sys
+import time
+import pathlib
+import argparse
+
+
+def collate(results_dir):
+    """
+    Collate the results of running human eval.
+
+    Args:
+        results_dir (path): The folder where the results were saved.
+ """ + + all_results = list() + max_instances = 0 + + for test_id in os.listdir(results_dir): + test_path = os.path.join(results_dir, test_id) + + # Collect the reslts vector + results = [test_id] + + instance = 0 + instance_dir = os.path.join(test_path, str(instance)) + while os.path.isdir(instance_dir): + console_log = os.path.join(instance_dir, "console_log.txt") + if os.path.isfile(console_log): + with open(console_log, "rt") as fh: + content = fh.read() + if "ALL TESTS PASSED !#!#" in content: + results.append( + str(content.count("assistant (to user_proxy):")) + ) # The number of assistant replies (which is also equal to the number of GPT calls in this case) + else: + results.append("-1") + + else: + # Missing results will appear as blanks + results.append("") + + instance += 1 + instance_dir = os.path.join(test_path, str(instance)) + + max_instances = max(max_instances, instance) + + # Buffer the results + all_results.append(results) + + # Create a header + header = "TestId" + for i in range(0, max_instances): + header += ",Trial" + str(i) + print(header) + + # Print a fully-populated table of results + for r in all_results: + while len(r) < max_instances + 1: + r.append("") + print(",".join(r)) + + +############################################################################### +if __name__ == "__main__": + script_path = os.path.realpath(__file__) + script_name = os.path.basename(script_path) + script_dir = os.path.dirname(script_path) + + # Path to the default results directory + # (relative to this script, up on directory, then into the results folder) + default_results_dir = os.path.realpath( + os.path.join(script_dir, os.path.pardir, "results", "human_eval_two_agents_gpt4") + ) + + parser = argparse.ArgumentParser( + description=f"{script_name} will collate the results of a the HumanEval scenarios, and output them to a CSV".strip() + ) + + parser.add_argument( + "scenario", + nargs="?", + help="Path to the scenario results. 
(default: " + default_results_dir + ")", + default=default_results_dir, + ) + args = parser.parse_args() + collate(args.scenario) diff --git a/samples/tools/testbed/utils/download_humaneval.py b/samples/tools/testbed/utils/download_humaneval.py new file mode 100644 index 000000000000..faf6c3c3b553 --- /dev/null +++ b/samples/tools/testbed/utils/download_humaneval.py @@ -0,0 +1,67 @@ +# +# Run this file to download the human_eval dataset, and create a corresponding testbed scenario: +# (default: ../scenarios/human_eval_two_agents_gpt4.jsonl and ./scenarios/human_eval_two_agents_gpt35.jsonl) +# + +import requests +import gzip +import io +import json +import os +import base64 + + +script_path = os.path.realpath(__file__) +script_name = os.path.basename(script_path) +script_dir = os.path.dirname(script_path) + +# Directory where scenarios are stored +scenarios_dir = os.path.realpath(os.path.join(script_dir, os.path.pardir, "scenarios")) +print("Saving HumanEval scenarios to: " + scenarios_dir) + + +# URL of the file to download +url = "https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz" + +# Send a HTTP request to the URL of the file +response = requests.get(url) + +# Ensure we raise an error if the download failed +response.raise_for_status() + +# Create a BytesIO object from the response content +buffer = io.BytesIO(response.content) + +# Create a scenario file +fh_gpt4 = open(os.path.join(scenarios_dir, "human_eval_two_agents_gpt4.jsonl"), "wt") +fh_gpt35 = open(os.path.join(scenarios_dir, "human_eval_two_agents_gpt35.jsonl"), "wt") + +# Open the buffer as a .gz file and read it line by line +with gzip.GzipFile(fileobj=buffer) as f_in: + for line in f_in: + # Parse each line as JSON + data = json.loads(line) + print("Converting: " + data["task_id"]) + + # Write the GPT-4 scenario + # Prompts and tests are saved in base 64 to greatly simplify escaping them as they + # move through the various formats and scripts. I welcome a better, more readable, alternative. + record = { + "id": data["task_id"].replace("/", "_"), + "template": "human_eval_two_agents.py", + "values": { + "__MODEL__": "gpt-4", + "__PROMPT_BASE64__": base64.b64encode(data["prompt"].encode("utf-8")).decode("utf-8"), + "__ENTRY_POINT__": data["entry_point"], + "__TEST_BASE64__": base64.b64encode(data["test"].encode("utf-8")).decode("utf-8"), + }, + } + fh_gpt4.write(json.dumps(record).strip() + "\n") + + # Write the GPT 3.5 Version + record["values"]["__MODEL__"] = "gpt-3.5-turbo-16k" + fh_gpt35.write(json.dumps(record).strip() + "\n") + + +fh_gpt4.close() +fh_gpt35.close() From 8b2ea2819cbc1acf216535c944b99de2607c0626 Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Thu, 2 Nov 2023 07:19:08 -0700 Subject: [PATCH 06/13] Removed ChatCompletion.start_logging and related features. Added an explicit TERMINATE output to HumanEval to save 1 turn in each conversation. 
--- samples/tools/testbed/includes/testbed_utils.py | 6 ------ samples/tools/testbed/scenarios/human_eval_two_agents.py | 2 +- 2 files changed, 1 insertion(+), 7 deletions(-) diff --git a/samples/tools/testbed/includes/testbed_utils.py b/samples/tools/testbed/includes/testbed_utils.py index 288151ab3803..6818f96e9624 100644 --- a/samples/tools/testbed/includes/testbed_utils.py +++ b/samples/tools/testbed/includes/testbed_utils.py @@ -17,8 +17,6 @@ def init(): None """ - autogen.ChatCompletion.start_logging(compact=False) - # Print some information about the run with open("timestamp.txt", "wt") as f: f.write("Timestamp: " + datetime.now().isoformat() + "\n") @@ -40,10 +38,6 @@ def finalize(agents): script_dir = os.path.dirname(os.path.realpath(__file__)) - with open(os.path.join(script_dir, "chat_completions.json"), "wt") as fh: - fh.write(json.dumps(autogen.ChatCompletion.logged_history, indent=4)) - autogen.ChatCompletion.stop_logging() - def messages_to_json(agent): messages = dict() for item in agent.chat_messages.items(): diff --git a/samples/tools/testbed/scenarios/human_eval_two_agents.py b/samples/tools/testbed/scenarios/human_eval_two_agents.py index 9516add49c5d..305831d55732 100644 --- a/samples/tools/testbed/scenarios/human_eval_two_agents.py +++ b/samples/tools/testbed/scenarios/human_eval_two_agents.py @@ -30,7 +30,7 @@ def run_tests(candidate): check(candidate) # We can search for this string in the output - print("ALL TESTS PASSED !#!#") + print("ALL TESTS PASSED !#!#\\nTERMINATE") """ ) From b0b50eafed09aebf0532f414ddc02a19cf70acaa Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Thu, 2 Nov 2023 10:15:54 -0700 Subject: [PATCH 07/13] Added metrics utils script for HumanEval --- samples/tools/testbed/README.md | 11 ++- .../tools/testbed/utils/metrics_human_eval.py | 98 +++++++++++++++++++ 2 files changed, 104 insertions(+), 5 deletions(-) create mode 100644 samples/tools/testbed/utils/metrics_human_eval.py diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md index 46ae72f66e5e..e2191bebb4bb 100644 --- a/samples/tools/testbed/README.md +++ b/samples/tools/testbed/README.md @@ -1,6 +1,6 @@ # Autogen Testbed Environment -The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts. By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety. +The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts (see the HumanEval example later in this README). By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety. 
## Setup @@ -134,14 +134,15 @@ testbed_utils.finalize(assistant, user_proxy) ``` -## Running HumanEval +## (Example) Running HumanEval -One sample Testbed scenario type is a variation of the classic [HumanEval](https://github.com/openai/human-eval) benchmark. In this scenario, agents are given access to the unit test results, and are able to continue to debug their code until the problem is solved or they run out of tokens or turns. We can then count how many turns it took to solve the problem (returning -1 if the problem remains unsolved by the end of the conversation). +One sample Testbed scenario type is a variation of the classic [HumanEval](https://github.com/openai/human-eval) benchmark. In this scenario, agents are given access to the unit test results, and are able to continue to debug their code until the problem is solved or they run out of tokens or turns. We can then count how many turns it took to solve the problem (returning -1 if the problem remains unsolved by the end of the conversation, and "" if the run is missing). -Accessing this scenario-type requires downloading and converting the datasets, running the Testbed, and finally collating the results. The following commands will accomplish this, running each test instance 3 times with GPT-3.5-Turbo-16k: +Accessing this scenario-type requires downloading and converting the HumanEval dataset, running the Testbed, collating the results, and finally computing the metrics. The following commands will accomplish this, running each test instance 3 times with GPT-3.5-Turbo-16k: ``` python utils/download_humaneval.py python ./run_scenarios.py --repeat 3 scenarios/human_eval_two_agents_gpt35.jsonl -python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 > human_eval_two_agents_gpt35_results.csv && cat human_eval_two_agents_gpt35_results.csv +python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 | python utils/metrics_human_eval.py > human_eval_results_gpt35.csv +cat human_eval_results_gpt35.csv ``` diff --git a/samples/tools/testbed/utils/metrics_human_eval.py b/samples/tools/testbed/utils/metrics_human_eval.py new file mode 100644 index 000000000000..6b5d833e2ba5 --- /dev/null +++ b/samples/tools/testbed/utils/metrics_human_eval.py @@ -0,0 +1,98 @@ +import os +import sys +import argparse +import csv + + +def metrics(results_fh): + """ + Compute metrics from collated HumanEval results. + + Args: + results_fh (File Stream): A file stream containing the collated results in CSV. + """ + + reader = csv.reader(results_fh) + first_row = next(reader) # Read the first line + + num_trials = len(first_row) - 1 # Don't count the first column (TestId) + max_turns = 0 + num_rows = 0 + + # Load the results. We'll need to iterate over them a few times. 
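+    # Each data row is: TestId, Trial0, ..., TrialN, where each trial value is the number
+    # of assistant turns needed to pass the tests, -1 if the tests never passed, or blank
+    # if the run is missing (see collate_human_eval.py for how this CSV is produced).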
+ results = list() + for row in reader: + num_rows += 1 + + name = row[0] + trials = [(None if v.strip() == "" else int(v)) for v in row[1:]] + for v in trials: + if v is not None: + max_turns = max(max_turns, v) + results.append([name, trials]) + + # Print the header + header = ["Trial"] + for i in range(1, max_turns + 1): + header.append("cumulative_passes_by_turn_" + str(i)) + header.append("fails") + header.append("missing") + print(",".join(header)) + + # Compute the metrics + def _metrics_for_trial(t): + counts = [None] + fails = 0 + missing = 0 + + # Compute cumulative passes for each conversation turn + for i in range(1, max_turns + 1): + counts.append(0) + assert len(counts) == i + 1 + + for r in results: + v = r[1][t] + if v is not None: + v = int(v) + if 0 <= v and v <= i: + counts[i] += 1 + + # Count missing and failed + for r in results: + v = r[1][t] + if v is None: + missing += 1 + elif int(v) < 0: + fails += 1 + + # Prepare the row in the format specified by the header + return str(t) + "," + ",".join([str(v) for v in counts[1:]]) + "," + str(fails) + "," + str(missing) + + # Print each row + for t in range(0, num_trials): + print(_metrics_for_trial(t)) + + +############################################################################### +if __name__ == "__main__": + script_path = os.path.realpath(__file__) + script_name = os.path.basename(script_path) + script_dir = os.path.dirname(script_path) + + parser = argparse.ArgumentParser( + description=f"{script_name} will compute metrics on the collated results of the HumanEval scenarios. Use collate_human_eval.py to prepare input to this script.".strip() + ) + + parser.add_argument( + "scenario", + nargs="?", + help="Path to collated results. If '-' or omitted, read from stdin. (default: '-')", + default="-", + ) + args = parser.parse_args() + + if args.scenario == "" or args.scenario == "-": + metrics(sys.stdin) + else: + with open(args.scenario, "rt") as fh: + metrics(fh) From c5b9b1c237fa17609b3165a2ea7c9bd2c05e6a94 Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Thu, 2 Nov 2023 10:23:34 -0700 Subject: [PATCH 08/13] Updated the requirements in the README. --- samples/tools/testbed/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md index e2191bebb4bb..68d9759ac574 100644 --- a/samples/tools/testbed/README.md +++ b/samples/tools/testbed/README.md @@ -10,7 +10,7 @@ Before you begin, you must configure your API keys for use with the Testbed. The Then edit ``includes/ENV`` as needed. -The Testbed also requires installation of the __python docker__ library: +The Testbed also requires Docker (Desktop or Engine) AND the __python docker__ library. **It will not run in codespaces**, unless you opt for native execution (with is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/). 
To install the Python library: ``pip install docker`` From a6d8e4d3c223e0a796e3f500dc182f4e5a90ab61 Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Thu, 2 Nov 2023 23:34:59 -0700 Subject: [PATCH 09/13] Added documentation for HumanEval csv schemas --- .../tools/testbed/utils/collate_human_eval.py | 14 ++++++++++++- .../tools/testbed/utils/metrics_human_eval.py | 20 ++++++++++++++++++- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/samples/tools/testbed/utils/collate_human_eval.py b/samples/tools/testbed/utils/collate_human_eval.py index 8fe0eb05f1c8..ed83bb22bbfd 100644 --- a/samples/tools/testbed/utils/collate_human_eval.py +++ b/samples/tools/testbed/utils/collate_human_eval.py @@ -78,7 +78,19 @@ def collate(results_dir): ) parser = argparse.ArgumentParser( - description=f"{script_name} will collate the results of a the HumanEval scenarios, and output them to a CSV".strip() + description=f""" +{script_name} will collate the results of the HumanEval scenarios and output them to a CSV. The CSV format is as follows: + +TestId, Trial0, Trial1, ..., TrialN +HumanEval_1, x_10, x_11, ..., X_1N +HumanEval_2, x_20, x_21, ..., X_2N +... +HumanEval_M, x_M0, x_M1, ..., X_MN + + +Where x_ij is the number of AsssitantAgent conversation turns needed to pass all the tests for problem i, in Trial/repetition j. If the agent was not able to pass the tests by the end of the conversation, the value will be -1. If data for the trial is missing, the value will be an empty string "". +""".strip(), + formatter_class=argparse.RawTextHelpFormatter, ) parser.add_argument( diff --git a/samples/tools/testbed/utils/metrics_human_eval.py b/samples/tools/testbed/utils/metrics_human_eval.py index 6b5d833e2ba5..25d9aa90fda2 100644 --- a/samples/tools/testbed/utils/metrics_human_eval.py +++ b/samples/tools/testbed/utils/metrics_human_eval.py @@ -80,7 +80,25 @@ def _metrics_for_trial(t): script_dir = os.path.dirname(script_path) parser = argparse.ArgumentParser( - description=f"{script_name} will compute metrics on the collated results of the HumanEval scenarios. Use collate_human_eval.py to prepare input to this script.".strip() + description=f""" +{script_name} will compute metrics on the collated results of the HumanEval scenarios. Use collate_human_eval.py to prepare input to this script. + +The output will be formatted as a CSV with the following schema: + +Trial, cumulative_passes_by_turn_1, ..., cumulative_passes_by_turn_N, fails, missing +0 x_01, x_0N, y_0, z_0 +1 x_11, x_1N, y_1, z_1 +... +M x_M1, x_MN, y_M, z_M + +Where: + + x_ij is the number of HumanEval problems in Trial i that achieved a passing result by conversation turn j. + y_i is the number of HumanEval problems in Trial i that never achieved a passing result (they failed). + z_i is the number of HumanEval problems in Trial i that have missing data. + +""".strip(), + formatter_class=argparse.RawTextHelpFormatter, ) parser.add_argument( From 3dffbd9550746fe580e28dd5c48866f2d68bcbe1 Mon Sep 17 00:00:00 2001 From: Adam Fourney Date: Fri, 3 Nov 2023 11:00:08 -0700 Subject: [PATCH 10/13] Standardized on how the OAI_CONFIG_LIST is handled. 
--- samples/tools/testbed/README.md | 9 +++-- samples/tools/testbed/includes/ENV.example | 14 ------- samples/tools/testbed/run_scenarios.py | 44 ++++++++++++++++------ 3 files changed, 38 insertions(+), 29 deletions(-) diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md index 68d9759ac574..01ab1f32aeed 100644 --- a/samples/tools/testbed/README.md +++ b/samples/tools/testbed/README.md @@ -4,11 +4,9 @@ The Autogen Testbed environment is a tool for repeatedly running a set of pre-de ## Setup -Before you begin, you must configure your API keys for use with the Testbed. These keys extend beyond those typically found in an OAI_CONFIG_LIST, and can include such things as keys for the Bing Search API or other services used by the scenarios. There is an example ENV file in ``includes/ENV.example``. To get started: +Before you begin, you must configure your API keys for use with the Testbed. As with other Autogen applications, the Testbed will look for the OpenAI keys in a file in the current working directy, or environment variable named, OAI_CONFIG_LIST. This can be overrriden using a command-line parameter described later. -``cp includes/ENV.example includes/ENV`` - -Then edit ``includes/ENV`` as needed. +For some scenarios, additional keys may be required (e.g., keys for the Bing Search API). These can be added to an `ENV` file in the `includes` folder. A sample has been provided in ``includes/ENV.example``. Edit ``includes/ENV`` as needed. The Testbed also requires Docker (Desktop or Engine) AND the __python docker__ library. **It will not run in codespaces**, unless you opt for native execution (with is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/). To install the Python library: @@ -39,6 +37,9 @@ options: -r REPEAT, --repeat REPEAT The number of repetitions to run for each scenario (default: 10). + -c CONFIG, --config CONFIG + The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST). + --native Run the scenarios natively rather than in docker. NOTE: This is not advisable, and should be done with great caution. ``` diff --git a/samples/tools/testbed/includes/ENV.example b/samples/tools/testbed/includes/ENV.example index 13a3563b81c7..b1f190647d05 100644 --- a/samples/tools/testbed/includes/ENV.example +++ b/samples/tools/testbed/includes/ENV.example @@ -1,15 +1 @@ export BING_API_KEY= -export OAI_CONFIG_LIST=' -[ - { - "model": "gpt-4", - "api_key": "", - "organization": "" - }, - { - "model": "gpt-3.5-turbo-16k", - "api_key": "", - "organization": "" - } -] -' diff --git a/samples/tools/testbed/run_scenarios.py b/samples/tools/testbed/run_scenarios.py index 42fe2f4aa06d..939b9c0d46b1 100644 --- a/samples/tools/testbed/run_scenarios.py +++ b/samples/tools/testbed/run_scenarios.py @@ -7,12 +7,13 @@ import time import pathlib import argparse +from autogen import config_list_from_json # Location of the environment directory -ENV_DIR = "./includes" +INCLUDES_DIR = "./includes" -def run_scenarios(scenario, n_repeats, is_native, results_dir="results"): +def run_scenarios(scenario, n_repeats, is_native, config_list, results_dir="results"): """ Run a set testbed scenarios a given number of times. @@ -21,6 +22,7 @@ def run_scenarios(scenario, n_repeats, is_native, results_dir="results"): all JSONL files in the folder will be loaded and run. 
         n_repeats (int):    The number of times each scenario instance will be repeated
         is_native (bool):   True if the scenario should be run locally rather than in Docker (proceed with caution!)
+        config_list (list): An Autogen OAI_CONFIG_LIST to be used when running scenarios.
         results_dir (path): The folder where results will be saved.
     """
@@ -88,14 +90,19 @@ def run_scenarios(scenario, n_repeats, is_native, results_dir="results"):
             os.mkdir(results_repetition)
             expand_scenario(scenario_dir, instance, os.path.join(results_repetition, "scenario.py"))
 
-            # Also copy the contents of ENV_DIR
-            for item in os.listdir(ENV_DIR):
+            # Also copy the contents of INCLUDES_DIR
+            for item in os.listdir(INCLUDES_DIR):
                 if item.endswith(".example"):
                     continue
-                item_path = os.path.join(ENV_DIR, item)
+                item_path = os.path.join(INCLUDES_DIR, item)
                 if os.path.isfile(item_path):
                     shutil.copyfile(item_path, os.path.join(results_repetition, item))
 
+            # Append the config list to the ENV file
+            config_list_json = json.dumps(config_list)
+            with open(os.path.join(results_repetition, "ENV"), "at") as fh:
+                fh.write(f"export OAI_CONFIG_LIST='{config_list_json}'\n")
+
             # Run the scenario
             if is_native:
                 run_scenario_natively(results_repetition)
@@ -138,7 +145,7 @@ def run_scenario_natively(work_dir):
         """#
 . ./ENV
 python scenario.py
-rm ENV
+echo SCENARIO COMPLETE !#!#
 """
         )
@@ -186,6 +193,7 @@ def run_scenario_in_docker(work_dir, timeout=600):
 pip install pyautogen
 python scenario.py
 rm ENV
+echo SCENARIO COMPLETE !#!#
 """
         )
@@ -241,6 +249,13 @@ def run_scenario_in_docker(work_dir, timeout=600):
         help="The JSONL scenario file to run. If a directory is specified, then all JSONL scenarios in the directory are run. (default: ./scenarios)",
         default="scenarios",
     )
+    parser.add_argument(
+        "-c",
+        "--config",
+        type=str,
+        help="The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST).",
+        default="OAI_CONFIG_LIST",
+    )
     parser.add_argument(
         "-r", "--repeat", type=int, help="The number of repetitions to run for each scenario (default: 10).", default=10
     )
@@ -252,6 +267,12 @@ def run_scenario_in_docker(work_dir, timeout=600):
 
     args = parser.parse_args()
 
+    # Load the OAI_CONFIG_LIST
+    config_list = config_list_from_json(env_or_file=args.config)
+    if len(config_list) == 0:
+        raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), args.config)
+
+    # Warn if running natively
     if args.native:
         choice = input(
             'WARNING: Running natively, without Docker, not only poses the usual risks of executing arbitrary AI generated code on your machine, it also makes it impossible to ensure that each test starts from a known and consistent set of initial conditions. For example, if the agents spend time debugging and installing Python libraries to solve the task, then those libraries will be available to all other runs. In other words, earlier runs can influence later runs, leading to many confounds in testing.\n\nAre you absolutely sure you want to continue with native execution? Type "Yes" exactly, and in full, to proceed: '
         )
 
     import docker
 
     # Warn about a common error
-    env_file = os.path.join(ENV_DIR, "ENV")
-    example_file = os.path.join(ENV_DIR, "ENV.example")
+    env_file = os.path.join(INCLUDES_DIR, "ENV")
+    example_file = os.path.join(INCLUDES_DIR, "ENV.example")
     if not os.path.isfile(env_file):
-        sys.exit(
-            f"The environment file '{env_file}' does not exist. If this is your first time setting up the testbed, you will want to rename '{example_file}' to '{env_file}' and edit it to include your API keys and configurations."
+        shutil.copyfile(example_file, env_file)
+        sys.stderr.write(
+            f"The environment file '{env_file}' does not exist (perhaps this is your first time setting up the testbed). A default environment file has been provided, but you may want to edit it to include your API keys and configurations.\n"
         )
 
-    run_scenarios(args.scenario, args.repeat, is_native)
+    run_scenarios(args.scenario, args.repeat, is_native, config_list)

From 771c2da70046132ce441ffd45a1a23d6f8493961 Mon Sep 17 00:00:00 2001
From: Adam Fourney
Date: Fri, 3 Nov 2023 16:38:54 -0700
Subject: [PATCH 11/13] Removed dot-slash from 'includes' path for cross-platform compatibility

---
 samples/tools/testbed/run_scenarios.py | 4 ++--
 samples/tools/testbed/scenarios/default_two_agents.py | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/samples/tools/testbed/run_scenarios.py b/samples/tools/testbed/run_scenarios.py
index 939b9c0d46b1..335a12798546 100644
--- a/samples/tools/testbed/run_scenarios.py
+++ b/samples/tools/testbed/run_scenarios.py
@@ -9,8 +9,8 @@
 import argparse
 from autogen import config_list_from_json
 
-# Location of the environment directory
-INCLUDES_DIR = "./includes"
+# Location of the global includes dir. The contents of this directory will be copied to the Docker environment.
+INCLUDES_DIR = "includes"
 
 
 def run_scenarios(scenario, n_repeats, is_native, config_list, results_dir="results"):
diff --git a/samples/tools/testbed/scenarios/default_two_agents.py b/samples/tools/testbed/scenarios/default_two_agents.py
index ed0c49547c0a..c11958fa170a 100644
--- a/samples/tools/testbed/scenarios/default_two_agents.py
+++ b/samples/tools/testbed/scenarios/default_two_agents.py
@@ -15,7 +15,7 @@
     "assistant",
     is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
     llm_config={
-        "request_timeout": 180,  # Remove for autogen version >= 0.2, and OpenAI version >= 1.0
+        # "request_timeout": 180, # Remove for autogen version >= 0.2, and OpenAI version >= 1.0
         "config_list": config_list,
     },
 )

From 06b4e6c67e05f8acd8f72bf02eac78dc9a88f4bb Mon Sep 17 00:00:00 2001
From: Adam Fourney
Date: Fri, 3 Nov 2023 16:41:04 -0700
Subject: [PATCH 12/13] Missed a file.

---
 samples/tools/testbed/scenarios/human_eval_two_agents.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/tools/testbed/scenarios/human_eval_two_agents.py b/samples/tools/testbed/scenarios/human_eval_two_agents.py
index 305831d55732..f6ca36c87dbd 100644
--- a/samples/tools/testbed/scenarios/human_eval_two_agents.py
+++ b/samples/tools/testbed/scenarios/human_eval_two_agents.py
@@ -45,7 +45,7 @@ def run_tests(candidate):
     "assistant",
     is_termination_msg=lambda x: x.get("content", "").rstrip().find("TERMINATE") >= 0,
     llm_config={
-        "request_timeout": 180,  # Remove for autogen version >= 0.2, and OpenAI version >= 1.0
+        # "request_timeout": 180, # Remove for autogen version >= 0.2, and OpenAI version >= 1.0
        "config_list": config_list,
     },
 )

From 09c25c4270bd0b224d0f6348c19d542135f4789b Mon Sep 17 00:00:00 2001
From: Adam Fourney
Date: Fri, 3 Nov 2023 23:25:19 -0700
Subject: [PATCH 13/13] Updated readme to include known-working versions.
---
 samples/tools/testbed/README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/samples/tools/testbed/README.md b/samples/tools/testbed/README.md
index 01ab1f32aeed..f947a0a5d011 100644
--- a/samples/tools/testbed/README.md
+++ b/samples/tools/testbed/README.md
@@ -2,6 +2,8 @@
 
 The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts (see the HumanEval example later in this README). By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety.
 
+This Testbed sample has been tested in, and is known to work with, Autogen versions 0.1.14 and 0.2.0b1.
+
 ## Setup
 
 Before you begin, you must configure your API keys for use with the Testbed. As with other Autogen applications, the Testbed will look for the OpenAI keys in a file in the current working directory, or in an environment variable, named OAI_CONFIG_LIST. This can be overridden using a command-line parameter described later.
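
For quick reference, the following is a minimal sketch of the configuration handling that the README paragraph above describes and that `run_scenarios.py` implements via Autogen's `config_list_from_json` helper. The name `OAI_CONFIG_LIST` and the model entry shown in the comment are illustrative assumptions, not values mandated by these patches.

```
import json

from autogen import config_list_from_json

# The value passed here is treated first as the name of an environment variable;
# if that variable is not set, it is treated as the path to a JSON file on disk.
config_list = config_list_from_json(env_or_file="OAI_CONFIG_LIST")
if len(config_list) == 0:
    raise ValueError("No OAI_CONFIG_LIST found in the environment or on disk.")

# run_scenarios.py then appends the serialized list to each repetition's ENV file,
# roughly equivalent to emitting a line such as:
#   export OAI_CONFIG_LIST='[{"model": "gpt-4", "api_key": "sk-..."}]'
print(f"export OAI_CONFIG_LIST='{json.dumps(config_list)}'")
```

Passing `-c`/`--config` to `run_scenarios.py` simply changes which environment variable name or file path is handed to this helper.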