Skip to content

Commit

Permalink
Added a simple Testbed tool for repeatedly running templated Autogen …
Browse files Browse the repository at this point in the history
…scenarios with tightly-controlled initial conditions. (#455)

* Initial commit of the autogen testbed environment.

* Fixed some typos in the Testbed README.md

* Added some stricter termination logic to the two_agent scenario, and swiched the logo task from finding Autogen's logo, to finding Microsoft's (it's easier)

* Added documentation to testbed code in preparation for PR

* Added a variation of HumanEval to the Testbed. It is also a reasonable example of how to integrate other benchmarks.

* Removed ChatCompletion.start_logging and related features. Added an explicit TERMINATE output to HumanEval to save 1 turn in each conversation.

* Added metrics utils script for HumanEval

* Updated the requirements in the README.

* Added documentation for HumanEval csv schemas

* Standardized on how the OAI_CONFIG_LIST is handled.

* Removed dot-slash from 'includes' path for cross-platform compatibility

* Missed a file.

* Updated readme to include known-working versions.
  • Loading branch information
afourney authored Nov 4, 2023
1 parent c4f8b1c commit 1c4a5e6
Show file tree
Hide file tree
Showing 10 changed files with 920 additions and 0 deletions.
151 changes: 151 additions & 0 deletions samples/tools/testbed/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
# Autogen Testbed Environment

The Autogen Testbed environment is a tool for repeatedly running a set of pre-defined Autogen scenarios in a setting with tightly-controlled initial conditions. With each run, Autogen will start from a blank slate, working out what code needs to be written, and what libraries or dependencies to install. The results of each run are logged, and can be ingested by analysis or metrics scripts (see the HumanEval example later in this README). By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety.

This Testbed sample has been tested in, and is known to work with, Autogen versions 0.1.14 and 0.2.0b1

## Setup

Before you begin, you must configure your API keys for use with the Testbed. As with other Autogen applications, the Testbed will look for the OpenAI keys in a file in the current working directy, or environment variable named, OAI_CONFIG_LIST. This can be overrriden using a command-line parameter described later.

For some scenarios, additional keys may be required (e.g., keys for the Bing Search API). These can be added to an `ENV` file in the `includes` folder. A sample has been provided in ``includes/ENV.example``. Edit ``includes/ENV`` as needed.

The Testbed also requires Docker (Desktop or Engine) AND the __python docker__ library. **It will not run in codespaces**, unless you opt for native execution (with is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/). To install the Python library:

``pip install docker``

## Running the Testbed

To run the Testbed, simply execute
``python run_scenarios.py``

The default it to repeat this scenario 10 times. This can be costly. To run each scenario only once, use:
``python run_scenarios.py --repeat 1``


The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options:

```
run_scenarios.py will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.
positional arguments:
scenario The JSONL scenario file to run. If a directory is specified,
then all JSONL scenarios in the directory are run. (default:
./scenarios)
options:
-h, --help show this help message and exit
-r REPEAT, --repeat REPEAT
The number of repetitions to run for each scenario (default: 10).
-c CONFIG, --config CONFIG
The environment variable name or path to the OAI_CONFIG_LIST (default: OAI_CONFIG_LIST).
--native Run the scenarios natively rather than in docker.
NOTE: This is not advisable, and should be done with great caution.
```

## Results

By default, the Testbed stores results in a folder heirarchy with the following template:

``./results/[scenario]/[instance_id]/[repetition]``

For example, consider the following folders:

``./results/default_two_agents/two_agent_stocks_gpt4/0``
``./results/default_two_agents/two_agent_stocks_gpt4/1``

...

``./results/default_two_agents/two_agent_stocks_gpt4/9``

This folder holds the results for the ``two_agent_stocks_gpt4`` instance of the ``default_two_agents`` scenario. The ``0`` folder contains the results of the first run. The ``1`` folder contains the results of the second run, and so on. You can think of the _instance_ as mapping to a prompt, or a unique set of parameters, while the _scenario_ defines the template in which those parameters are input.

Within each folder, you will find the following files:

- *timestamp.txt*: records the date and time of the run, along with the version of the pyautogen library installed
- *console_log.txt*: all console output produced by Docker when running autogen. Read this like you would a regular console.
- *chat_completions.json*: a log of all OpenAI ChatCompletions, as logged by ``autogen.ChatCompletion.start_logging(compact=False)``
- *[agent]_messages.json*: for each Agent, a log of their messages dictionaries
- *./coding*: A directory containing all code written by Autogen, and all artifacts produced by that code.

## Scenario Templating

All scenarios are stored in JSONL files in the ``./scenarios'' directory. Each line of a scenario file is a JSON object with the following schema:

```
{
"id": string,
"template": filename,
"values" {
"field_name1": string,
"field_name2": string,
...
"field_nameN": string
}
}
```

For example:

```
{
"id": "two_agent_stocks_gpt4",
"template": "default_two_agents.py",
"values": {
"\__MODEL\__": "gpt-4",
"\__PROMPT\__": "Plot and save to disk a chart of NVDA and TESLA stock price YTD."
}
}
```

Where the ``id`` is the instance id used when saving results, ``template`` points to a python file that contains the scenario logic, and ``values`` contains a set of strings to find and replace when expanding the template.

An example templated python file is:

```
from autogen import AssistantAgent, UserProxyAgent, config_list_from_json
import os
import json
import testbed_utils
testbed_utils.init()
##############################
config_list = config_list_from_json(
"OAI_CONFIG_LIST", filter_dict={"model": ["\__MODEL\__"]},
)
assistant = AssistantAgent("assistant", llm_config={
"request_timeout": 180,
"config_list": config_list}
)
user_proxy = UserProxyAgent("user_proxy",
human_input_mode="NEVER",
code_execution_config={
"work_dir": "coding",
"use_docker": False,
},
max_consecutive_auto_reply=10)
user_proxy.initiate_chat(assistant, message="\__PROMPT\__")
##############################
testbed_utils.finalize(assistant, user_proxy)
```


## (Example) Running HumanEval

One sample Testbed scenario type is a variation of the classic [HumanEval](https://github.com/openai/human-eval) benchmark. In this scenario, agents are given access to the unit test results, and are able to continue to debug their code until the problem is solved or they run out of tokens or turns. We can then count how many turns it took to solve the problem (returning -1 if the problem remains unsolved by the end of the conversation, and "" if the run is missing).

Accessing this scenario-type requires downloading and converting the HumanEval dataset, running the Testbed, collating the results, and finally computing the metrics. The following commands will accomplish this, running each test instance 3 times with GPT-3.5-Turbo-16k:

```
python utils/download_humaneval.py
python ./run_scenarios.py --repeat 3 scenarios/human_eval_two_agents_gpt35.jsonl
python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 | python utils/metrics_human_eval.py > human_eval_results_gpt35.csv
cat human_eval_results_gpt35.csv
```
1 change: 1 addition & 0 deletions samples/tools/testbed/includes/ENV.example
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
export BING_API_KEY=
50 changes: 50 additions & 0 deletions samples/tools/testbed/includes/testbed_utils.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
from importlib.metadata import version as lib_version
from datetime import datetime
import os
import autogen
import json


def init():
"""Helper function to initialize logging in a testbed scenario.
Specifically, write timestamp and version information, then
initialize autogen logging.
Args:
None
Returns:
None
"""

# Print some information about the run
with open("timestamp.txt", "wt") as f:
f.write("Timestamp: " + datetime.now().isoformat() + "\n")
f.write("pyautogen version: " + lib_version("pyautogen") + "\n")


def finalize(agents):
"""Helper function to finalize logging in a testbed scenario.
Calling this function will save all the chat completions logged
by Autogen to disk, and will save the messages dictionaries of
all agents passed via the agents argument.
Args:
agents (list): a list of the agents whose messages will be logged to disk.
Returns:
None
"""

script_dir = os.path.dirname(os.path.realpath(__file__))

def messages_to_json(agent):
messages = dict()
for item in agent.chat_messages.items():
messages[item[0].name] = item[1]
return json.dumps(messages, indent=4)

for agent in agents:
fname = agent.name + "_messages.json"
with open(os.path.join(script_dir, fname), "wt") as fh:
fh.write(messages_to_json(agent))
Loading

0 comments on commit 1c4a5e6

Please sign in to comment.