Testbed folders #792

afourney · 2023-11-28T07:08:26Z

Why are these changes needed?

The current testbed templating format works great for single-file scenarios, but is less flexible when multiple files need to be included in a test (e.g., including a PDF or image to operate over). This PR moves the templating format to one that accepts whole folders. Backwards compatibility is maintained.
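To illustrate the folder-based idea, here is a minimal sketch, not the testbed's actual implementation. The helper name `expand_template`, the `__PROMPT__` placeholder, and the choice to substitute only in `.py` files are all assumptions made for this example. Folders are copied whole, so binary assets such as PDFs or images travel with the scenario, and a single file is still accepted.

```python
import shutil
import tempfile
from pathlib import Path


def expand_template(sources, dest, substitutions):
    """Copy each source (file or folder) into dest, then substitute placeholders.

    Hypothetical helper for illustration only; names and behavior are assumptions.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for src in map(Path, sources):
        if src.is_dir():
            # Later folders overlay earlier ones; binary assets are copied untouched.
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            # A single file still works, mirroring the older single-file format.
            shutil.copy2(src, dest / src.name)
    # Assumption: only Python files contain placeholders to expand.
    for path in dest.rglob("*.py"):
        text = path.read_text(encoding="utf-8")
        for key, value in substitutions.items():
            text = text.replace(key, value)
        path.write_text(text, encoding="utf-8")
    return dest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        template = root / "template"
        template.mkdir()
        (template / "scenario.py").write_text("PROMPT = '__PROMPT__'\n", encoding="utf-8")
        (template / "paper.pdf").write_bytes(b"%PDF-1.4 placeholder")
        out = expand_template([template], root / "out", {"__PROMPT__": "Summarize paper.pdf"})
        print(sorted(p.name for p in out.iterdir()))
```

Passing several sources in order gives a simple overlay: a shared includes folder first, then the scenario-specific template on top.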

Related issue number

This PR will enable progress on #691, #692, and other benchmarks (e.g., the newly released GAIA).

Checks

I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

codecov-commenter · 2023-11-28T07:09:48Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ae7066b) 27.77% compared to head (d712076) 27.77%.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #792   +/-   ##
=======================================
  Coverage   27.77%   27.77%           
=======================================
  Files          27       27           
  Lines        3500     3500           
  Branches      794      794           
=======================================
  Hits          972      972           
  Misses       2457     2457           
  Partials       71       71

Flag	Coverage Δ
unittests	`27.71% <ø> (ø)`

yiranwu0 · 2023-11-28T15:30:33Z

I was looking at this testbed and have a question about Docker. Currently, run_scenario pulls the python:3.11 Docker image and runs pip install every time. Is it possible to pass in the name of a customized Docker image to run (or am I overlooking something)? For large-scale experiments, it would be even better if users could pass in a docker_build.txt listing the required Python packages, from which a new Docker image would be built.

LeoLjl · 2023-11-28T15:41:46Z

> I was looking at this testbed and have a question about Docker. Currently, run_scenario pulls the python:3.11 Docker image and runs pip install every time. Is it possible to pass in the name of a customized Docker image to run (or am I overlooking something)? For large-scale experiments, it would be even better if users could pass in a docker_build.txt listing the required Python packages, from which a new Docker image would be built.

I added package names to include/requirements.txt so that every time a new Docker image is created, these packages are automatically installed. This could be a workaround if installation is fast.

afourney · 2023-11-28T17:32:14Z

Yes, adding support for a custom Docker image or a Dockerfile would be a logical next step. At present, there are a couple of options: (1) you can update or specify a requirements.txt file, or (2) you can customize the global_startup.sh or scenario_startup.sh files to install by other means. The new folder-oriented specification also means that you can include zips, packages, or other dependencies in the includes or scenario template folders, allowing for testing on private content.
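As a rough sketch of the "custom Dockerfile" idea raised in this thread (which this PR does not implement), the snippet below renders Dockerfile text that bakes a requirements file into an image, so that pip install would not need to run on every scenario. The function name `render_dockerfile`, the `/workspace` working directory, and the file layout are assumptions; only the python:3.11 base image and the use of a requirements.txt come from the discussion.

```python
def render_dockerfile(base_image="python:3.11", requirements="requirements.txt"):
    """Return Dockerfile text that pre-installs the listed Python packages.

    Hypothetical helper for illustration; the testbed does not provide it.
    """
    lines = [
        f"FROM {base_image}",
        # Assumed working directory for scenario files.
        "WORKDIR /workspace",
        # Copy the requirements first so the install layer can be cached.
        f"COPY {requirements} /tmp/requirements.txt",
        "RUN pip install --no-cache-dir -r /tmp/requirements.txt",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dockerfile())
```

The resulting text could be written to a Dockerfile and built once before a large run; how the testbed would then select that image is left open here.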

* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepare collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Added initial support for GAIA benchmark.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Refined GAIA support, and broke scenarios down by difficulty.
* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.
* Added instructions for cloning GAIA.
* Updated README to fix some typos.
* Added a script to format GAIA results for the leaderboard.
* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py

Co-authored-by: LeoLjl
Co-authored-by: Qingyun Wu

* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepare collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Add tests from AutoGPT.
* Update README.md
* Fix typo
* Update samples/tools/testbed/README.md

Co-authored-by: LeoLjl
Co-authored-by: Qingyun Wu

afourney added 9 commits November 16, 2023 15:25

Re-added completion logging when using older versions of autogen.

84897b4

Merge remote-tracking branch 'origin/testbed_add_014_logging' into testbed_folders

76ae8f5

Extended scenario definitions and templating to include folders.

014063e

Prepare collate_human_eval.py for working with group chat scenarios.

81c62c9

Converted HumanEval to the folder-based approach, and added GroupChat scenarios.

6f0a45f

Fixed the default termination message.

5e9fe60

Fixed another termination condition.

f96245a

Merge branch 'main' into testbed_folders

bf815aa

Updated compatible autogen versions.

a95c748

afourney added the evaluation and proj-autogenbench labels Nov 28, 2023

afourney requested review from qingyun-wu and a team November 28, 2023 07:08

afourney self-assigned this Nov 28, 2023

sonichi reviewed Nov 28, 2023

samples/tools/testbed/README.md (outdated)

sonichi requested review from LeoLjl and a team November 28, 2023 15:02

afourney added 2 commits November 28, 2023 23:28

Merge branch 'main' into testbed_folders

b962226

Fixed a bug in executing the finalize scripts.

53cd5b4

afourney marked this pull request as draft November 29, 2023 15:16

Generalized the template further to support multiple folder copy operations.

2d97bb8

afourney marked this pull request as ready for review November 29, 2023 22:15

This was referenced Nov 29, 2023

Adds the GAIA benchmark to the Testbed. This PR depends on #792 #810 (Merged)

Unable to successfully run the Autogen Testbed Environment #800 (Closed)

LeoLjl added 2 commits November 30, 2023 13:52

Add tests from AutoGPT.

e574e21

Update README.md

1218746

Fix typo

e506195

sonichi enabled auto-merge November 30, 2023 16:24

qingyun-wu reviewed Nov 30, 2023

samples/tools/testbed/README.md (outdated)

qingyun-wu added 3 commits November 30, 2023 11:38

Merge branch 'main' into testbed_folders

911865f

Update samples/tools/testbed/README.md

30b0a7e

Merge branch 'main' into testbed_folders

d712076

qingyun-wu approved these changes Nov 30, 2023

sonichi added this pull request to the merge queue Nov 30, 2023

Merged via the queue into main with commit 45c2a78 Nov 30, 2023
16 checks passed

afourney deleted the testbed_folders branch December 2, 2023 05:24

afourney mentioned this pull request Dec 14, 2023

Allow users to specify the Docker image to use with Testbed #986 (Merged, 3 tasks)
