Testbed folders #792

Merged
merged 18 commits into from
Nov 30, 2023

afourney
Member

@afourney afourney commented Nov 28, 2023

Why are these changes needed?

The current testbed templating format works great for single-file scenarios, but is less flexible when multiple files need to be included in a test (e.g., including a PDF or image to operate over). This PR moves the templating format to one that accepts whole folders. Backwards compatibility is maintained.
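The following is a minimal sketch of the folder-based idea. It assumes a scenario template is simply a directory whose text files contain placeholders. The helper `expand_template` and the `__KEY__` placeholder syntax are hypothetical and are not the testbed's actual code. The point is that copying the whole folder lets binary assets such as a PDF or image travel with the task, while only the text files are templated.

```python
import shutil
import tempfile
from pathlib import Path


def expand_template(template_dir, dest_dir, substitutions):
    """Copy an entire scenario template folder, then fill placeholders.

    Hypothetical helper: the name and the __KEY__ placeholder syntax are
    illustrative assumptions, not the testbed's actual implementation.
    """
    template_dir, dest_dir = Path(template_dir), Path(dest_dir)
    # Copy the whole folder so extra files (PDFs, images, data) come along.
    shutil.copytree(template_dir, dest_dir, dirs_exist_ok=True)
    for path in dest_dir.rglob("*"):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            continue  # Binary assets are left exactly as copied.
        new_text = text
        for key, value in substitutions.items():
            new_text = new_text.replace(f"__{key}__", value)
        if new_text != text:
            path.write_bytes(new_text.encode("utf-8"))


# Usage: a template holding one text file and one binary asset.
root = Path(tempfile.mkdtemp())
(root / "template").mkdir()
(root / "template" / "scenario.py").write_text("PROMPT = '__PROMPT__'\n")
(root / "template" / "figure.png").write_bytes(b"\x89PNG\r\n")

expand_template(root / "template", root / "task_1", {"PROMPT": "Add two numbers"})
rendered = (root / "task_1" / "scenario.py").read_text()
asset = (root / "task_1" / "figure.png").read_bytes()
shutil.rmtree(root)
```

Skipping files that fail to decode as UTF-8 is a simple heuristic for separating templates from assets. A real implementation might instead mark which files to template explicitly.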

Related issue number

This PR will enable progress on #691, #692, and other benchmarks (e.g., newly released GAIA)

@afourney afourney added the evaluation and proj-autogenbench labels Nov 28, 2023
@afourney afourney requested review from qingyun-wu and a team November 28, 2023 07:08
@afourney afourney self-assigned this Nov 28, 2023
@codecov-commenter

codecov-commenter commented Nov 28, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (ae7066b) 27.77% compared to head (d712076) 27.77%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #792   +/-   ##
=======================================
  Coverage   27.77%   27.77%           
=======================================
  Files          27       27           
  Lines        3500     3500           
  Branches      794      794           
=======================================
  Hits          972      972           
  Misses       2457     2457           
  Partials       71       71           
Flag Coverage Δ
unittests 27.71% <ø> (ø)
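As a quick check on the figures in the table, the sketch below recomputes the headline coverage. It assumes Codecov's documented ratio of hits / (hits + misses + partials), where partially covered lines do not count as covered. The helper name is illustrative. The per-flag unittests figure (27.71%) is computed separately and is not reproduced here.

```python
def coverage_percent(hits, misses, partials):
    """Coverage ratio as Codecov documents it: only fully hit lines count
    as covered; partial lines count against coverage like misses."""
    total = hits + misses + partials
    return round(100 * hits / total, 2)


# Figures from the report (base and head are identical).
hits, misses, partials = 972, 2457, 71
lines = hits + misses + partials  # should equal the "Lines" row
coverage = coverage_percent(hits, misses, partials)
print(lines, coverage)  # 3500 27.77
```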

Flags with carried forward coverage won't be shown.

@sonichi sonichi requested review from LeoLjl and a team November 28, 2023 15:02
@yiranwu0
Collaborator

I was looking at this testbed and had a question about Docker. Currently, run_scenario pulls the python:3.11 Docker image and runs pip install every time. Is it possible to pass in the name of a customized Docker image to run (or am I overlooking something)? It would be even better if users could pass in a docker_build.txt listing the required Python packages, from which a new Docker image is built, when the intention is to run a large-scale experiment.

@LeoLjl
Collaborator

LeoLjl commented Nov 28, 2023

I added package names to include/requirements.txt so that these packages are installed automatically every time a new Docker image is created. This could be a workaround if installation is fast.

@afourney
Member Author

Yes, adding support for a custom Docker image or a Dockerfile would be a logical next step. At present, there are a couple of options: (1) update or specify a requirements.txt file, or (2) customize the global_startup.sh or scenario_startup.sh files to install dependencies by other means. The new folder-oriented specification also means that you can include zips, packages, or other dependencies in the includes or scenario template folders, allowing for testing on private content.
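For the custom-image idea discussed here, one possible shape is a small script that stages a Docker build context from an existing requirements.txt, so packages are installed once at image build time rather than on every run. This is a hypothetical sketch: `write_build_context` is not part of the testbed, and the testbed does not currently accept a custom image name, so using the resulting image would depend on that future option.

```python
import shutil
import tempfile
from pathlib import Path

# Base image matches the python:3.11 image the testbed currently pulls.
DOCKERFILE_TEMPLATE = """\
FROM {base_image}
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
"""


def write_build_context(requirements_path, context_dir, base_image="python:3.11"):
    """Hypothetical helper: stage a Docker build context whose image has the
    listed Python packages pre-installed, so they are not reinstalled on every
    scenario run. Build it afterwards with: docker build -t <tag> <context_dir>
    """
    context = Path(context_dir)
    context.mkdir(parents=True, exist_ok=True)
    # The Dockerfile COPYs requirements.txt, so it must sit inside the context.
    shutil.copyfile(requirements_path, context / "requirements.txt")
    dockerfile = context / "Dockerfile"
    dockerfile.write_text(DOCKERFILE_TEMPLATE.format(base_image=base_image))
    return dockerfile


# Usage: stage a context from a two-package requirements file.
root = Path(tempfile.mkdtemp())
(root / "requirements.txt").write_text("numpy\npandas\n")
dockerfile_text = write_build_context(root / "requirements.txt", root / "ctx").read_text()
copied_requirements = (root / "ctx" / "requirements.txt").read_text()
shutil.rmtree(root)
```

Using `pip install -r` inside the image avoids parsing requirements.txt by hand, so pins, extras, and comments behave exactly as they would with a local install.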

@afourney afourney marked this pull request as draft November 29, 2023 15:16
@sonichi sonichi enabled auto-merge November 30, 2023 16:24
@sonichi sonichi added this pull request to the merge queue Nov 30, 2023
Merged via the queue into main with commit 45c2a78 Nov 30, 2023
16 checks passed
@afourney afourney deleted the testbed_folders branch December 2, 2023 05:24
github-merge-queue bot pushed a commit that referenced this pull request Dec 6, 2023
* Re-added completion logging when using older versions of autogen.

* Extended scenario definitions and templating to include folders.

* Prepare collate_human_eval.py for working with group chat scenarios.

* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.

* Fixed the default termination message.

* Fixed another termination condition.

* Updated compatible autogen versions.

* Added initial support for GAIA benchmark.

* Fixed a bug in executing the finalize scripts.

* Generalized the template further to support multiple folder copy operations.

* Refined GAIA support, and broke scenarios down by difficulty.

* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.

* Added instructions for cloning GAIA

* Updated README to fix some typos.

* Added a script to format GAIA results for the leaderboard.

* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py

Co-authored-by: LeoLjl <[email protected]>

---------

Co-authored-by: Qingyun Wu <[email protected]>
Co-authored-by: LeoLjl <[email protected]>
rlam3 pushed a commit to rlam3/autogen that referenced this pull request Dec 19, 2023
… (microsoft#810)

whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request Apr 17, 2024
* Re-added completion logging when using older versions of autogen.

* Extended scenario definitions and templating to include folders.

* Prepare collate_human_eval.py for working with group chat scenarios.

* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.

* Fixed the default termination message.

* Fixed another termination condition.

* Updated compatible autogen versions.

* Fixed a bug in executing the finalize scripts.

* Generalized the template further to support multiple folder copy operations.

* Add tests from AutoGPT.

* Update README.md

* Fix typo

* Update samples/tools/testbed/README.md

---------

Co-authored-by: LeoLjl <[email protected]>
Co-authored-by: Qingyun Wu <[email protected]>
whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request Apr 17, 2024
… (microsoft#810)

Labels
proj-autogenbench Issues related to AutoGenBench.
6 participants