Testbed folders #792
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #792   +/-   ##
=======================================
  Coverage   27.77%   27.77%
=======================================
  Files          27       27
  Lines        3500     3500
  Branches      794      794
=======================================
  Hits          972      972
  Misses       2457     2457
  Partials       71       71

Flags with carried forward coverage won't be shown.
I was looking at this testbed and have a question about Docker. Currently, run_scenario pulls the python:3.11 Docker image and runs pip install every time. Is it possible to pass in the name of a customized Docker image to run (or am I overlooking something)? I think it would be even better if users could just pass in a
I added package names to include/requirements.txt so that every time a new Docker image is created, these packages are automatically installed. This could be a workaround if installation is fast.
Yes, adding support for a custom Docker image or a Dockerfile would be a logical next step. At present, there are a couple of options: (1) you can update or specify a requirements.txt file, or (2) you can customize the global_startup.sh or scenario_startup.sh files to install by other means. The new folder-oriented specification means that you can also include zips, packages, or other dependencies in the includes or scenario template folders, allowing for testing on private content.
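To make the "custom Docker image" idea from this exchange concrete, here is a minimal sketch of how an optional image override could be threaded into a `docker run` invocation. This is not the testbed's actual implementation: `build_docker_command`, the `/workspace` mount point, and the `run.sh` entry script are hypothetical names chosen for illustration. Only the `python:3.11` default comes from the thread.

```python
import shlex
from typing import List, Optional

# Default image mentioned in the thread; everything else below is hypothetical.
DEFAULT_IMAGE = "python:3.11"


def build_docker_command(
    work_dir: str,
    image: Optional[str] = None,
    script: str = "run.sh",  # hypothetical entry script inside the work dir
) -> List[str]:
    """Build the argv for a `docker run` call, falling back to the default image."""
    return [
        "docker", "run", "--rm",
        "-v", f"{work_dir}:/workspace",  # mount the expanded scenario folder
        "-w", "/workspace",              # run from inside the mounted folder
        image or DEFAULT_IMAGE,          # custom image wins when provided
        "sh", script,
    ]


if __name__ == "__main__":
    # Print the command that would run with a pre-built, user-supplied image.
    argv = build_docker_command("/tmp/results/scenario_0", image="my-org/testbed:latest")
    print(shlex.join(argv))
```

With a pre-built image that already contains the dependencies, the per-run pip install described above would no longer be needed, which is the motivation raised in the question.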
* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepare collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Added initial support for GAIA benchmark.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Refined GAIA support, and broke scenarios down by difficulty.
* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.
* Added instructions for cloning GAIA
* Updated README to fix some typos.
* Added a script to format GAIA results for the leaderboard.
* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py

Co-authored-by: LeoLjl <[email protected]>

---------

Co-authored-by: Qingyun Wu <[email protected]>
Co-authored-by: LeoLjl <[email protected]>
… (microsoft#810)

* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepare collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Added initial support for GAIA benchmark.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Refined GAIA support, and broke scenarios down by difficulty.
* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.
* Added instructions for cloning GAIA
* Updated README to fix some typos.
* Added a script to format GAIA results for the leaderboard.
* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py

Co-authored-by: LeoLjl <[email protected]>

---------

Co-authored-by: Qingyun Wu <[email protected]>
Co-authored-by: LeoLjl <[email protected]>
* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepare collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Add tests from AutoGPT.
* Update README.md
* Fix typo
* Update samples/tools/testbed/README.md

---------

Co-authored-by: LeoLjl <[email protected]>
Co-authored-by: Qingyun Wu <[email protected]>
Why are these changes needed?
The current testbed templating format works great for single-file scenarios, but is less flexible when multiple files need to be included in a test (e.g., including a PDF or image to operate over). This PR moves the templating format to one that accepts whole folders. Backwards compatibility is maintained.
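As a rough illustration of what folder-based templating involves, the sketch below copies a whole template folder and applies placeholder substitutions to its text files while leaving binary attachments (a PDF, an image) untouched. It is not the code in this PR: the function name `expand_template`, the `__PROMPT__` placeholder, and the file names are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def expand_template(template_dir: Path, dest_dir: Path, substitutions: Dict[str, str]) -> None:
    """Copy a template folder to dest_dir and substitute placeholders in text files."""
    shutil.copytree(template_dir, dest_dir)  # dest_dir must not exist yet
    for path in dest_dir.rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # not valid UTF-8 (e.g. a PDF or image): keep the copied bytes as-is
        for placeholder, value in substitutions.items():
            text = text.replace(placeholder, value)
        path.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        template = root / "template"
        template.mkdir()
        # Hypothetical scenario file with a placeholder, plus a binary attachment.
        (template / "scenario.py").write_text('PROMPT = "__PROMPT__"\n', encoding="utf-8")
        (template / "attachment.bin").write_bytes(b"\xff\xfe\x00\x01")

        expanded = root / "expanded"
        expand_template(template, expanded, {"__PROMPT__": "Summarize the attached file."})
        print((expanded / "scenario.py").read_text(encoding="utf-8"))
```

The decode check is a simple heuristic: a binary file that happens to be valid UTF-8 would still be treated as text, so a real implementation might prefer an explicit list of files to template.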
Related issue number
This PR will enable progress on #691, #692, and other benchmarks (e.g., the newly released GAIA).
Checks