Add collate file and more tests from autogpt into testbed #915
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##              main     #915      +/-  ##
==========================================
- Coverage    26.58%   26.44%   -0.14%
==========================================
  Files           28       28
  Lines         3732     3777      +45
  Branches       847      858      +11
==========================================
+ Hits           992      999       +7
- Misses        2667     2706      +39
+ Partials        73       72       -1
==========================================

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
The pipeline is ready to go; just more tests need to be added.
@afourney PTAL. I've added 13 tests covering coding, scraping, and file I/O. The pipeline is similar to that of HumanEval. I added some packages to requirements.txt because they are fundamental to task solving; I don't think it is necessary to waste conversation turns having the agents install required packages.
Reviewing now.
I think the question about installing packages is an interesting one. If they are common packages, then yes, we should expect them to already be installed. If, however, they are fairly esoteric and specific to the problem (e.g., yfinance), then identifying and installing the library is arguably part of the problem the agents need to work through to succeed. This is actually why I designed the Testbed the way that I did: so that each run would face identical obstacles.

We probably want to identify some common packages and have them pre-installed on a Docker image. Identifying a core set of packages before looking at the problem set is likely ideal and free of bias. Perhaps something like https://learnpython.com/blog/most-popular-python-packages/
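As a rough illustration of that idea, the sketch below checks whether a candidate set of common packages is already importable, so any missing ones could be baked into a Docker image. The package list, the name mapping, and the script itself are hypothetical examples, not part of this PR or of the Testbed's actual configuration.

```python
# Hypothetical sketch: verify that a candidate set of common packages is
# already importable in the image, so agents don't spend turns installing them.
# The package list below is illustrative only, not the real Testbed image contents.
import importlib.util

CANDIDATE_PACKAGES = ["numpy", "pandas", "requests", "beautifulsoup4", "matplotlib"]

# Map distribution names to import names where they differ.
IMPORT_NAMES = {"beautifulsoup4": "bs4"}

missing = [
    pkg for pkg in CANDIDATE_PACKAGES
    if importlib.util.find_spec(IMPORT_NAMES.get(pkg, pkg)) is None
]

if missing:
    print("Packages to pre-install in the Docker image:", ", ".join(missing))
else:
    print("All candidate packages are already available.")
```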
Looks good to me. I was able to run it no problem, and the documentation is good.
As per #976, there may be more work to do, but I am happy to accept it as is and then submit my own PR to standardize things per #976 and #973.
Also, I think the jury is still out on whether we want to allow the agents to continue after a failed test. This capability is no doubt extremely useful for seeing how the agents adapt to a feedback signal (the same is true for HumanEval), but it makes the benchmarks incomparable to other reported numbers. An option to stop after the first attempt is something we may want to invest in (and perhaps turn on by default). We should also at least make it clear that our implementation diverges in this way.
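For illustration only, a minimal sketch of what such an option could look like is below. `evaluate`, `run_scenario`, `apply_feedback`, and `max_attempts` are placeholder names, not the Testbed's actual interface; with `max_attempts=1`, results would match standard single-attempt reporting.

```python
# Hypothetical sketch of a retry loop with a configurable attempt limit.
# `run_scenario` and `apply_feedback` are placeholder callables, not the
# Testbed's real API; max_attempts=1 corresponds to single-attempt evaluation.
from typing import Callable

def evaluate(run_scenario: Callable[[], bool],
             apply_feedback: Callable[[], None],
             max_attempts: int = 1) -> bool:
    """Run a scenario, optionally allowing retries driven by a feedback signal."""
    for attempt in range(1, max_attempts + 1):
        if run_scenario():
            return True          # test passed on this attempt
        if attempt < max_attempts:
            apply_feedback()     # let the agents react to the failure signal
    return False                 # exhausted attempts without passing
```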
* workflow path->paths
* Apply suggestions from code review

Co-authored-by: Li Jiang <[email protected]>
Why are these changes needed?
Add more tests from autogpt and reorganize the file structure.
Related issue number
Checks