Examples for benchmarking #1359
Comments
As I mentioned elsewhere, I firmly believe the project should consider adding a new directory to the repository where all such ai_settings.yaml files could be committed and maintained (to have history etc.). The point is that any sort of AI inevitably needs some form of benchmark and training data. If the ~120k people currently using this project shared some of their dysfunctional ai_settings.yaml files, this benchmark suite could grow rapidly over the course of a couple of weeks. It would then be a piece of cake to run the suite every once in a while to see which changes to Auto-GPT help the agent complete more tasks, or complete them more correctly.
So please consider having a benchmark suite of YAML files: ideally, one category of files that are known to work well (for regression testing purposes) and another for files that "break" Auto-GPT (endless loops, repeating itself, etc.). Once this is in place, it could help improve the project rather rapidly.
For future reference:
This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.
This issue was closed automatically because it has been stale for 10 days with no activity.
Duplicates
Summary 💡
In order to improve AutoGPT, we need examples of simple tasks which it should be able to ace but DOESN'T.
If you've found a good candidate task, please post the contents of your ai_settings.yaml in a comment.
Examples 🌈
This gets lost doing Google searches.
Motivation 🔦
If we have decent benchmarks, we can figure out where the technology is weakest and focus our effort there.
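To make the idea concrete, here is a minimal sketch of how the two-category suite proposed in the comments might be run. Everything in it is an assumption for illustration: the `benchmarks/known_good` and `benchmarks/known_bad` directory names, the `collect_cases` and `run_suite` helpers, and the `run_task` callback are hypothetical and not part of Auto-GPT. The callback stands in for whatever actually launches the agent with a given settings file.

```python
from collections import Counter
from pathlib import Path
from typing import Callable

# Hypothetical layout (not part of the Auto-GPT repository):
#   benchmarks/known_good/*.yaml   tasks expected to succeed (regression tests)
#   benchmarks/known_bad/*.yaml    tasks that currently loop, repeat, or fail
CATEGORIES = ("known_good", "known_bad")


def collect_cases(root: Path) -> dict[str, list[Path]]:
    """Return the settings files found under each category directory, sorted by name."""
    return {
        category: sorted((root / category).glob("*.yaml"))
        for category in CATEGORIES
    }


def run_suite(root: Path, run_task: Callable[[Path], bool]) -> dict[str, Counter]:
    """Run every case and tally passes and failures per category.

    `run_task` is a placeholder supplied by the caller: it should launch the
    agent with the given settings file and return True if the task completed.
    """
    results: dict[str, Counter] = {}
    for category, cases in collect_cases(root).items():
        tally: Counter = Counter()
        for case in cases:
            tally["passed" if run_task(case) else "failed"] += 1
        results[category] = tally
    return results
```

Comparing the tallies between two versions of the agent would show regressions (fewer passes in `known_good`) and improvements (more passes in `known_bad`).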