Examples for benchmarking #1359

Closed
p-i- opened this issue Apr 14, 2023 · 3 comments
Labels
Needs Benchmark (This change is hard to test and requires a benchmark) · Stale

Comments


p-i- commented Apr 14, 2023

Duplicates

  • I have searched the existing issues

Summary 💡

In order to improve AutoGPT, we need examples of simple tasks which it should be able to ace but DOESN'T.

If you've found a good candidate task, please dump the contents of your ai_settings.yaml in a comment.

Examples 🌈

ai_goals:
- Get one PR number for the Auto-GPT project that is conflict-free and can be merged
ai_name: Manager-GPT
ai_role: an AI designed to look for GitHub pull requests that are free of conflicts

The agent gets lost doing Google searches.

Motivation 🔦

If we have decent benchmarks we can figure out where the technology is weakest, and apply focus at the key points.

@ntindle ntindle added the Needs Benchmark label Apr 23, 2023

Boostrix commented Apr 30, 2023

As I mentioned elsewhere, I firmly believe the project should consider adding a new directory to the repository where all such ai_settings.yaml files could be committed and maintained (so we have history etc.). The point being: for any sort of AI, you inevitably need some form of benchmark and training data.

Thus, if the ~120k folks currently using this project shared some of their dysfunctional ai_settings.yaml files, this benchmark suite could grow rapidly over the course of a couple of weeks, and it would then be a piece of cake to run the suite every once in a while and see which improvements to AutoGPT help the agent complete more tasks, or complete them more correctly.

So please consider having a benchmark suite of yaml files: ideally, one category of files that are known to work well (for regression-testing purposes) and another for yaml files that "break" Auto-GPT (endless loops, repeated steps, etc.).

Once this is in place, this could help improve the project rather rapidly.
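
To make this concrete, below is a minimal runner sketch. The benchmarks/working and benchmarks/breaking layout, the 600-second timeout, and the --ai-settings invocation are all assumptions chosen to illustrate the idea, not the project's actual interface:

# Hypothetical layout:
#   benchmarks/working/*.yaml   -> settings Auto-GPT handles today (regression suite)
#   benchmarks/breaking/*.yaml  -> settings that currently trip it up
import subprocess
import sys
from pathlib import Path

# Assumed CLI invocation; adjust to however your checkout launches
# Auto-GPT with a specific settings file.
AUTOGPT_CMD = [sys.executable, "-m", "autogpt", "--ai-settings"]
TIMEOUT_S = 600  # per-task budget; endless loops count as failures

def run_case(settings: Path) -> bool:
    """Run one ai_settings.yaml case; True means the agent exited cleanly."""
    try:
        result = subprocess.run(
            AUTOGPT_CMD + [str(settings)],
            timeout=TIMEOUT_S,
            capture_output=True,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung or looped

def main() -> None:
    for category in ("working", "breaking"):
        cases = sorted(Path("benchmarks", category).glob("*.yaml"))
        passed = sum(run_case(c) for c in cases)
        print(f"{category}: {passed}/{len(cases)} completed")

if __name__ == "__main__":
    main()

A case that hangs or loops simply times out and counts as a failure, which is exactly the behaviour worth tracking for the "breaking" category.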

For future reference:

#15 (comment)

Basically, that whole GitHub issue revolves around tests:
We'll need to run benchmarks in a GitHub Action to validate that AutoGPT is not "losing" capability at every pull request. The benchmark has to use the same version of GPT every time and has to test the whole spectrum of what AutoGPT can do (a sketch of such a CI gate follows after the reference below):

  • write text
  • browse the internet
  • execute commands
  • etc, etc...

#15 (comment)
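
To sketch that CI gate: a pytest file that reuses the hypothetical run_case() helper from the runner above and pins the model through an environment variable. The benchmark_runner module name, the OPENAI_MODEL variable, and the pinned model string are assumptions for illustration only:

# Hypothetical CI gate: fail the pull request if the "working" suite regresses.
# benchmark_runner is the runner sketch above, saved as a module; OPENAI_MODEL
# and the pinned model string are assumptions, not the project's actual config.
import os
from pathlib import Path

import pytest

from benchmark_runner import run_case

os.environ.setdefault("OPENAI_MODEL", "gpt-4-0314")  # pin the model version

WORKING = sorted(Path("benchmarks", "working").glob("*.yaml"))

@pytest.mark.parametrize("settings", WORKING, ids=lambda p: p.stem)
def test_known_good_settings_still_pass(settings: Path) -> None:
    assert run_case(settings), f"regression: {settings.name} no longer completes"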


github-actions bot commented Sep 6, 2023

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

@github-actions github-actions bot added the Stale label Sep 6, 2023
github-actions bot commented

This issue was closed automatically because it has been stale for 10 days with no activity.

@github-actions github-actions bot closed this as not planned (stale) Sep 19, 2023