Examples for benchmarking #1359

p-i- · 2023-04-14T16:52:16Z

Duplicates

I have searched the existing issues

Summary 💡

In order to improve AutoGPT, we need examples of simple tasks which it should be able to ace but DOESN'T.

If you've found a good candidate task, please post the contents of your ai_settings.yaml in a comment.

Examples 🌈

ai_goals:
- Get one PR number for the Auto-GPT project that is conflict-free and can be merged
ai_name: Manager-GPT
ai_role: an AI designed to look for GitHub pull requests that are free of conflicts

With these settings, the agent gets lost doing Google searches.

Motivation 🔦

If we have decent benchmarks we can figure out where the technology is weakest, and apply focus at the key points.

Boostrix · 2023-04-30T07:05:30Z

As I mentioned elsewhere, I firmly believe the project should add a new directory to the repository where all such ai_settings.yaml files can be committed and maintained (so they have history, etc.). The point is that any sort of AI inevitably needs some form of benchmark and training data.

Thus, if the roughly 120k people currently using this project shared some of their dysfunctional ai_settings.yaml files, this benchmark suite could grow rapidly over a couple of weeks. It would then be a piece of cake to run the suite every once in a while to see which improvements to AutoGPT help the agent complete more tasks, or complete them more correctly.

So please consider having a benchmark suite of YAML files: ideally, one category of files that are known to work well (for regression testing) and another for YAML files that "break" Auto-GPT (endless loops, repeated actions, etc.).

Once this is in place, this could help improve the project rather rapidly.
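
A minimal sketch of how such a suite could be run, assuming one subdirectory per category (the names known_good and known_bad are made up for illustration) and a caller-supplied run_task function that launches the agent on a settings file and reports success. Neither the layout nor run_task exists in Auto-GPT; both are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class SuiteResult:
    """Outcome of running every settings file in one category directory."""
    category: str
    passed: List[str]
    failed: List[str]

    @property
    def pass_rate(self) -> float:
        total = len(self.passed) + len(self.failed)
        return len(self.passed) / total if total else 0.0


def run_suite(root: Path, run_task: Callable[[Path], bool]) -> Dict[str, SuiteResult]:
    """Run run_task on each *.yaml file under every subdirectory of root.

    run_task is a hypothetical hook: it should start the agent with the
    given settings file and return True if the task was completed.
    """
    results: Dict[str, SuiteResult] = {}
    for category_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        result = SuiteResult(category_dir.name, [], [])
        for settings in sorted(category_dir.glob("*.yaml")):
            bucket = result.passed if run_task(settings) else result.failed
            bucket.append(settings.name)
        results[category_dir.name] = result
    return results
```

Comparing pass_rate per category between two revisions would show whether a change helped: the known-good category should stay at 1.0, while the known-bad category should climb over time.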

For future reference:

#15 (comment)

Basically this whole GitHub issue revolves around tests:
We'll need to run benchmarks in GitHub Actions to validate that it's not "losing" capability at every pull request. The benchmark has to use the same version of GPT every time and has to test the whole spectrum of what AutoGPT can do:

write text

browse the internet

execute commands

etc, etc...

#15 (comment)
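
The two requirements in that quote (same GPT version every run, coverage of the whole spectrum) could be checked mechanically before a benchmark run. The sketch below is only an illustration: the capability tag names, the task-to-tags mapping, and both helper functions are assumptions, not part of Auto-GPT, and the quoted list ends in "etc", so the required set here is not exhaustive.

```python
from typing import Dict, Iterable, Set

# Hypothetical tags for the capabilities named in the quote above.
REQUIRED_CAPABILITIES = frozenset({"write_text", "browse_internet", "execute_commands"})


def missing_capabilities(
    task_tags: Dict[str, Iterable[str]],
    required: Iterable[str] = REQUIRED_CAPABILITIES,
) -> Set[str]:
    """Return the required capabilities that no benchmark task exercises."""
    covered: Set[str] = set()
    for tags in task_tags.values():
        covered.update(tags)
    return set(required) - covered


def uses_single_model(model_ids: Iterable[str]) -> bool:
    """True if every recorded run used the same model identifier."""
    return len(set(model_ids)) <= 1
```

A CI job could refuse to report results when missing_capabilities returns a non-empty set or uses_single_model returns False, since the numbers would not be comparable across pull requests.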

github-actions · 2023-09-06T21:09:33Z

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

github-actions · 2023-09-19T01:47:20Z

This issue was closed automatically because it has been stale for 10 days with no activity.

ntindle added the "Needs Benchmark" label (This change is hard to test and requires a benchmark) · Apr 23, 2023

Boostrix mentioned this issue in Prompt Profiles #3954 (closed, 1 task) · May 7, 2023

github-actions bot added the Stale label · Sep 6, 2023

github-actions bot closed this as not planned · Sep 19, 2023
