Full prompt test cases #1354
Conversation
…'s useless to ask AutoGPT to get specific weather conditions
Can we just start with one test that you know will always work in less than, let's say, 10 cycles?
@josephcmiller2 you wanna join us on Discord so we can talk about it?
Right now it's kind of sitting in between, because it's very simple and not very hard for the AI to achieve (good smoke test for consistency), but it's also very long (benchmark).
I'm on Discord. You can find me in pull-requests or dev-autogpt.
What was the status of this? Would love to get test coverage up ASAP.
I haven't received any feedback on why it isn't merged or what to do next. Push the devs on Discord to help get this through.
@josephcmiller2 you received feedback from Merwane: #1354 (comment) #1354 (comment) To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.
See also #1359
This is a mass message from the AutoGPT core team. For more details (and for info on joining our Discord), please refer to:
If autogpt.py were executable in a single-shot fashion via corresponding CLI args, we could have our cake and eat it too, i.e. use Auto-GPT itself to run such tests, optionally. That would not even have to be a feature available in interactive mode, just via a startup flag / CLI argument. That way, you'd probably get a ton of data.
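As a minimal sketch of what such a single-shot entry point could look like, assuming a hypothetical --single-shot flag and placeholder helpers (run_agent_once, run_interactive) that do not exist in the current codebase:

```python
# Hypothetical sketch only: this flag and these helper functions are not part
# of the existing Auto-GPT CLI; they just illustrate the idea.
import argparse
import sys


def run_agent_once(task: str, max_cycles: int) -> bool:
    """Placeholder: would drive the agent non-interactively and return True
    if the task completes within max_cycles."""
    raise NotImplementedError


def run_interactive() -> int:
    """Placeholder: would start the normal interactive session."""
    raise NotImplementedError


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Run Auto-GPT once and exit (hypothetical)."
    )
    parser.add_argument("--single-shot", action="store_true",
                        help="Run the given task non-interactively, then exit.")
    parser.add_argument("--task", type=str, default="",
                        help="The prompt/task to execute in single-shot mode.")
    parser.add_argument("--max-cycles", type=int, default=3,
                        help="Abort if the agent has not finished within this many cycles.")
    args = parser.parse_args()

    if args.single_shot:
        success = run_agent_once(args.task, max_cycles=args.max_cycles)
        return 0 if success else 1

    # Fall back to the usual interactive loop.
    return run_interactive()


if __name__ == "__main__":
    sys.exit(main())
```

An exit code of 0 or 1 would make such runs easy to wire into CI, which is where most of the test data would come from.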
Maybe we can have our cake and eat it by allowing folks to experiment with different "prompt profiles" for different purposes? This sort of feature would be strictly opt-in (env level), but it would help us deal with all those PRs where people share their own prompt configs, and neatly organize things, while also providing an excellent way to benchmark/test things (regressions!) over time. For the sake of regression testing, we should not underestimate the power of allowing people to create/share their own prompt profiles. This sort of feature would at least help close 10+ PRs here, immediately! See: #1874 (comment)
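For illustration only, a prompt profile could be a small shareable file loaded behind an environment variable; the file format, field names, and the PROMPT_PROFILE variable below are assumptions, not existing features:

```python
# Hypothetical sketch of an opt-in "prompt profile" loader; nothing here
# reflects the current Auto-GPT configuration system.
import json
import os
from dataclasses import dataclass, field


@dataclass
class PromptProfile:
    name: str
    ai_role: str
    goals: list = field(default_factory=list)
    constraints: list = field(default_factory=list)


def load_profile(path: str) -> PromptProfile:
    """Read a shareable profile from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return PromptProfile(
        name=data["name"],
        ai_role=data["ai_role"],
        goals=data.get("goals", []),
        constraints=data.get("constraints", []),
    )


# Strictly opt-in: only used if the (hypothetical) env variable is set.
profile_path = os.environ.get("PROMPT_PROFILE")
active_profile = load_profile(profile_path) if profile_path else None
```

Keeping profiles as plain data files would also let the same files drive regression benchmarks over time.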
This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
Closing as stale |
Background
We should be tracking prompts that work successfully, both to understand what the system is capable of and to ensure code changes don't break what already works. Besides the unit tests, we need end-to-end test cases for real-world questions.
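For illustration, an end-to-end prompt test case could be described roughly as below; the class, field names, and example task are hypothetical and not taken from this PR's script:

```python
# Hypothetical illustration of how an end-to-end prompt test case could be
# described, following the smoke-test budget (about 3 cycles) discussed above.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptTestCase:
    name: str
    prompt: str                    # the real-world question given to the agent
    max_cycles: int                # smoke tests should stay around 3 cycles
    check: Callable[[str], bool]   # validates the agent's final output


def file_contains(path: str, needle: str) -> Callable[[str], bool]:
    """Build a check that passes if the agent wrote `needle` into `path`."""
    def _check(_output: str) -> bool:
        try:
            with open(path, "r", encoding="utf-8") as f:
                return needle in f.read()
        except FileNotFoundError:
            return False
    return _check


SMOKE_TESTS = [
    PromptTestCase(
        name="write_greeting_file",
        prompt="Write the word 'hello' to a file named greeting.txt, then terminate.",
        max_cycles=3,
        check=file_contains("greeting.txt", "hello"),
    ),
]
```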
Changes
Documentation
Test Plan
Tested each test case multiple times
Tested the script with [all] as well as with individual test cases
PR Quality Checklist