
Full prompt test cases #1354

Closed

Conversation

josephcmiller2
Contributor

Background

We should be tracking prompts that work successfully, both to understand what AutoGPT is capable of and to ensure code changes don't break what's already working. Besides the unit tests, we need end-to-end test cases for real-world questions.

Changes

  • Add a set of test cases that seem to work consistently with the current state of AutoGPT
  • Add a script to execute those test cases (see the sketch after this list)
  • Add instructions to create one's own test cases
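
For a rough idea of the shape such a runner could take, here is an illustrative sketch (not the actual script in this PR; the tests/prompt_tests layout, the .expected naming convention, and the exact autogpt invocation are all assumptions):

```python
#!/usr/bin/env python3
"""Illustrative sketch of a prompt test runner; not the actual script in this PR.

Assumptions: one ai-settings file per test case under tests/prompt_tests/, a
sibling .expected file naming the output artifact the prompt should produce,
and AutoGPT runnable as `python -m autogpt`.
"""
import subprocess
import sys
from pathlib import Path

TESTS_DIR = Path("tests/prompt_tests")

def run_case(settings_file: Path) -> bool:
    """Run AutoGPT against one saved prompt and check that its output file appears."""
    expected_output = Path(settings_file.with_suffix(".expected").read_text().strip())
    subprocess.run(
        [sys.executable, "-m", "autogpt", "--continuous", "--ai-settings", str(settings_file)],
        timeout=600,
        check=False,
    )
    return expected_output.exists()

def main() -> None:
    names = sys.argv[1:] or ["all"]
    cases = (
        sorted(TESTS_DIR.glob("*.yaml"))
        if names == ["all"]
        else [TESTS_DIR / f"{name}.yaml" for name in names]
    )
    failed = [case.stem for case in cases if not run_case(case)]
    print(f"{len(cases) - len(failed)}/{len(cases)} cases passed; failed: {failed or 'none'}")
    sys.exit(1 if failed else 0)

if __name__ == "__main__":
    main()
```

Invoked as `python run_prompt_tests.py all` or with individual test-case names, matching the usage described in the Test Plan.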

Documentation

  • tests/prompt_tests/README.md - Explains the usage and how to create one's own test cases
  • In-code comments

Test Plan

Tested each test case multiple times
Tested script with [all] as well as individual test cases

PR Quality Checklist

  • My pull request is atomic and focuses on a single change.
  • I have thoroughly tested my changes with multiple different prompts.
  • I have considered potential risks and mitigations for my changes.
  • I have documented my changes clearly and comprehensively.
  • I have not snuck in any "extra" small tweaks or changes.

…'s useless to ask AutoGPT to get specific weather conditions
@Pwuts Pwuts added testing and removed enhancement New feature or request labels Apr 20, 2023
@waynehamadi
Copy link
Contributor

  • 20 cycles to run the test is going to be expensive, and there are multiple tests.
  • Why so many similar tests?
  • Let's build the CI pipeline with it as well, because the test doesn't have much value if it doesn't break when people push stuff.
  • I would love it if, instead of just checking for a file, you had an even more specific check (see the sketch below).

Can we just start with one test that you know will always work in less than, let's say, 10 cycles?
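
For instance, a stricter pass condition might assert on the file's content rather than its mere existence; this is purely illustrative, and the file name and expected answer are made up:

```python
from pathlib import Path

def check_capital_test(workspace: Path) -> bool:
    """Stricter check: verify content, not just that a file was written.

    Hypothetical test case: the prompt asks AutoGPT to write the capital of
    France to capital.txt in its workspace.
    """
    out = workspace / "capital.txt"
    if not out.exists():
        return False
    return "paris" in out.read_text().strip().lower()
```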

@waynehamadi
Contributor

@josephcmiller2 you wanna join us on Discord so we can talk about it?
We love this initiative; we're just wondering if there is a way to integrate it either:

  • as a smoke test
  • or as a benchmark.

Right now it's kind of sitting in between, because it's very simple and not very hard for the AI to achieve (good smoke test for consistency), but it's also very long (benchmark).

@josephcmiller2
Contributor Author

I'm on Discord. You can find me in pull-requests or dev-autogpt

@Pwuts Pwuts assigned Pwuts and waynehamadi and unassigned Pwuts Apr 20, 2023
@ntindle
Member

ntindle commented Apr 22, 2023

What was the status of this? Would love to get test coverage up ASAP.

@josephcmiller2
Contributor Author

What was the status of this? Would love to get test coverage up ASAP.

I haven't received any feedback on why it isn't merged or what to do next. Push the devs on Discord to help get this through.

@Pwuts
Member

Pwuts commented Apr 24, 2023

@josephcmiller2 you received feedback from Merwane: #1354 (comment) #1354 (comment)

To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.

@Boostrix
Contributor

Boostrix commented May 2, 2023

See also #1359

@p-i-
Contributor

p-i- commented May 5, 2023

This is a mass message from the AutoGPT core team.
Our apologies for the ongoing delay in processing PRs.
This is because we are re-architecting the AutoGPT core!

For more details (and for info on joining our Discord), please refer to:
https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting

@Boostrix
Contributor

Boostrix commented May 5, 2023

To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken.

If autogpt.py were executable in a single-shot fashion via corresponding CLI args, we could have our cake and eat it too: use Auto-GPT itself to run such tests, optionally (a rough sketch follows). That would not even have to be a feature available in interactive mode, just via a startup flag / CLI argument. That way, you'd probably get a ton of data.
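
As a rough sketch of how a test harness could use such a one-shot mode (the --single-shot and --prompt flags are hypothetical; they illustrate the proposed interface, not anything that exists in AutoGPT today):

```python
import subprocess
import sys

# Hypothetical one-shot invocation: run a single prompt non-interactively
# and exit. Neither --single-shot nor --prompt exists yet; that is the
# feature being proposed here.
result = subprocess.run(
    [sys.executable, "-m", "autogpt", "--single-shot",
     "--prompt", "Write the capital of France to capital.txt"],
    capture_output=True,
    text=True,
    timeout=300,
)
print("exit code:", result.returncode)
```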

@Boostrix
Contributor

Boostrix commented May 7, 2023

tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.

Maybe we can have our cake and eat it by allowing folks to experiment with different "prompt profiles" for different purposes? This sort of feature would be strictly opt-in (env level), but it would help us deal with all those PRs where people share their own prompt configs, and neatly organize things - while also providing an excellent way to benchmark/test things (regressions!) over time.

For the sake of regression testing, we should not underestimate the power of allowing people to create/share their own prompt profiles (sketched below). This sort of feature would at least help close 10+ PRs here immediately!

See: #1874 (comment)
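
A minimal sketch of what opt-in, env-level profile selection could look like (the PROMPT_PROFILE variable and profiles/ directory are hypothetical, not existing AutoGPT configuration):

```python
import os
from pathlib import Path

def load_prompt_profile(default_prompt: str) -> str:
    """Return the prompt from the profile named by PROMPT_PROFILE, if any.

    Hypothetical opt-in mechanism: when the variable is unset or the profile
    file is missing, we fall back to the built-in default, so existing
    behavior is unchanged.
    """
    name = os.getenv("PROMPT_PROFILE")
    if not name:
        return default_prompt
    profile = Path("profiles") / f"{name}.txt"
    return profile.read_text() if profile.is_file() else default_prompt
```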

@vercel

vercel bot commented May 21, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name   Status            Preview        Comments         Updated (UTC)
docs   ✅ Ready (Inspect)  Visit Preview  💬 Add feedback   May 21, 2023 1:21pm

@github-actions
Contributor

This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note that this PR might be rejected due to its size.

@vercel vercel bot temporarily deployed to Preview May 21, 2023 13:21
@Pwuts
Member

Pwuts commented May 31, 2023

Closing as stale

@Pwuts Pwuts closed this May 31, 2023