Full prompt test cases #1354
Conversation
…'s useless to ask AutoGPT to get specific weather conditions
Can we just start with one test that you know will always work in less than, let's say, 10 cycles?
@josephcmiller2 you wanna join us on Discord so we can talk about it?
Right now it's kind of sitting in between, because it's very simple and not very hard for the AI to achieve (good smoke test for consistency), but it's also very long (benchmark).
I'm on Discord. You can find me in pull-requests or dev-autogpt.
What was the status of this? Would love to get test coverage up ASAP.
I haven't received any feedback on why it isn't merged or what to do next. Push the devs on Discord to help get this through.
@josephcmiller2 you received feedback from Merwane: #1354 (comment) #1354 (comment) To summarize our thoughts: tests that involve prompting the AI should either be usable as a benchmark to test the AI's efficacy, or as a smoke test that needs max 3 cycles to determine if something is broken. Everything outside of those brackets is a gray area that will cost us a lot of money in terms of API tokens.
See also #1359
This is a mass message from the AutoGPT core team. For more details (and for info on joining our Discord), please refer to:
If autogpt.py were executable in a single-shot fashion via corresponding CLI args, we could have our cake and eat it too, i.e. use Auto-GPT itself to run such tests, optionally. That would not even have to be a feature available in interactive mode, just via a startup flag / CLI argument. That way, you'd probably get a ton of data.
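As a minimal sketch of what such a single-shot entry point could look like, assuming a hypothetical --single-shot flag and placeholder helpers (run_agent_once, run_interactive) that do not exist in the current codebase:

```python
# Hypothetical sketch only: this flag and these helper functions are not part
# of the existing Auto-GPT CLI; they just illustrate the idea.
import argparse
import sys


def run_agent_once(task: str, max_cycles: int) -> bool:
    """Placeholder: would drive the agent non-interactively and return True
    if the task completes within max_cycles."""
    raise NotImplementedError


def run_interactive() -> int:
    """Placeholder: would start the normal interactive session."""
    raise NotImplementedError


def main() -> int:
    parser = argparse.ArgumentParser(
        description="Run Auto-GPT once and exit (hypothetical)."
    )
    parser.add_argument("--single-shot", action="store_true",
                        help="Run the given task non-interactively, then exit.")
    parser.add_argument("--task", type=str, default="",
                        help="The prompt/task to execute in single-shot mode.")
    parser.add_argument("--max-cycles", type=int, default=3,
                        help="Abort if the agent has not finished within this many cycles.")
    args = parser.parse_args()

    if args.single_shot:
        success = run_agent_once(args.task, max_cycles=args.max_cycles)
        return 0 if success else 1

    # Fall back to the usual interactive loop.
    return run_interactive()


if __name__ == "__main__":
    sys.exit(main())
```

An exit code of 0 or 1 would make such runs easy to wire into CI, which is where most of the test data would come from.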
Maybe we can have our cake and eat it by allowing folks to experiment with different "prompt profiles" for different purposes? This sort of feature would be strictly opt-in (env level), but it would help us deal with all those PRs where people share their own prompt configs, and neatly organize things, while also providing an excellent way to benchmark/test things (regressions!) over time. For the sake of regression testing, we should not underestimate the power of allowing people to create/share their own prompt profiles. This sort of feature would at least help close 10+ PRs here, immediately! See: #1874 (comment)
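For illustration only, a prompt profile could be a small shareable file loaded behind an environment variable; the file format, field names, and the PROMPT_PROFILE variable below are assumptions, not existing features:

```python
# Hypothetical sketch of an opt-in "prompt profile" loader; nothing here
# reflects the current Auto-GPT configuration system.
import json
import os
from dataclasses import dataclass, field


@dataclass
class PromptProfile:
    name: str
    ai_role: str
    goals: list = field(default_factory=list)
    constraints: list = field(default_factory=list)


def load_profile(path: str) -> PromptProfile:
    """Read a shareable profile from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return PromptProfile(
        name=data["name"],
        ai_role=data["ai_role"],
        goals=data.get("goals", []),
        constraints=data.get("constraints", []),
    )


# Strictly opt-in: only used if the (hypothetical) env variable is set.
profile_path = os.environ.get("PROMPT_PROFILE")
active_profile = load_profile(profile_path) if profile_path else None
```

Keeping profiles as plain data files would also let the same files drive regression benchmarks over time.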
This PR exceeds the recommended size of 200 lines. Please make sure you are NOT addressing multiple issues with one PR. Note this PR might be rejected due to its size.
Closing as stale |
Background
We should be tracking prompts that work successfully, both to understand what the system is capable of and to ensure code changes don't break what already works. Besides the unit tests, we need end-to-end test cases for real-world questions.
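For illustration, an end-to-end prompt test case could be described roughly as below; the class, field names, and example task are hypothetical and not taken from this PR's script:

```python
# Hypothetical illustration of how an end-to-end prompt test case could be
# described, following the smoke-test budget (about 3 cycles) discussed above.
from dataclasses import dataclass
from typing import Callable


@dataclass
class PromptTestCase:
    name: str
    prompt: str                    # the real-world question given to the agent
    max_cycles: int                # smoke tests should stay around 3 cycles
    check: Callable[[str], bool]   # validates the agent's final output


def file_contains(path: str, needle: str) -> Callable[[str], bool]:
    """Build a check that passes if the agent wrote `needle` into `path`."""
    def _check(_output: str) -> bool:
        try:
            with open(path, "r", encoding="utf-8") as f:
                return needle in f.read()
        except FileNotFoundError:
            return False
    return _check


SMOKE_TESTS = [
    PromptTestCase(
        name="write_greeting_file",
        prompt="Write the word 'hello' to a file named greeting.txt, then terminate.",
        max_cycles=3,
        check=file_contains("greeting.txt", "hello"),
    ),
]
```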
Changes
Documentation
Test Plan
Tested each test case multiple times
Tested the script with [all] as well as with individual test cases
PR Quality Checklist