
Psychological challenge: Sally and Anne's Test. #3871

Closed
waynehamadi opened this issue May 5, 2023 · 7 comments

Comments

@waynehamadi
Contributor

waynehamadi commented May 5, 2023

Summary 💡

This idea was brought up by @javableu. We would like to create a psychological challenge inspired by the Sally–Anne test:

https://en.wikipedia.org/wiki/Sally%E2%80%93Anne_test

[Screenshot: the Sally–Anne test scenario]

GPT-3.5 is able to solve the problem.
[Screenshot: GPT-3.5 solving the test]

But we need to make sure that Auto-GPT is able to do it temporally. What that means is: we can create a scenario that simulates the Sally–Anne test across multiple cycles, and verify that Auto-GPT remembers that Sally put the marble in her own basket.
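The multi-cycle scenario could be sketched roughly like this. This is a minimal, self-contained Python stub: `ScriptedAgent` and the `run_sally_anne_challenge` harness are hypothetical stand-ins for Auto-GPT's real agent loop and memory, not the project's actual API.

```python
# Hypothetical sketch of a multi-cycle Sally-Anne challenge.
# ScriptedAgent is a toy stand-in for Auto-GPT: it trivially passes,
# while the real challenge is whether Auto-GPT's memory does the same
# when each observation arrives in a separate cycle.

class ScriptedAgent:
    """Toy agent with a simple append-only memory."""

    def __init__(self):
        self.memory = []

    def observe(self, event: str) -> None:
        # In Auto-GPT this would be one cycle of the agent loop.
        self.memory.append(event)

    def answer(self, question: str) -> str:
        # Naive lookup: find where Sally last put the marble.
        for event in reversed(self.memory):
            if "Sally puts the marble" in event:
                return event.split("in ")[-1]
        return "unknown"


def run_sally_anne_challenge(agent) -> bool:
    # Feed the scenario one event per cycle.
    agent.observe("Sally puts the marble in her basket")
    agent.observe("Sally leaves the room")
    agent.observe("Anne moves the marble to her box")
    # The correct answer is where Sally *believes* the marble is:
    # the basket she left it in, not where Anne moved it.
    return agent.answer("Where will Sally look for the marble?") == "her basket"


if __name__ == "__main__":
    print(run_sally_anne_challenge(ScriptedAgent()))  # prints True
```

A real challenge would replace `ScriptedAgent` with the actual agent and spread the `observe` calls across genuine execution cycles, so the assertion exercises long-term memory rather than a single prompt.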

Join us on Discord (https://discord.gg/autogpt) and DM me (my Discord is merwanehamadi) if you're interested in creating this challenge.

@anonhostpi

Interesting proposal. I like the idea of collecting metrics on the behavior of AutoGPT.

May I propose a different approach to resolving these issues:
https://gist.github.com/anonhostpi/97d4bb3e9535c92b8173fae704b76264#observerregulatory-agents

There has been a lot of talk about using observer/regulatory agents to catch, stop, and report bad behavior. The general consensus is that these agents become obsessed with that kind of role and tend to report compliance violations too frequently.

However, collecting metrics on how likely Auto-GPT is to misbehave is also a good idea.

@anonhostpi

This would be a very fascinating test. It may serve as a basis for building such regulatory/observer agents.

@Boostrix
Contributor

Boostrix commented May 7, 2023

And there goes all the fun because you're offering to help us implement the equivalent of a police-officer.agent ....

@waynehamadi
Contributor Author

waynehamadi commented May 14, 2023

@Boostrix @anonhostpi do you know anyone that wants to do that challenge? It's pretty fun.

@Boostrix
Contributor

we need to make sure that autogpt is able to do it temporally [...] across multiple cycles

Personally, I doubt that Auto-GPT is there yet; there are currently so many challenges in other parts of the project. For instance, look at the number of people who have complained that Auto-GPT using GPT-3.5 takes more than 10 cycles and 10 minutes to write a hello-world program in Python, whereas GPT itself can provide the same program in a single second.

Thus, I don't believe this is the level of problem to be tackled right now. Don't get me wrong, I applaud you for all those challenges, but some of them are way off at this stage.

We need to solve at least another 20+ "basic" problems (challenges) first before even thinking about psychological stuff like this.

Agents not being able to plan out their actions properly is a problem, but it's worsened by lacking suspend/resume support.
So, while we need more challenges, these need to be more basic ones at this stage, I am afraid.

We need to build the foundation for more complex challenges at first.

@javableu
Contributor

If I give the full level 5 to ChatGPT-4, it solves it in one prompt. Auto-GPT manages level one well and is only slightly off on level 2. If you ask it, it can also give you the real position of the marbles, the conversations between each individual, the beliefs of beliefs, and so on. Also, if people struggle to get a hello world, it comes from the prompt: the model is confused by the role/goals it is given, and the goals are very often not well written. I haven't fully tested this theory, but I think the AI is REALLY sensitive to the wording. Differences between goals, outcomes, and objectives can potentially be crucial.

@waynehamadi
Contributor Author

@Boostrix yeah, ok, I am never going to complain when someone says "too early".
Good call, let's get the basics right first.
