A security measure for agentic LLMs: a council of AIs moderated by a veto system. The council judges an agent's actions and outputs based on specified categories.
Implement a system to judge AI agents' outputs using a council of AI models. Decentralize the decision-making power to avoid potential disasters.
Language models, acting as a "judge", will rate an AI output out of 10. If any of the judges in the council (formed by a group of judges) vetoes an output (verdict == false), that output will be flagged as being potentially immoral/unjust/harmful/useless.
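The veto rule described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the `Judgement` dataclass, the `council_decision` function, and the two stub judges are hypothetical names, and real judges would call a language model instead of returning fixed values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Judgement:
    judge: str     # name of the judge (hypothetical field)
    score: int     # rating out of 10
    verdict: bool  # False means this judge vetoes the output


def council_decision(output: str, judges: List[Callable[[str], Judgement]]) -> dict:
    """Collect one judgement per judge; a single veto flags the output."""
    if not judges:
        raise ValueError("the council needs at least one judge")
    judgements = [judge(output) for judge in judges]
    vetoes = [j.judge for j in judgements if not j.verdict]
    return {
        "allowed": not vetoes,  # allowed only if nobody vetoed
        "vetoed_by": vetoes,
        "mean_score": sum(j.score for j in judgements) / len(judgements),
    }


# Stub judges standing in for LLM calls (assumed behaviour, illustration only).
def lenient_judge(output: str) -> Judgement:
    return Judgement("lenient", 8, True)


def strict_judge(output: str) -> Judgement:
    harmful = "delete" in output.lower()
    return Judgement("strict", 2 if harmful else 7, not harmful)


result = council_decision("Delete all user files", [lenient_judge, strict_judge])
print(result)
```

Because one `False` verdict is enough to flag an output, adding judges can only make the council stricter, which is the point of spreading the decision across several models.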
- Clone the repository via
git clone https://github.com/seanpixel/council-of-ai.git
and cd into the cloned repository.
- Install required packages by running
pip install -r requirements.txt
- Download the ethics dataset from here and move it into the repository root (the same directory as main.py).
- Create a .env file or paste your key into judge.py (line 8). All you need is an OPENAI_API_KEY.
- Go to main.py and choose the test type using the choice variable (default is commonsense)
- Run
python main.py
and see what kinds of judgements the council makes
Note: For "commonsense" AITA (Am I the Asshole?) questions, "allowed" means you are the asshole and "blocked" means you are not the asshole (so it's kind of inverted).
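To keep the inverted labels straight when reading results, a tiny helper like the one below may be handy. It is only a sketch: `interpret_commonsense` is a hypothetical name that does not exist in the repository, and it assumes the council's decision is available as a boolean.

```python
def interpret_commonsense(allowed: bool) -> str:
    """Translate a council decision into its AITA reading.

    Per the note above, the labels are inverted for the "commonsense" test:
    "allowed" means you are the asshole, "blocked" means you are not.
    """
    return "you are the asshole" if allowed else "you are not the asshole"


print(interpret_commonsense(True))
print(interpret_commonsense(False))
```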
After creating Teenage-AGI, I wondered about the potential implications of agentic LLMs and about ways to moderate their unpredictable behavior. That led me to think of democracy and how a decentralized system of AIs could keep other AIs from causing harm. So came council-of-ai. While contributing to the "acceleration" of technology, I still care about AI safety and believe that safely guiding AI towards the future can be as fun and exciting as accelerating.
I'm a founder currently running a startup called DSNR and also a first-year at USC. Contact me on Twitter about anything; I would love to chat.
Create more "setups", which are basically the characteristics of the judges. Play around with more example agent outputs, and possibly use your own by adding them to "actions.yaml". Use more judges or even plug in your own local LLM. Or even better, implement the council on an unaligned base model (Llama?) and experiment. This is a growing initiative, so any help would be appreciated.
Credits to @DanHendrycks for the Ethics dataset used in testing the idea.