AIrmageddon is a prototype project that showcases how to run basic visual reasoning on a camera, a video, or a list of images. In this experiment, two LLM agents talk to each other to determine whether an image shows a potential threat, then decide whether or not to alert the owner.
Each frame or image is passed through YOLOv5; if the current frame differs sufficiently from the previous one, it is passed down to Florence-2 for captioning. The caption is in turn passed to an LLM (Llama 3.1 8B by default) that starts a conversation with a second LLM: the first takes the role of a Security Guard, the second of a Burglar. Once a decision is reached, the Security Guard can either move along with the analysis or raise an alert. All detected events are logged as memories and provided to the LLM as context to enhance its analysis.
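To make the flow concrete, here is a minimal sketch of the per-frame logic described above. It is not the project's actual code: the stage functions are passed in as plain callables since the real module and function names aren't documented here, and the comparison against `min_frame_similarity` follows the tuning note further down.

```python
from typing import Callable

def process_frame(frame, previous_frame, cfg: dict,
                  detect: Callable,      # frame -> iterable of YOLOv5 labels
                  similarity: Callable,  # (frame, frame) -> 0.0 .. 1.0
                  caption: Callable,     # frame -> Florence-2 caption string
                  converse: Callable,    # caption -> "alert" or "ok"
                  alert: Callable) -> None:  # send a push notification
    labels = set(detect(frame))
    changed = similarity(frame, previous_frame) < cfg["min_frame_similarity"]
    triggered = labels & set(cfg.get("triggers", []))

    # Escalate to the (comparatively expensive) captioning + LLM stage only
    # when the frame changed enough or a trigger keyword was detected.
    if changed or triggered:
        description = caption(frame)
        if converse(description) == "alert":
            alert(description)
```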
A third agent acts as a chatbot and can signal to the other agents when the owners are at home or have left. At the same time, the chatbot can tell users what happened throughout the day.
The project runs reasonably well on an Apple Silicon M2, using less than 5 GB of RAM. It should work on most CPUs.
Requirements:

- Llama.cpp (`brew install llama.cpp` on Mac)
- Redis (`brew install redis` or `apt-get install redis-server`)
- Espeak (`brew install espeak` on Mac)
- (Optional) Pushnotifier: an account is required (it's free), together with an API KEY, if you'd like to receive push notifications on your phone from the agents
Models:

- Llama 3.1 8B Instruct GGUF: `Q6_K` is suggested and has to be manually downloaded into the `models` folder
- YOLOv5: weights for `yolov5m6.pt` are automatically downloaded if not found in the `models` folder
- Florence-2-large: weights are automatically downloaded if the `Florence-2-large` folder is not found in the `models` folder (see the sketch below)
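As an illustration of the auto-download behaviour, here is a small sketch assuming the Florence-2 weights come from the Hugging Face Hub; the project's actual download logic may differ.

```python
from pathlib import Path
from huggingface_hub import snapshot_download

def ensure_florence2(models_folder: str) -> Path:
    """Download microsoft/Florence-2-large into <models_folder>/Florence-2-large
    only if that folder does not exist yet."""
    target = Path(models_folder) / "Florence-2-large"
    if not target.exists():
        snapshot_download(repo_id="microsoft/Florence-2-large", local_dir=str(target))
    return target
```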
Copy `conf/config-sample.yml` to `conf/config.yml` and edit the file accordingly. Default values should work for most scenarios, but you still need to set `camera_feed` and `models_folder`.
If you prefer environment variables, or you're running the script in Docker, copy `.env-sample` to `.env` and edit it accordingly. Remember that environment variables override configuration variables.
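A minimal sketch of that precedence rule, assuming the common convention of upper-casing the key name for the environment variable (the project's actual lookup may differ):

```python
import os
import yaml

def load_setting(name: str, config_path: str = "conf/config.yml"):
    """Return a setting, letting an environment variable (e.g. CAMERA_FEED)
    override the value found in config.yml."""
    with open(config_path) as f:
        config = yaml.safe_load(f) or {}
    return os.environ.get(name.upper(), config.get(name))

camera_feed = load_setting("camera_feed")
models_folder = load_setting("models_folder")
```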
These are the available settings:

- `camera_feed`: path to your camera, a video, or a folder with images or files to analyze (e.g. rtsp://camera.local/feed)
- `llama_server`: `llama-server` address
- `llama_server_completion`: `llama-server` address for the /completion API
- `openai_server_completion`: OpenAI (or compatible providers') completion API endpoint
- `chatbot_voice_enabled`: enable or disable (`true` or `false`) text-to-speech in the chatbot UI, requires `espeak`
- `use_local_llm`: whether to do inference with the local `llama_server` or through the `openai_server_completion` API
- `owners_away`: sets the initial status of the owners when the script starts
- `models_folder`: path to the folder containing all models' weights
- `llm_weights`: LLM weights filename
- `yolo_weights`: YOLO weights filename
- `redis_host`: Redis host
- `redis_port`: Redis port
- `redis_db`: Redis db
- `openai_api_key`: any string if you're using a local LLM, otherwise your API key for that provider
- `guard_max_memories`: the maximum number of memories to pass to the agents
- `min_frame_similarity`: the frame-similarity threshold (from 0.0 to 1.0) below which a frame is considered different enough to trigger an LLM analysis
- `triggers`: a list of keywords that trigger an analysis even if the `min_frame_similarity` threshold has not been hit (applied to YOLO detections and YOLO-supported categories)
- `frame_capture_interval`: how often, in seconds, to run analysis on a frame from a camera or a stream (does not apply to folders and videos); set to 0 to disable throttling
- `deploy_id`: only needed if you're using DeepInfra as your OpenAI-compatible provider
- `model_name`: not necessary if you're using a local LLM, otherwise meta-llama/Meta-Llama-3-8B-Instruct or the model you prefer
- `captures_folder`: path to the folder where you want to store all images and responses analyzed by the LLM
- `pn_username`: your Pushnotifier username (required by the API)
- `pn_password`: your Pushnotifier password (required by the API)
- `pn_package_name`: your Pushnotifier package name (required by the API), you can create this from here
- `pn_api_key`: your Pushnotifier API KEY (required by the API), you can obtain this from here
The only parameter you'll probably want to play with is `min_frame_similarity`; empirically, anything from `0.75` to `0.89` seems to work well. Higher values will trigger the analysis on too many frames, especially if there's movement, while lower values reduce the sensitivity and only trigger an analysis when two frames are very different.
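The similarity metric itself isn't documented here; as one plausible, self-contained example, grayscale histogram correlation with OpenCV produces a score in the same 0.0-1.0 range and behaves the way the tuning note describes:

```python
import cv2

def frame_similarity(frame_a, frame_b) -> float:
    """Return a 0.0-1.0 similarity score between two BGR frames using
    grayscale histogram correlation (an illustrative metric, not
    necessarily the one the project uses)."""
    hists = []
    for frame in (frame_a, frame_b):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        cv2.normalize(hist, hist)
        hists.append(hist)
    score = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return max(0.0, min(1.0, float(score)))

# A frame is analyzed when it is *less* similar to the previous one than
# the threshold, e.g.:
#   if frame_similarity(current, previous) < cfg["min_frame_similarity"]: ...
```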
- Clone this project: `git clone https://github.com/RamonBeast/airmageddon.git` and `cd airmageddon`
- Create a `virtualenv`: `virtualenv venv` (`pip install virtualenv`)
- Activate it: `. ./venv/bin/activate`
- Install all requirements: `pip install -r requirements.txt`
- Copy `conf/config-sample.yml` to `conf/config.yml` and edit it (or do the same with `.env-sample` in the project root if you use Docker)
- Make sure `redis` is running
- Start `llama-server`: `llama-server --model /models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf --ctx-size 16384 -ngl 99 --simple-io --threads 30` (these settings work well on Apple Silicon M2)
- Run the chatbot: `chainlit run chatbot.py -w`
- Start the alarm system: `python monitor.py` (this will stream all conversations between the agents)
The chatbot is available by default at http://localhost:8000/. If you tell the chatbot that you are back home, or that you are leaving, it will set the state accordingly and the other agents will know whether you're home or not. This changes the way they alert about potential threats: for instance, if you're watching TV, the agents won't usually send alerts. Sometimes they will alert you anyway with notifications like `no threats, everything is ok` or, funnily, `Person lying on the couch, they might be hurt` and similar silly alerts. If you leave home, the agents will usually trigger a notification if they see someone on the cameras.
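Since Redis is already a requirement, the home/away flag is presumably shared through it. Here is a minimal sketch of that idea; the key name (`owners_away`) is borrowed from the configuration option and is an assumption, not the project's actual schema:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def set_owners_away(away: bool) -> None:
    """Called by the chatbot when the user says they are leaving or back home."""
    r.set("owners_away", int(away))

def owners_are_away() -> bool:
    """Checked by the monitoring agents before deciding how aggressively to alert."""
    value = r.get("owners_away")
    return bool(int(value)) if value is not None else False
```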
System prompts are defined in the `agents` section of the `conf/config.yml` file. If you're using the smaller 8B model, agents will lose attention easily and start to hallucinate, especially if they're passed too many memories (hence the default of `3`). In longer conversations they might have trouble calling functions correctly. Larger models, like 70B, handle the prompt more efficiently and hallucinate less.
Other settings like `temperature` and `min_p` are set in `brain.py`. The default values are sensible for this specific use case: if the LLM gets too creative, it will have trouble handling function calling. This problem is solved with slightly larger models.
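For reference, here is a minimal sketch of what a completion call with those sampling settings looks like against llama-server's `/completion` endpoint; the payload values are illustrative, not the ones hard-coded in `brain.py`:

```python
import requests

def complete(prompt: str, server: str = "http://localhost:8080") -> str:
    payload = {
        "prompt": prompt,
        "temperature": 0.4,  # keep the agents from getting too creative
        "min_p": 0.1,        # min-p sampling, supported by llama.cpp
        "n_predict": 256,    # cap the response length
    }
    response = requests.post(f"{server}/completion", json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["content"]
```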
You can implement as many custom functions as you like, for example to allow the agents to turn cameras on or off, or lights on or off, depending on your tolerance for chaos. Functions should be implemented in `functions.py` and their usage should be specified in the agent's system prompt in `conf/config.yml`.
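As an example of the kind of function you could add, here is a hypothetical light switch; its name, signature, and whatever registration `functions.py` expects are assumptions, and the agent would only learn how to call it from its system prompt:

```python
def turn_lights(state: str) -> str:
    """Hypothetical custom function: turn the lights 'on' or 'off' and return
    a short status string the agent can feed back into the conversation."""
    if state not in ("on", "off"):
        return f"Unknown state '{state}', expected 'on' or 'off'"
    # Call your home-automation backend here (Home Assistant, MQTT, a relay, ...).
    return f"Lights turned {state}"
```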
Please don't use this project for anything serious. Not only is it awfully written, but LLMs are quite unreliable, especially at smaller sizes. Fun is ensured, but don't leave them in charge of your home security: they will make a mess and give you several heart attacks.