This Python script generates image captions using different large language models through Simon Willison's llm CLI tool.

Requirements:
- Python 3.x
- Ollama (for local models):
brew install ollama
Installation:
- Install uv:
pip install -U uv
- Create a virtual environment:
uv venv
- Install llm and verify path:
uv pip install -r requirements.txt
- Activate the virtual environment:
source .venv/bin/activate
- Install LLM plugins:
# Local models via Ollama
uv pip install llm-ollama
# Anthropic Claude models
uv pip install llm-anthropic
# Mistral models
uv pip install llm-mistral
- Pull required local models:
ollama pull llava:13b
ollama pull llava:34b
ollama pull llava-llama3
ollama pull llama3.2-vision:11b-instruct-q8_0
ollama pull minicpm-v
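After installing the plugins and pulling the models, you can check what llm can see: llm plugins lists the installed plugins and llm models lists the models available to the CLI (run these with the virtual environment active):
llm plugins
llm models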
To upgrade llm and its plugins:
uv pip install -U llm
uv pip install -U llm-ollama llm-anthropic llm-mistral
Set up API keys for cloud-based models:
# OpenAI (for GPT-4 Vision)
llm keys set openai
# Anthropic (for Claude)
llm keys set anthropic
# Mistral (for Pixtral models)
llm keys set mistral
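To confirm which keys are stored, llm keys list prints the names of the saved keys:
llm keys list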
This tool supports all vision and multi-modal models available through the llm CLI tool. The models.yaml file configures model-specific parameters such as prompts, temperature, and token limits. Several models are pre-configured, and you can add any model supported by llm by adding its configuration to models.yaml (a hypothetical entry is sketched after the list below).

Pre-configured models:
- Claude 3 Sonnet (Anthropic) - anthropic/claude-3-sonnet-20240229
- GPT-4 Vision (OpenAI) - chatgpt-4o-latest
- Pixtral 12B (Mistral) - mistral/pixtral-12b-latest
- Pixtral Large (Mistral) - mistral/pixtral-large-latest
- LLaVA 13B - llava:13b
- LLaVA 34B - llava:34b
- LLaVA Llama3 - llava-llama3
- Llama 3.2 Vision (11B) - llama3.2-vision:11b-instruct-q8_0
- MiniCPM-V - minicpm-v
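To add another model, a new entry in models.yaml might look roughly like the following. This is only a hypothetical sketch: the model id and the field names (prompt, temperature, max_tokens) are illustrative assumptions, not the file's confirmed schema, so copy an existing entry from models.yaml and follow its actual field names.
# Hypothetical models.yaml entry; model id and field names are illustrative only
llava:7b:
  prompt: "Describe this image in one concise sentence."
  temperature: 0.4
  max_tokens: 200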
List available models:
./caption.py --list
Generate captions using all models:
./caption.py path/to/image.jpg
Use specific models:
./caption.py path/to/image.jpg --model chatgpt-4o-latest pixtral-12b
Add context to improve caption accuracy:
./caption.py path/to/image.jpg --context "Photo taken at DrupalCon Barcelona 2024"
./caption.py path/to/image.jpg --context "Location: Isle of Skye, Scotland"
Additional options:
--context # Add contextual information to improve caption accuracy
--time # Include execution time in output
--debug # Show detailed debug information
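These options can be combined with --model in a single invocation, for example:
./caption.py path/to/image.jpg --model llava:13b --context "Photo taken at DrupalCon Barcelona 2024" --time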
Standard output:
{
"image": "path/to/image.jpg",
"captions": {
"model-name": "Generated caption.",
"another-model": "Another caption."
}
}
With timing information (--time flag):
{
"image": "path/to/image.jpg",
"captions": {
"model-name": {
"caption": "Generated caption.",
"time": 2
}
}
}
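Because the result is plain JSON on standard output, it can be piped into other tools. As one sketch, assuming jq is installed (it is not part of this project's requirements) and that the model name appears as the key, as in the standard output example above:
./caption.py path/to/image.jpg --model llava:13b | jq -r '.captions["llava:13b"]'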