This project uses a range of vision-language models, from early image-captioning models to modern multimodal LLMs, to generate image captions. It provides an easy-to-use script for creating captions and comparing results from different models.
The script supports the following models:
- ViT-GPT2 (2021)
- Microsoft GIT (2022)
- BLIP Large (2022)
- BLIP-2 with OPT backbone (2023)
- BLIP-2 with FLAN-T5 backbone (2023)
- MiniCPM-V (2024)
- LLaVA (13B) (2024)
- LLaVA (34B) (2024)
- Llama 3.2 Vision (11B, Q8) (2024)
These installation instructions were written for macOS on a MacBook. While the general steps should apply to other platforms, some commands (e.g. those using Homebrew) may require adjustments for Linux or Windows.
Python 3.11 is required for compatibility with PyTorch, the machine learning framework that powers the Hugging Face models used here. While newer Python versions are available, PyTorch does not support them yet. If you have a newer Python version installed, you will need to install Python 3.11 alongside it.
This project also uses uv, a package and dependency manager for Python. uv simplifies dependency resolution and creates isolated virtual environments, keeping your system clean and preventing version conflicts.
Let's install Python 3.11 and uv using Homebrew:
brew install [email protected]
brew install uv
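You can verify both installations with:
uv --version
brew list --versions [email protected]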
Clone the Git repository to download the script and all its files:
git clone https://github.com/dbuytaert/image-caption
cd image-caption
Create a virtual environment using uv:
uv venv -p 3.11
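The uv pip commands below pick up this environment automatically when run from the project directory. If you prefer plain python and pip commands, activate the environment first:
source .venv/bin/activate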
The script relies on the following Python libraries:
- torch: A deep learning framework for handling neural network operations.
- transformers: Provides pre-trained models and tools for text and vision processing.
- pillow: Handles image processing for vision-language models.
- ollama: Enables local deployment of large language models with vision support.
These dependencies are defined in requirements.txt. Install them in your virtual environment using the following command:
uv pip install -r requirements.txt
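To see how these libraries fit together, here is a minimal, self-contained captioning sketch using the Hugging Face image-to-text pipeline with the BLIP Large model. It illustrates the general approach, not the exact code in caption.py:

# minimal_caption.py -- illustrative only; the project's script wraps this kind of call for each model
from transformers import pipeline

# Downloads the model on first use and caches it under ~/.cache/huggingface/
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

# pillow loads the image behind the scenes; torch runs the model
result = captioner("image.jpg")
print(result[0]["generated_text"])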
While we interact with Ollama through its Python API, the Ollama service needs to be running locally. Install and start it with:
brew install ollama
ollama serve
Keep this service running in a separate terminal while using the captioning script.
Before using a model served by Ollama, you first need to pull it. For example, to pull the three Ollama-backed models used by this script:
ollama pull llava:13b
ollama pull llava:34b
ollama pull llama3.2-vision:11b-instruct-q8_0
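For the Ollama-backed models, the Python API sends the image alongside a prompt to the locally running service. Here is a minimal sketch (the prompt is illustrative, not necessarily what caption.py uses); it requires ollama serve to be running:

# ollama_caption.py -- illustrative only
import ollama

# Send the image to the local Ollama service and print the model's reply.
response = ollama.chat(
    model="llava:13b",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["image.jpg"],  # path to a local image file
        }
    ],
)
print(response["message"]["content"])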
Make the script executable:
chmod +x caption
Run the script:
# Run all models on the provided image
./caption image.jpg
# Print a list of all available models
./caption --list
# Run a single model
./caption image.jpg --model git
# Run a single model and capture execution time
./caption image.jpg --model git --time
# Run multiple models
./caption image.jpg --model blip2-flan llama32-vision-11b
# Run all models on multiple files
find . -name "*.jpg" -exec ./caption {} \;
# Run all models on multiple files and combine the output into a single JSON file
find . -name "*.jpg" -exec ./caption {} \; | jq -s '{"results": .}' > captions.json
# Same as above but capture execution time
find . -name "*.jpg" -exec ./caption {} --time \; | jq -s '{"results": .}' > captions.json
Beware: The first time you run the script, it will download the model data from Hugging Face and additional models from Ollama. This initial download is very large and may take some time depending on your internet connection. Subsequent runs will use the cached models and be much faster.
Example output:
./caption --list
Available models:
vit-gpt2 - ViT-GPT2 (2021)
git - Microsoft GIT (2022)
blip - BLIP Large (2022)
blip2-opt - BLIP-2 with OPT backbone (2023)
blip2-flan - BLIP-2 with FLAN-T5 backbone (2023)
minicpm-v - MiniCPM-V (2024)
llava-13b - Large Language and Vision Assistant (13B) (2024)
llava-34b - Large Language and Vision Assistant (34B) (2024)
llama32-vision-11b - Llama 3.2 Vision (11B, Q8) (2024)
./caption test-images/image-1.jpg
{
"image": "test-images/image-1.jpg",
"captions": {
"vit-gpt2": "A city at night with skyscrapers and a traffic light on the side of the street in front of a tall building.",
"git": "A busy city street is lit up at night, with the word qroi on the right side of the sign.",
"blip": "This is an aerial view of a busy city street at night with lots of people walking and cars on the side of the road.",
"blip2-opt": "An aerial view of a busy city street at night.",
"blip2-flan": "An aerial view of a busy street in tokyo, japanese city at night with large billboards.",
"minicpm-v": "A bustling cityscape at night with illuminated billboards and advertisements, including one for Michael Kors.",
"llava-13b": "A bustling nighttime scene from Tokyo's famous Shibuya Crossing, characterized by its bright lights and dense crowds of people moving through the intersection.",
"llava-34b": "A bustling city street at night, filled with illuminated buildings and numerous pedestrians.",
"llama32-vision-11b": "A bustling city street at night, with towering skyscrapers and neon lights illuminating the scene."
}
}
From here, you can use the captions, compare them for consistency, combine them to create more accurate descriptions, or process them with another language model for translation or improvement. Get creative!
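As one possible take on that last idea, you could feed all captions for an image to an Ollama model and ask it to merge them into a single description. The prompt and model choice below are purely illustrative:

# combine_captions.py -- illustrative sketch, not part of the project
import json
import ollama

with open("captions.json") as f:
    data = json.load(f)

for entry in data["results"]:
    prompt = "Combine these captions into one accurate description:\n" + "\n".join(
        entry["captions"].values()
    )
    reply = ollama.chat(
        model="llama3.2-vision:11b-instruct-q8_0",
        messages=[{"role": "user", "content": prompt}],
    )
    print(entry["image"], "->", reply["message"]["content"])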
Adding new models is straightforward. Open caption.py and look for the MODELS sections near the top of the script. You can easily add additional Hugging Face or Ollama models.
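For illustration only, a new entry could look roughly like the sketch below; the actual field names and structure in caption.py may differ, so copy the format of the existing entries rather than this sketch:

# Hypothetical entry -- mirror the existing entries in caption.py; the field names here are invented
"blip2-opt-6.7b": {
    "name": "BLIP-2 with OPT 6.7B backbone (2023)",
    "type": "huggingface",                    # or "ollama" for Ollama-served models
    "model_id": "Salesforce/blip2-opt-6.7b",  # Hugging Face Hub ID or Ollama tag
},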
If you have success with different or newer models, I'd love to hear from you! You can reach me at [email protected] or contribute by opening a ticket or pull request on GitHub.
The model data is stored at the following locations:
- Hugging Face models: stored in ~/.cache/huggingface/
- Ollama models: stored in ~/.ollama/models/
If you need to free up disk space later:
- For Hugging Face models: use transformers-cli cache clean or manually delete ~/.cache/huggingface/.
- For Ollama models: remove specific models with ollama rm [model-name].
The project includes unit tests for the caption clean-up functionality. To run the tests:
# Create and activate a virtual environment
uv venv --python=python3.11
source .venv/bin/activate
# Install the dependencies in the activated environment
uv pip install -r requirements.txt
# Run the tests
python3 -m unittest test_caption.py -v