This project uses a range of vision-language models, from early image-captioning models to modern multimodal LLMs, to generate image captions. It provides an easy-to-use script for creating captions and comparing results from different models.
The script supports the following models:
- ViT-GPT2 (2021)
- Microsoft GIT (2022)
- BLIP Large (2022)
- BLIP-2 with OPT backbone (2023)
- BLIP-2 with FLAN-T5 backbone (2023)
- MiniCPM-V (2024)
- LLaVA (13B) (2024)
- LLaVA (34B) (2024)
- Llama 3.2 Vision (11B, Q8) (2024)
These installation instructions were written for macOS on a MacBook. While the general steps should apply to other platforms, some commands (e.g. those using Homebrew) may require adjustments for Linux or Windows.
Python 3.11 is required for compatibility with PyTorch, the machine learning framework that powers the Hugging Face models used here. While newer Python versions are available, PyTorch does not support them yet. If you have a newer Python version installed, you will need to install Python 3.11 alongside it.
This project also uses uv, a package and dependency manager for Python. uv simplifies dependency resolution and creates isolated virtual environments, keeping your system clean and preventing version conflicts.
Let's install Python 3.11 and uv using Homebrew:
brew install [email protected]
brew install uv
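You can verify both installations with:
uv --version
brew list --versions [email protected]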
Clone the Git repository to download the script and all its files:
git clone https://github.com/dbuytaert/image-caption
cd image-caption
Create a virtual environment using uv:
uv venv -p 3.11
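The uv pip commands below pick up this environment automatically when run from the project directory. If you prefer plain python and pip commands, activate the environment first:
source .venv/bin/activate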
The script relies on the following Python libraries:
- torch: A deep learning framework for handling neural network operations.
- transformers: Provides pre-trained models and tools for text and vision processing.
- pillow: Handles image processing for vision-language models.
- ollama: Enables local deployment of large language models with vision support.
These dependencies are defined in requirements.txt. Install them in your virtual environment using the following command:
uv pip install -r requirements.txt
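To see how these libraries fit together, here is a minimal, self-contained captioning sketch using the Hugging Face image-to-text pipeline with the BLIP Large model. It illustrates the general approach, not the exact code in caption.py:

# minimal_caption.py -- illustrative only; the project's script wraps this kind of call for each model
from transformers import pipeline

# Downloads the model on first use and caches it under ~/.cache/huggingface/
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-large")

# pillow loads the image behind the scenes; torch runs the model
result = captioner("image.jpg")
print(result[0]["generated_text"])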
While we interact with Ollama through its Python API, the Ollama service needs to be running locally. Install and start it with:
brew install ollama
ollama serve
Keep this service running in a separate terminal while using the captioning script.
Before using a model served by Ollama, you first need to pull it. For example, to pull the three Ollama-backed models used by this script:
ollama pull llava:13b
ollama pull llava:34b
ollama pull llama3.2-vision:11b-instruct-q8_0
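For the Ollama-backed models, the Python API sends the image alongside a prompt to the locally running service. Here is a minimal sketch (the prompt is illustrative, not necessarily what caption.py uses); it requires ollama serve to be running:

# ollama_caption.py -- illustrative only
import ollama

# Send the image to the local Ollama service and print the model's reply.
response = ollama.chat(
    model="llava:13b",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["image.jpg"],  # path to a local image file
        }
    ],
)
print(response["message"]["content"])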
Make the script executable:
chmod +x caption
Run the script:
# Run all models on the provided image
./caption image.jpg
# Print a list of all available models
./caption --list
# Run a single model
./caption image.jpg --model git
# Run a single model and capture execution time
./caption image.jpg --model git --time
# Run multiple models
./caption image.jpg --model blip2-flan llama32-vision-11b
# Run all models on multiple files
find . -name "*.jpg" -exec ./caption {} \;
# Run all models on multiple files and combine the output into a single JSON file
find . -name "*.jpg" -exec ./caption {} \; | jq -s '{"results": .}' > captions.json
# Same as above but capture execution time
find . -name "*.jpg" -exec ./caption {} --time \; | jq -s '{"results": .}' > captions.json
Beware: The first time you run the script, it will download the model data from Hugging Face and additional models from Ollama. This initial download is very large and may take some time depending on your internet connection. Subsequent runs will use the cached models and be much faster.
Example output:
./caption --list
Available models:
vit-gpt2 - ViT-GPT2 (2021)
git - Microsoft GIT (2022)
blip - BLIP Large (2022)
blip2-opt - BLIP-2 with OPT backbone (2023)
blip2-flan - BLIP-2 with FLAN-T5 backbone (2023)
minicpm-v - MiniCPM-V (2024)
llava-13b - Large Language and Vision Assistant (13B) (2024)
llava-34b - Large Language and Vision Assistant (34B) (2024)
llama32-vision-11b - Llama 3.2 Vision (11B, Q8) (2024)
./caption test-images/image-1.jpg
{
"image": "test-images/image-1.jpg",
"captions": {
"vit-gpt2": "A city at night with skyscrapers and a traffic light on the side of the street in front of a tall building.",
"git": "A busy city street is lit up at night, with the word qroi on the right side of the sign.",
"blip": "This is an aerial view of a busy city street at night with lots of people walking and cars on the side of the road.",
"blip2-opt": "An aerial view of a busy city street at night.",
"blip2-flan": "An aerial view of a busy street in tokyo, japanese city at night with large billboards.",
"minicpm-v": "A bustling cityscape at night with illuminated billboards and advertisements, including one for Michael Kors.",
"llava-13b": "A bustling nighttime scene from Tokyo's famous Shibuya Crossing, characterized by its bright lights and dense crowds of people moving through the intersection.",
"llava-34b": "A bustling city street at night, filled with illuminated buildings and numerous pedestrians.",
"llama32-vision-11b": "A bustling city street at night, with towering skyscrapers and neon lights illuminating the scene."
}
}
From here, you can use the captions, compare them for consistency, combine them to create more accurate descriptions, or process them with another language model for translation or improvement. Get creative!
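As one possible take on that last idea, you could feed all captions for an image to an Ollama model and ask it to merge them into a single description. The prompt and model choice below are purely illustrative:

# combine_captions.py -- illustrative sketch, not part of the project
import json
import ollama

with open("captions.json") as f:
    data = json.load(f)

for entry in data["results"]:
    prompt = "Combine these captions into one accurate description:\n" + "\n".join(
        entry["captions"].values()
    )
    reply = ollama.chat(
        model="llama3.2-vision:11b-instruct-q8_0",
        messages=[{"role": "user", "content": prompt}],
    )
    print(entry["image"], "->", reply["message"]["content"])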
Adding new models is straightforward. Open caption.py and look for the MODELS sections near the top of the script. You can easily add additional Hugging Face or Ollama models.
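For illustration only, a new entry could look roughly like the sketch below; the actual field names and structure in caption.py may differ, so copy the format of the existing entries rather than this sketch:

# Hypothetical entry -- mirror the existing entries in caption.py; the field names here are invented
"blip2-opt-6.7b": {
    "name": "BLIP-2 with OPT 6.7B backbone (2023)",
    "type": "huggingface",                    # or "ollama" for Ollama-served models
    "model_id": "Salesforce/blip2-opt-6.7b",  # Hugging Face Hub ID or Ollama tag
},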
If you have success with different or newer models, I'd love to hear from you! You can reach me at [email protected] or contribute by opening a ticket or pull request on GitHub.
The model data is stored at the following locations:
- Hugging Face models: stored in ~/.cache/huggingface/
- Ollama models: stored in ~/.ollama/models/
If you need to free up disk space later:
- For Hugging Face models: use transformers-cli cache clean or manually delete ~/.cache/huggingface/.
- For Ollama models: remove specific models with ollama rm [model-name].
The project includes unit tests for the caption clean-up functionality. To run the tests:
# Create and activate a virtual environment
uv venv --python=python3.11
source .venv/bin/activate
# Install the dependencies in the activated environment
uv pip install -r requirements.txt
# Run the tests
python3 -m unittest test_caption.py -v