This repository contains a GUI application for image captioning with your own fine-tune of the PaliGemma model. It supports both single-image and batch processing, and can load models from a local directory or from Hugging Face.
- Load PaliGemma models from local directories or Hugging Face
- Single image captioning with preview
- Batch processing of images in a folder
- Edit and save generated captions
- Modern and user-friendly interface
- Python 3.7 or higher
- PyQt6
- torch
- transformers
- Pillow
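If you need to recreate `requirements.txt` yourself, a minimal version could look like the following (unpinned; this README does not specify exact versions):

```
PyQt6
torch
transformers
Pillow
tqdm  # only needed for the CLI script
```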
It's recommended to use a virtual environment for this project. Set it up as follows.

On Windows:

```
python -m venv venv
venv\Scripts\activate
```

On macOS/Linux:

```
python3 -m venv venv
source venv/bin/activate
```
Once your virtual environment is activated, install the required packages:

```
pip install -r requirements.txt
```
- Activate your virtual environment (if not already activated).
- Run the application:

  ```
  python paligemma_gui.py
  ```

- Load a PaliGemma model:
  - Enter a local path to a PaliGemma model directory, or
  - Enter a Hugging Face model ID (e.g., "markury/paligemma-448-ft-1")
  - Click "Load Model"
- For single image captioning:
  - Go to the "Single Image" tab
  - Select an image or enter an image path
  - (Optional) Enter input text
  - Click "Generate Caption"
- For batch processing:
  - Go to the "Batch Processing" tab
  - Select a folder containing images
  - (Optional) Enter input text for batch processing
  - Click "Process Batch"
  - Double-click a result to edit its caption
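For reference, the loading and captioning steps above boil down to a few `transformers` calls. The sketch below is a minimal, self-contained approximation; the model ID, prompt, and generation settings are illustrative assumptions, not the exact code in `paligemma_gui.py`:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "markury/paligemma-448-ft-1"  # or a local model directory
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor (tokenizer + image preprocessing) and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).to(device)

image = Image.open("path/to/image.jpg").convert("RGB")
prompt = "describe"  # the optional input text from the GUI

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# The output begins with the prompt tokens; keep only the newly generated ones.
prompt_len = inputs["input_ids"].shape[1]
caption = processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
print(caption)
```

Batch mode simply repeats the same generate-and-decode step for each image in the selected folder.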
- If you encounter any CUDA-related errors, ensure that your PyTorch installation matches your CUDA version.
- For "module not found" errors, make sure you've activated the virtual environment and installed all dependencies.
- If the GUI doesn't launch, check that you have PyQt6 installed correctly.
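A quick way to check which PyTorch version you have, which CUDA version it was built against, and whether a GPU is actually visible:

```python
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built with (None for CPU-only builds)
print(torch.cuda.is_available())  # whether a usable GPU was found
```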
For scenarios where a graphical user interface is not available (e.g., running on a remote server or in a cloud environment), you can use the command-line interface (CLI) version of the PaliGemma Image Captioner.
The CLI script (`paligemma_cli.py`) supports both single image captioning and batch processing. Here's how to use it:
- Ensure you have all the required dependencies installed:

  ```
  pip install torch transformers pillow tqdm
  ```

- Run the script with the appropriate arguments:

  ```
  python paligemma_cli.py --model <model_path_or_id> --mode <single|batch> [other options]
  ```
- `--model`: Path to a local model or Hugging Face model ID (required)
- `--mode`: "single" for single image processing or "batch" for processing a folder of images (required)
- `--image`: Path to the image file (required for single mode)
- `--folder`: Path to the folder containing images (required for batch mode)
- `--text`: Input text for captioning (optional)
- `--output`: Output file path for the single-mode caption (optional)
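An argument parser matching these options could be defined as in the sketch below. This is a hedged approximation of what `paligemma_cli.py` likely does, not a copy of its actual code:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="PaliGemma image captioning CLI")
    parser.add_argument("--model", required=True,
                        help="Path to a local model or a Hugging Face model ID")
    parser.add_argument("--mode", required=True, choices=["single", "batch"],
                        help="Process one image or a whole folder")
    parser.add_argument("--image", help="Image file path (required for single mode)")
    parser.add_argument("--folder", help="Image folder path (required for batch mode)")
    parser.add_argument("--text", default="", help="Optional input text for captioning")
    parser.add_argument("--output", help="Optional output file for the single-mode caption")
    args = parser.parse_args()

    # argparse can't express "required only in one mode", so validate by hand.
    if args.mode == "single" and not args.image:
        parser.error("--image is required when --mode single")
    if args.mode == "batch" and not args.folder:
        parser.error("--folder is required when --mode batch")
    return args
```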
- Single image captioning:

  ```
  python paligemma_cli.py --model markury/paligemma-448-ft-1 --mode single --image path/to/image.jpg --text "Describe this image:" --output caption.txt
  ```

- Batch processing:

  ```
  python paligemma_cli.py --model path/to/local/model --mode batch --folder path/to/image/folder --text "describe"
  ```
- For batch processing, the script automatically saves each caption as a text file with the same base name as the image, in the same folder.
- The script uses CUDA if available, otherwise it falls back to CPU processing.
- Progress is displayed using a progress bar for batch processing.
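The batch behavior described above amounts to a loop like the following sketch. Here `caption_image` is a hypothetical wrapper around the model call, not a function exported by the script:

```python
from pathlib import Path
from tqdm import tqdm

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}

def caption_folder(folder, caption_image, text=""):
    """Caption every image in `folder`, writing <name>.txt beside each image.

    `caption_image(path, text) -> str` is an assumed wrapper around the model.
    """
    images = sorted(p for p in Path(folder).iterdir()
                    if p.suffix.lower() in IMAGE_EXTENSIONS)
    for path in tqdm(images, desc="Captioning"):
        caption = caption_image(path, text)
        path.with_suffix(".txt").write_text(caption, encoding="utf-8")
```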
Remember to activate your virtual environment before running the script if you're using one.
Contributions to improve the PaliGemma Image Captioner are welcome. Please feel free to submit pull requests or create issues for bugs and feature requests.
This project uses the PaliGemma model, which is based on the work by Google Research. Special thanks to the Hugging Face team for providing easy access to transformer models.