This Python script processes a folder of images, extracts text using Tesseract OCR, and matches the extracted text against specified regex patterns. It is designed to handle batch processing of images and identifies images that contain text matching the given patterns.
- Python 3.x
- Tesseract OCR installed on your system
Install the required Python libraries using:
pip install -r requirements.txt
Ensure Tesseract OCR is installed on your system. Installation instructions can be found at Tesseract's GitHub repository. Usage
Run the script with the following command:
python detecAi.py -f [folder_path] -mr [regex_patterns] -bs [batch_size] -o [output_file]
- -f/--folder: Path to the folder containing images.
- -mr/--regex: List of regex patterns to search in the text.
- -bs/--batch-size: Number of images to process in each batch (default: 25).
- -o/--output-file: Output file to save the names of matched images (default: matched_images.txt)
python detecAi.py -f ./images -mr "\\d{3}-\\d{2}-\\d{4}" -bs 10 -o results.txt
https://youtu.be/W-riZ-_lO0Q?si=2AHpVmdljpTsm4Tr
Contributions to this project are welcome. Please fork the repository and open a pull request with your changes or suggestions.
Tesseract OCR, for the OCR engine.
Pillow, for image processing capabilities.