Agropontos Regex is a small Python program that extracts geolocation coordinates from PDF files, e.g. rural property registration documents.
It supports two coordinate types, UTM and Lat-Long, and generates a CSV file that can be imported directly into GIS software such as QGIS.
The program interface works like a notepad, so you can correct any errors or wrong characters introduced by the OCR scan. It also generates a new PDF file with the page tilt and rotation corrected.
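
The tilt and rotation correction can be pictured as an OCRmyPDF call. The sketch below is only an illustration and is not taken from the program's source: the helper names `build_ocr_command` and `run_ocr`, the file names, and the default language `eng` are assumptions. The `--deskew`, `--rotate-pages` and `--language` options are standard OCRmyPDF command-line options.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="eng"):
    """Build an OCRmyPDF command line (hypothetical helper, for illustration)."""
    return [
        "ocrmypdf",
        "--deskew",        # straighten tilted pages
        "--rotate-pages",  # fix pages scanned sideways or upside down
        "--language", language,  # Tesseract language code, e.g. "eng" or "por"
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf, output_pdf, language="eng"):
    """Run OCRmyPDF; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf, language), check=True)
```

Calling `run_ocr("scan.pdf", "scan_fixed.pdf", language="por")` would need OCRmyPDF, Tesseract and the matching trained data file to be installed.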
![](https://private-user-images.githubusercontent.com/2325925/241585939-af976dc1-9415-4b6a-b2bb-71f8e9642555.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjE1MzM3ODgsIm5iZiI6MTcyMTUzMzQ4OCwicGF0aCI6Ii8yMzI1OTI1LzI0MTU4NTkzOS1hZjk3NmRjMS05NDE1LTRiNmEtYjJiYi03MWY4ZTk2NDI1NTUucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDcyMSUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDA3MjFUMDM0NDQ4WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MTkwYjFhODFmYjA5NTMwYTRmMzA0YmQzMDI4ZmFmNWNjN2VmNGNmZTc0ZTNmZGRkMTQ3NWU3MTUyNjc3OTM4ZSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.VVbDpzK-r50d6Z-Q9S3heqtP73Fy_maopNBDF-2_lzw)
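
As a rough idea of how regex-based extraction and CSV export can work, here is a minimal sketch. It is not the program's actual code: the coordinate notations (`E=500123.45 m, N=7456789.12 m` for UTM and `22°30'00"S` for Lat-Long), the function names, and the CSV column names are all assumptions chosen for the example.

```python
import csv
import io
import re

# Assumed UTM notation (hypothetical): "E=500123.45 m, N=7456789.12 m",
# with either "." or "," as the decimal separator.
UTM_RE = re.compile(
    r"E\s*=?\s*(\d{6}(?:[.,]\d+)?)\s*m?\W+N\s*=?\s*(\d{7}(?:[.,]\d+)?)"
)

# Assumed Lat-Long notation (hypothetical): 22°30'00"S, hemisphere N, S, E or W.
DMS_RE = re.compile(r"(\d{1,3})°\s*(\d{1,2})'\s*(\d{1,2}(?:[.,]\d+)?)\"\s*([NSEW])")


def to_float(text):
    """Parse a number that may use a decimal comma."""
    return float(text.replace(",", "."))


def extract_utm(text):
    """Return a list of (easting, northing) pairs found in the text."""
    return [(to_float(e), to_float(n)) for e, n in UTM_RE.findall(text)]


def extract_dms(text):
    """Return decimal degrees for each DMS value; S and W become negative."""
    values = []
    for deg, minutes, seconds, hemi in DMS_RE.findall(text):
        value = int(deg) + int(minutes) / 60 + to_float(seconds) / 3600
        values.append(-value if hemi in "SW" else value)
    return values


def write_csv(points, stream):
    """Write (easting, northing) pairs with a header row and a running id."""
    writer = csv.writer(stream)
    writer.writerow(["id", "easting", "northing"])
    for index, (easting, northing) in enumerate(points, start=1):
        writer.writerow([index, easting, northing])
```

A file written this way has one point per row, which is the layout that a delimited-text import in GIS software generally expects. Real documents vary a lot, so the patterns would need adjusting to the notation actually used.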
On Windows, you need to install the packages listed below.
I recommend using the Chocolatey package manager for the ones that have a `choco` command (run them in an Administrator command prompt). The Python packages are installed with `pip`.
- Python 3.8 (64-bit) or later: `choco install python3`
- Tesseract 4.1.1 (64-bit) or later: `choco install --pre tesseract`
  - You'll also need the Tesseract trained data files for your language
- Ghostscript 9.50 (64-bit) or later: `choco install ghostscript`
- OCRmyPDF 14.2.0 (64-bit) or later: `pip install ocrmypdf`
- pypdf 3.9.0 (64-bit) or later: `pip install pypdf`
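
After installing, a quick check like the one below can confirm that the tools are visible from Python. This is a hedged sketch, not part of the program: the function name `find_missing` is made up for the example, and the executable names `tesseract` and `gswin64c` (the 64-bit Ghostscript console executable on Windows) assume default installs that are on the `PATH`.

```python
import importlib.util
import shutil


def find_missing(executables=("tesseract", "gswin64c"),
                 modules=("ocrmypdf", "pypdf")):
    """Return the names of executables and Python modules that cannot be found."""
    # shutil.which returns None when the executable is not on the PATH
    missing = [name for name in executables if shutil.which(name) is None]
    # find_spec returns None when a top-level module is not installed
    missing += [name for name in modules if importlib.util.find_spec(name) is None]
    return missing


if __name__ == "__main__":
    missing = find_missing()
    print("Missing:", ", ".join(missing) if missing else "nothing")
```

This only checks that the names resolve; it does not verify the minimum versions listed above.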