Link: https://www.youtube.com/watch?v=V7wUBW-en-4
Built as part of the Luddy Hackathon 2024. We created a standalone application that splits a PDF file into multiple files based on an OCR configuration supplied by the user. Use case: the application can automatically split court legal proceeding documents into multiple lead documents for submission, for multiple users.
- Nikhil Vemula
- Pavan Vemulapalli
- Bhargavi Thanneeru
- Krishna Teja J
 
Takes a configuration file, an input folder, and an output folder as input, and creates multiple PDF files in a folder for each user.
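As a rough illustration of the splitting step described above, the sketch below groups page indices by a per-page key (for example, a user name read by OCR) and derives one output file name per key. The function names, the shape of `page_keys`, and the rule that a page without a key belongs to the previous key are all assumptions made for illustration, not taken from the application's code.

```python
from collections import defaultdict
from pathlib import Path


def group_pages_by_key(page_keys):
    """Map each key to the list of page indices that belong to it.

    page_keys: list where item i is the text detected on page i
    (hypothetical shape). Assumption: a page with an empty key is a
    continuation of the most recent page that had a key.
    """
    groups = defaultdict(list)
    current = None
    for index, key in enumerate(page_keys):
        key = key.strip()
        if key:
            current = key
        if current is not None:
            groups[current].append(index)
    return dict(groups)


def output_path(output_folder, key):
    """Build a file name for a key, replacing unsafe characters with '_'."""
    safe = "".join(c if c.isalnum() else "_" for c in key)
    return Path(output_folder) / f"{safe}.pdf"
```

Each group of page indices could then be written to its own PDF at the path returned by `output_path`.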
We felt that creating the configuration file manually was not a feasible solution, so we developed an interface to easily extract the configuration from a template PDF provided by the user.
It involves two steps:
Use Adobe Acrobat to create annotations, and save the PDF after annotating all the pages as required.
Select the PDF created above in the application to generate the configuration file automatically. You can modify or add configurations and then save the configuration file.
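The format of the configuration file is not described here, so the following is only a hypothetical sketch of what a generated configuration could look like: one named rectangle per annotated area, stored as JSON. The `Region` fields and the `"regions"` key are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Region:
    name: str   # label for the annotated area, e.g. "user_name" (hypothetical)
    page: int   # zero-based page index in the template PDF
    x0: float   # rectangle corners in PDF points
    y0: float
    x1: float
    y1: float


def dump_config(regions):
    """Serialize the regions to a JSON string that can be edited by hand."""
    return json.dumps({"regions": [asdict(r) for r in regions]}, indent=2)


def load_config(text):
    """Parse a JSON string produced by dump_config back into Region objects."""
    return [Region(**item) for item in json.loads(text)["regions"]]
```

Keeping the file as plain JSON would make the manual modification step mentioned above straightforward.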
- It is a Python GUI application.
- Uses pymupdf for extracting PDF pages and converting them to images.
- Uses tesseract to perform OCR on the annotated area given in the configuration.
- Uses tkinter to create the GUI.
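A minimal sketch of how these pieces could fit together: render a page to an image with pymupdf, crop the annotated rectangle, and pass it to Tesseract. It assumes the `pytesseract` wrapper and Pillow are used to call Tesseract, and that the rectangle is given in PDF points with a top-left origin; the function names are hypothetical and the application's actual code may differ.

```python
def points_to_pixels(rect, dpi):
    """Scale an (x0, y0, x1, y1) rectangle from PDF points (72 per inch) to pixels."""
    scale = dpi / 72
    return tuple(round(v * scale) for v in rect)


def ocr_region(pdf_path, page_index, rect, dpi=300):
    """Return the text found inside rect on the given page."""
    # Imported here so points_to_pixels stays usable without these packages.
    import io

    import fitz  # PyMuPDF
    import pytesseract  # assumption: Tesseract is called through this wrapper
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        # Render the whole page to a bitmap at the requested resolution.
        pix = doc[page_index].get_pixmap(dpi=dpi)
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    # Crop to the annotated area before running OCR.
    cropped = image.crop(points_to_pixels(rect, dpi))
    return pytesseract.image_to_string(cropped).strip()
```

A higher `dpi` generally gives Tesseract more detail to work with, at the cost of slower rendering.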
The steps below are for macOS; on Windows, you should be able to install the equivalent libraries.
brew install tcl-tk
You may have trouble getting this to work with pyenv; see https://stackoverflow.com/a/60469203
brew install tesseract
pip install -r requirements.txt
pyinstaller --onefile --windowed --icon=icons/icon.ico --name=PDFSplitter --add-data "C:\Program Files\Tesseract-OCR;Tesseract-OCR" app.py
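When Tesseract is bundled with `--add-data` as in the command above, the application has to locate the bundled copy at run time. The sketch below shows one common way to do that with PyInstaller's `sys._MEIPASS` attribute. It assumes OCR goes through the `pytesseract` wrapper and that the bundled folder contains `tesseract.exe`; the helper names are hypothetical.

```python
import sys
from pathlib import Path


def resource_path(relative, fallback_base="."):
    """Resolve a bundled resource both in a PyInstaller build and when run from source."""
    # One-file PyInstaller builds unpack bundled data into a temporary
    # folder and expose its location as sys._MEIPASS.
    base = getattr(sys, "_MEIPASS", fallback_base)
    return Path(base) / relative


def configure_tesseract():
    """Point pytesseract at the bundled Tesseract binary if it is present."""
    import pytesseract  # assumption: Tesseract is called through this wrapper

    # "Tesseract-OCR" matches the destination name given to --add-data.
    exe = resource_path("Tesseract-OCR") / "tesseract.exe"
    if exe.exists():
        pytesseract.pytesseract.tesseract_cmd = str(exe)
```

Calling `configure_tesseract()` once at startup, before any OCR call, would let the same code run from source and from the packaged executable.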
- Understanding the unique legal document use case and the need for this application
- Setting up Tesseract and linking it with Python
- Unable to create a standalone single-file application that runs without any installation using py2app or pyinstaller



