Split PDF using OCR

Youtube Video

Link: https://www.youtube.com/watch?v=V7wUBW-en-4

Problem statement

As part of the Luddy Hackathon 2024, we built a standalone application that splits a PDF file into multiple files based on an OCR configuration supplied by the user. It can be used to automatically split court legal proceeding documents into multiple lead documents for submission by multiple users. The use case is described in the usecase folder of the repository.

Team

Features

Split PDF

Takes a configuration file, an input folder, and an output folder as input, and creates multiple PDF files in a folder for each user.
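The split step can be sketched roughly as follows. This is a minimal sketch, not the code in app.py: it assumes each page has already been reduced to a key (for example a user name read by OCR), and the helper names `group_pages_by_key` and `write_split_pdfs` are hypothetical. Only `fitz.open`, `insert_pdf`, and `save` are PyMuPDF calls.

```python
from collections import defaultdict
from pathlib import Path


def group_pages_by_key(page_keys):
    """Map each non-empty key to the list of page indexes that carry it."""
    groups = defaultdict(list)
    for page_index, key in enumerate(page_keys):
        if key:  # pages where no key was recognised are skipped
            groups[key].append(page_index)
    return dict(groups)


def write_split_pdfs(input_pdf, output_dir, groups):
    """Write one PDF per key into output_dir/<key>/ (hypothetical layout)."""
    import fitz  # PyMuPDF, imported lazily so the grouping helper has no dependencies

    source = fitz.open(input_pdf)
    for key, pages in groups.items():
        # Assumption: the key is already safe to use as a folder name.
        target_dir = Path(output_dir) / key
        target_dir.mkdir(parents=True, exist_ok=True)
        part = fitz.open()  # new, empty PDF
        for page_index in pages:
            part.insert_pdf(source, from_page=page_index, to_page=page_index)
        part.save(str(target_dir / Path(input_pdf).name))
        part.close()
    source.close()
```

For example, `group_pages_by_key(["alice", "alice", "", "bob"])` groups pages 0 and 1 under `alice` and page 3 under `bob`, and the result can be passed to `write_split_pdfs` together with the input file and output folder.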

Auto generation of configuration file

We felt that creating the configuration file manually is not a feasible solution, so we developed an interface to easily extract the configuration from a template PDF provided by the user.

It involves two steps

Creating a template pdf with annotations

Use Adobe Acrobat to create annotations, then save the PDF after annotating all the PDF pages as required.

Creating configuration using app

Select the PDF created above in the application to generate the configuration file automatically. You can modify or add configurations and then save the configuration file.
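One way the annotations could be turned into configuration entries is sketched below. The entry layout (`page`, `label`, `bbox`) and the function names are assumptions made for illustration; the actual configuration format used by the application is not described here. `page.annots()`, `annot.rect`, and `annot.info` are PyMuPDF calls.

```python
import json


def annotation_to_entry(page_number, rect, label=""):
    """Build one configuration entry (hypothetical format) from a rectangle."""
    x0, y0, x1, y1 = rect
    return {
        "page": page_number,
        "label": label,
        "bbox": [round(v, 2) for v in (x0, y0, x1, y1)],
    }


def extract_config(template_pdf):
    """Collect one entry per annotation found in the template PDF."""
    import fitz  # PyMuPDF, imported lazily

    entries = []
    with fitz.open(template_pdf) as doc:
        for page in doc:
            for annot in page.annots():
                rect = annot.rect
                label = annot.info.get("content", "")  # comment text typed in Acrobat
                entries.append(
                    annotation_to_entry(
                        page.number, (rect.x0, rect.y0, rect.x1, rect.y1), label
                    )
                )
    return entries


def save_config(entries, path):
    """Write the entries as JSON so they can be edited by hand afterwards."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, indent=2)
```

Storing the rectangles as plain JSON keeps the file easy to edit by hand, which matches the step of modifying or adding configurations before saving.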

Technical Details

It is a Python GUI application.
Uses PyMuPDF to extract PDF pages and convert them to images.
Uses Tesseract to perform OCR on the annotated areas given in the configuration.
Uses Tkinter to create the GUI.
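The render-then-OCR pipeline listed above might look roughly like this. It is a sketch under assumptions: the `bbox` argument is a hypothetical `(x0, y0, x1, y1)` rectangle in PDF points, the `pytesseract` and Pillow packages are assumed to be the bridge to Tesseract, and the function names are illustrative rather than taken from app.py.

```python
def clean_ocr_text(raw):
    """Collapse runs of whitespace so OCR output can be compared or used as a key."""
    return " ".join(raw.split())


def ocr_region(pdf_path, page_index, bbox, zoom=3.0):
    """Render one rectangle of a page to an image and run Tesseract on it."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        # A zoom factor above 1 renders at higher resolution, which usually helps OCR.
        pix = page.get_pixmap(
            matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(*bbox), alpha=False
        )
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return clean_ocr_text(pytesseract.image_to_string(image))
```

Rendering only the clipped rectangle, rather than the whole page, keeps the image small and limits the OCR result to the annotated area.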

Installation

These instructions are for macOS. You should be able to install the equivalent libraries on Windows.

Install tkinter

brew install tcl-tk

You may have trouble making it work with pyenv; see https://stackoverflow.com/a/60469203

Install tesseract

brew install tesseract
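To confirm that the install worked, a short check like the one below can be run. It is only a convenience sketch: it assumes the `tesseract` executable is expected on `PATH`, and the function name `tesseract_version` is hypothetical.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if it is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some builds print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "tesseract was not found on PATH")
```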

Install python dependencies

pip install -r requirements.txt

Build

pyinstaller --onefile --windowed --icon=icons/icon.ico --name=PDFSplitter --add-data "C:\Program Files\Tesseract-OCR;Tesseract-OCR" app.py
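When Tesseract is bundled with `--add-data` as in the command above, the application still has to locate it at run time. The sketch below shows one common approach and is not confirmed to be what app.py does: it assumes the bundled folder is named `Tesseract-OCR` as in the build command, that it contains `tesseract.exe`, and that the `pytesseract` package is used. The helper names are hypothetical.

```python
import sys
from pathlib import Path


def resource_base():
    """Return PyInstaller's unpack folder when frozen, else the working directory."""
    # PyInstaller sets sys._MEIPASS inside a bundled application.
    return Path(getattr(sys, "_MEIPASS", Path.cwd()))


def tesseract_executable(base=None):
    """Path where the bundled executable is assumed to live."""
    base = Path(base) if base is not None else resource_base()
    return base / "Tesseract-OCR" / "tesseract.exe"


def configure_pytesseract():
    """Point pytesseract at the bundled executable if it is present."""
    import pytesseract

    exe = tesseract_executable()
    if exe.exists():
        pytesseract.pytesseract.tesseract_cmd = str(exe)
    return exe.exists()
```

Calling `configure_pytesseract()` once at startup, before any OCR call, lets the same code fall back to a system-wide Tesseract when the bundled copy is missing.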

Challenges

Understanding the unique legal document use case and the need for this application
Setting up Tesseract and linking it with Python
We were unable to create a standalone single-file application that needs no installation using py2app or PyInstaller


nikhil-vemula/split-pdf-using-ocr

Folders and files

Latest commit

History

Repository files navigation

Split PDF using OCR

Youtube Video

Contents

Problem statement

Team

Features

Split PDF

Auto generation of configuration file

Creating a template pdf with annotations

Creating configuration using app

Technical Details

Installation

Install tkinter

Install tesseract

Install python dependencies

Build

Challenges

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 4

Uh oh!

Languages

Packages