Parse tables from PDF

This tool automates extracting tables from PDF documents. It goes through every page, detects where the tables are, and writes them out as CSV. It uses pytesseract to read the text in each cell and determines the cell's position in the table.
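To illustrate the last step, here is a minimal sketch of how recognised cells could be arranged into rows and written as CSV. It is not the code used in parse_table.py: the function names `cells_to_rows` and `rows_to_csv`, the `(x, y, text)` cell format and the `row_tolerance` value are all assumptions made for the example, and the OCR step is replaced by hard-coded strings.

```python
import csv
import io


def cells_to_rows(cells, row_tolerance=10):
    """Group (x, y, text) cells into rows by vertical position (hypothetical helper)."""
    rows = []
    # Visit cells top to bottom so rows are created in reading order.
    for x, y, text in sorted(cells, key=lambda c: (c[1], c[0])):
        # A cell joins the current row when its y is close to the row's first cell.
        if rows and abs(y - rows[-1]["y"]) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Inside each row, order the cells left to right.
    return [
        [text for _, text in sorted(row["cells"], key=lambda c: c[0])]
        for row in rows
    ]


def rows_to_csv(rows):
    """Serialise a list of rows to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up cell positions in pixels; in the real tool the text would come from OCR.
cells = [
    (210, 52, "Price"), (10, 50, "Item"),
    (12, 101, "Apple"), (208, 99, "1.20"),
]
table = cells_to_rows(cells)
csv_text = rows_to_csv(table)
print(csv_text)
```

The tolerance absorbs the few pixels of jitter that detected cell boxes usually have. Tables with merged cells would need a more careful approach.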

You can use the tool in two ways: run the Python script directly with command-line flags, or start a web server that hosts a page for uploading files. The server processes the uploaded files, displays the current progress, and returns the CSV files.

Here's the front page

While processing, it displays the status of each page and gives you the option to download each result individually, or all of them together at the end

Example results

Input table as an image in PDF file

Parsed table

Installation

Required Python libraries

pip install pytesseract opencv-python tqdm progressbar pdf2image pymupdf fitz frontend tools

# Optional for webserver
pip install aiohttp eventlet

Tesseract installation on Linux using apt

sudo apt install tesseract-ocr tesseract-ocr-rus

Linux using pacman

sudo pacman -S poppler
sudo pacman -S tesseract  # Select needed language, for example rus - 94
sudo pacman -S tesseract-data-rus tesseract-data-eng

Windows

Download the Tesseract installer (exe) from https://github.com/UB-Mannheim/tesseract/wiki.
Install it to C:\Program Files (x86)\Tesseract-OCR
Open a Windows command prompt (in your Python virtual environment, if you use one) or an Anaconda prompt.
Run pip install pytesseract

Running

Running locally

From PDF file

python3 recognise.py --client --input example/rencap2021.pdf --limit 10

From a remote PDF file

python3 recognise.py --client --remote https://github.com/pavtiger/Parse-tables-from-PDF/raw/master/example/rencap2021.pdf --limit 10

All output is written to the output/ directory. You can find example results in example/.

You can also change the render quality (200 or higher is recommended)

python3 recognise.py --client --input example/rencap2021.pdf --limit 10 --quality 300
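To get a feel for why a higher quality value costs more memory, the sketch below estimates the size of one rendered page. It assumes that the quality value is a DPI setting and that pages are A4 (8.27 x 11.69 inches); neither is stated in this README, and `page_pixels` and `raw_rgb_megabytes` are hypothetical helpers, not part of the tool.

```python
def page_pixels(dpi, width_in=8.27, height_in=11.69):
    """Pixel size of one page rendered at the given DPI (A4 assumed)."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_rgb_megabytes(dpi):
    """Uncompressed size of one RGB page image, at 3 bytes per pixel."""
    width_px, height_px = page_pixels(dpi)
    return width_px * height_px * 3 / 1_000_000


for dpi in (200, 300):
    width_px, height_px = page_pixels(dpi)
    print(f"{dpi} DPI: {width_px} x {height_px} px, "
          f"{raw_rgb_megabytes(dpi):.1f} MB per raw page")

# Pixel count grows with the square of the DPI.
ratio = raw_rgb_megabytes(300) / raw_rgb_megabytes(200)
print(f"300 DPI uses about {ratio:.2f}x the memory of 200 DPI per page")
```

This covers a single raw page image only. Actual memory use also depends on how many pages are held at once and on the intermediate images created during processing.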

Running web server

python3 recognise.py --server

All available flags:

input - Path to the input PDF file to convert
remote - Link to a remote location from which to download the PDF file
limit - Process only the first N pages (-1 for all)
quality - PDF page render quality (default 200). Higher values consume more RAM, but going below 200 is strongly discouraged because it causes recognition errors. For reference, 300 requires 8 GB of RAM
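For readers who want to see how these flags fit together, here is a rough argparse sketch. It is an illustration only, not the parser from recognise.py: treating --client and --server as mutually exclusive, using -1 as the default for --limit, and the `pages_to_process` helper are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags described in this README (sketch)."""
    parser = argparse.ArgumentParser(description="Parse tables from PDF (sketch)")
    # Assumption: exactly one of the two run modes must be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--client", action="store_true", help="process a PDF locally")
    mode.add_argument("--server", action="store_true", help="start the web server")
    parser.add_argument("--input", help="path to input PDF file to convert")
    parser.add_argument("--remote", help="link to a remote PDF file")
    # Assumption: -1 (all pages) is the default when --limit is omitted.
    parser.add_argument("--limit", type=int, default=-1,
                        help="process only the first N pages (-1 for all)")
    parser.add_argument("--quality", type=int, default=200,
                        help="PDF page render quality (default 200)")
    return parser


def pages_to_process(total_pages, limit):
    """Number of pages to handle for a given --limit (hypothetical helper)."""
    return total_pages if limit < 0 else min(limit, total_pages)


args = build_parser().parse_args(
    ["--client", "--input", "example/rencap2021.pdf", "--limit", "10"]
)
print(args.input, args.limit, args.quality)
print(pages_to_process(25, args.limit))
```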
