- Several sample PDFs!
- A set of commands to convert PDFs to plain text!
- A compiled version of pdf-splitter for mac!
- pdftk
- Some stuff from brew (make sure to run
brew update
first):brew install imagemagick
brew install tesseract
Get the first page of a PDF:
pdftk pdfs/52-6078041_990PF_200706.pdf shuffle 1 output 52-607.pdf
Turn that first page into an image:
./pdf-splitter 52-607.pdf 'img/%.d.jpg' 1200px
Get a section of the page to process:
convert img/1.jpg -crop 490x20+185+225 img/1.crop.jpg
OCR the image (you'll get a file named 1.txt):
tesseract img/1.crop.jpg 1