Scripts and tools for ingesting data.
Note: This has only been tested on macOS.
$ brew install pdftotext
tr is also required, but it should already be installed by the OS:
$ which tr
/usr/bin/tr
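The prerequisite check above can also be done from Python before running the extractors. This is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and it only confirms that the two tools named above are on the PATH.

```python
import shutil

# Tools this README lists as required on the PATH.
REQUIRED_TOOLS = ("pdftotext", "tr")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH.

    Hypothetical helper for illustration; not part of the extractors.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty list means both tools were found; anything else names what still needs to be installed.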
To extract data from multiple PDFs in an input directory:
$ python3 extractors/p223_pdf_batch.py my/input/directory my/output/directory
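To preview which files a batch run over a directory would touch, something like the following may help. It is only a sketch under stated assumptions: the function `plan_outputs` is hypothetical, and the `<name>.csv` output naming is a guess for illustration, not necessarily what `p223_pdf_batch.py` produces.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir):
    """Pair each PDF in input_dir with a same-named .csv path in output_dir.

    Assumption: one CSV per PDF, named after the PDF. The real batch
    script may name or organize its outputs differently.
    """
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [(pdf, Path(output_dir) / (pdf.stem + ".csv")) for pdf in pdfs]


if __name__ == "__main__":
    for pdf, csv_path in plan_outputs("my/input/directory", "my/output/directory"):
        print(f"{pdf} -> {csv_path}")
```

The sketch only lists paths; it does not read or write any PDF or CSV.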
TODO: Add instructions for retrieving PDFs and cached outputs from Google Cloud.
To download a single PDF and convert it to CSV:
$ curl \
https://www.seattleschools.org/wp-content/uploads/2024/09/P223_Sep24.pdf \
-o p223_sep24.pdf
$ pdftotext -layout p223_sep24.pdf -f 2 - | tr -s ' ' > squished.txt
$ python3 extractors/p223_pdf_to_csv.py squished.txt out.csv
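The `pdftotext | tr` step above can be approximated in Python when a shell pipeline is inconvenient. This is a rough sketch, not the repository's code: the function names are hypothetical, `pdftotext` must still be installed, and the output is assumed to be UTF-8 (pdftotext's default encoding).

```python
import re
import subprocess


def squeeze_spaces(text):
    """Collapse each run of spaces into one space, like `tr -s ' '`.

    Only spaces are squeezed; tabs and newlines are left untouched.
    """
    return re.sub(r" {2,}", " ", text)


def pdftotext_command(pdf_path, first_page=2):
    """Build the pdftotext invocation used above; '-' sends text to stdout."""
    return ["pdftotext", "-layout", "-f", str(first_page), pdf_path, "-"]


def pdf_to_squished_text(pdf_path, first_page=2):
    """Run pdftotext and return its output with runs of spaces squeezed.

    Hypothetical helper; assumes pdftotext is on the PATH and emits UTF-8.
    """
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page),
        check=True,
        capture_output=True,
        encoding="utf-8",
    )
    return squeeze_spaces(result.stdout)
```

Writing the result of `pdf_to_squished_text("p223_sep24.pdf")` to `squished.txt` should give a file suitable for `extractors/p223_pdf_to_csv.py`, as in the shell version.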