A simple Python script that converts the official UIC course description PDF into a semi-structured file (TSV).
It is recommended to create a new virtual environment for this project; you can use venv or conda.
After setting up the environment, install the required packages specified in requirements.txt with the command pip install -r requirements.txt
Once pip finishes the installation, you have an environment ready to process the PDF.
Since we do not own the copyright of the file, we do not include a copy in this repository. You should be able to obtain the file via AR's website.
This script is not well polished, so all parameters are hard-coded. Before you actually press the button to start the parsing magic, you need to change these two variables:
- PDF_FILENAME at Line 7 of the extract.py file: this should point to the path where the PDF is located.
- FILENAME_PREFIX at Line 6 of the extract.py file: this is the prefix for all output files.

The script should output three files:
- <FILENAME_PREFIX>-raw_lines.txt: the raw lines extracted from the PDF file, used for debugging.
- <FILENAME_PREFIX>-records.tsv: a TSV file containing the columns course_code, course_name, course_units, course_prerequisites
- <FILENAME_PREFIX>-description.tsv: a TSV file containing the columns course_code, course_description
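
The two TSV outputs share the course_code column, so they can be joined back together for downstream use. The following is a minimal sketch using only the standard library. It assumes that each file starts with a header row and that fields are plain tab-separated text without special quoting; neither is confirmed here, so pass has_header=False or adjust the reader if your output differs. The helper names read_tsv and load_courses are hypothetical and are not part of extract.py.

```python
import csv

# Column names as listed above. Whether the files carry a header row
# is an assumption; see the has_header parameter.
RECORD_COLUMNS = ["course_code", "course_name", "course_units", "course_prerequisites"]
DESCRIPTION_COLUMNS = ["course_code", "course_description"]


def read_tsv(path, columns, has_header=True):
    """Read a TSV file into a list of dicts keyed by the given column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))
    if has_header and rows:
        rows = rows[1:]  # drop the (assumed) header row
    return [dict(zip(columns, row)) for row in rows]


def load_courses(prefix, has_header=True):
    """Join <prefix>-records.tsv and <prefix>-description.tsv on course_code."""
    records = read_tsv(f"{prefix}-records.tsv", RECORD_COLUMNS, has_header)
    descriptions = read_tsv(f"{prefix}-description.tsv", DESCRIPTION_COLUMNS, has_header)
    by_code = {
        row["course_code"]: row.get("course_description", "")
        for row in descriptions
    }
    # Courses without a matching description get an empty string.
    return [
        dict(record, course_description=by_code.get(record["course_code"], ""))
        for record in records
    ]
```

Calling load_courses with the same value you set for FILENAME_PREFIX returns one dict per course, with the description merged in.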
After you have modified these two variables, you should be able to start parsing with the command python extract.py
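
Because the parameters are hard-coded, a typo in either variable only shows up once the script is running. The sketch below is one way to check them beforehand. The values assigned to PDF_FILENAME and FILENAME_PREFIX are placeholders, and output_paths and check_inputs are hypothetical helpers that do not exist in extract.py; only the three output file name patterns come from the description above.

```python
from pathlib import Path

# Placeholder values for illustration; in extract.py these are edited in place.
FILENAME_PREFIX = "courses"
PDF_FILENAME = "course-descriptions.pdf"


def output_paths(prefix):
    """Return the three output file names described above for a given prefix."""
    return {
        "raw_lines": Path(f"{prefix}-raw_lines.txt"),
        "records": Path(f"{prefix}-records.tsv"),
        "description": Path(f"{prefix}-description.tsv"),
    }


def check_inputs(pdf_filename, prefix):
    """Return a list of problems; an empty list means the settings look usable."""
    problems = []
    if not Path(pdf_filename).is_file():
        problems.append(f"PDF not found: {pdf_filename}")
    parent = Path(prefix).parent
    if not parent.is_dir():
        problems.append(f"Output directory does not exist: {parent}")
    return problems


if __name__ == "__main__":
    for problem in check_inputs(PDF_FILENAME, FILENAME_PREFIX):
        print(problem)
    for name, path in output_paths(FILENAME_PREFIX).items():
        print(f"{name}: {path}")
```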