Description

This script imports every PDF in the 'Input' folder and expects each one to be in the report format from equineline.com.
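
As a rough illustration of the import step, here is a minimal sketch of collecting the PDFs from the input folder. The function name `find_input_pdfs` is hypothetical and is not taken from `main.py`; the sketch only assumes the PDFs sit directly inside the folder.

```python
from pathlib import Path


def find_input_pdfs(folder="Input"):
    """Return every PDF directly inside `folder`, sorted by file name.

    Hypothetical helper for illustration; the real script may differ.
    """
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"input folder not found: {root}")
    # Compare the suffix case-insensitively so 'report.PDF' is picked up too.
    pdfs = (p for p in root.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")
    return sorted(pdfs, key=lambda p: p.name)
```

Calling `find_input_pdfs("Input")` returns a list of `Path` objects that can then be handed to the parser one by one.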

Current Status:

The script will run for any number of PDF files, but I've only tested Awesome Again thoroughly. I upload outputs to the Google Drive when I can :D

Catch rate

The goal is to capture every single foal, no matter the race count. The script currently produces 1,314 total rows for Awesome Again, which means approximately 5% of entries are being lost, though I forget what the exact total count is.
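
Since the exact total is not recorded here, the two figures above can be used to back out a rough estimate. This is only arithmetic on the 1,314 rows and the approximate 5% loss, so the result is an estimate and not a verified foal count. The helper name `implied_total` is made up for this example.

```python
def implied_total(captured_rows, loss_rate):
    """Estimate the true number of entries from the rows captured
    and the fraction believed to be lost (0 <= loss_rate < 1)."""
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    return captured_rows / (1 - loss_rate)


captured = 1314
estimate = implied_total(captured, 0.05)
print(f"estimated total foals: {estimate:.0f}")
print(f"estimated missing rows: {estimate - captured:.0f}")
```

With these inputs the estimate comes out at about 1,383 foals, i.e. roughly 69 rows missing.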

Known Issues:

Many foal names include a country tag, e.g. '(KOR)'. This can cause mismatches down the line when other data sets include or omit the tag.
- Currently, it breaks a lot of entries because the pattern relies on the birth year being the only thing in parentheses.
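
One possible way around the parentheses problem is to look at every parenthesised group and classify it, instead of assuming there is only one. The sketch below is not the pattern currently used in the script. It assumes a country tag is two or three uppercase letters and a birth year is four digits, and the sample strings are invented, so the real equineline.com layout may need a different rule.

```python
import re

# Any "(...)" group that does not itself contain parentheses.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")
YEAR = re.compile(r"\d{4}")          # assumed birth-year shape
COUNTRY = re.compile(r"[A-Z]{2,3}")  # assumed country-tag shape, e.g. KOR


def parse_foal(text):
    """Split a foal entry into name, optional country tag, and birth year."""
    year = None
    country = None
    for group in PAREN_GROUP.findall(text):
        group = group.strip()
        if YEAR.fullmatch(group):
            year = int(group)
        elif COUNTRY.fullmatch(group):
            country = group
    # Drop all parenthesised groups, then collapse leftover whitespace.
    name = " ".join(PAREN_GROUP.sub(" ", text).split())
    return {"name": name, "country": country, "birth_year": year}


print(parse_foal("Example Foal (KOR) (2005)"))
print(parse_foal("Example Foal (2005)"))
```

Keeping the country in its own field means the bare name can still be matched against data sets that omit the tag.
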

Page breaks introduce many issues. Headers and footers are complicated, and my current method of filtering them out (a scanner object) slows everything down by a ton.
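
A cheaper alternative that might be worth trying is to treat any line that repeats on most pages as a header or footer and drop it in one pass. This is a different technique from the scanner object, not a description of it. The sketch assumes each page has already been extracted into a list of text lines, and both `strip_repeated_lines` and the `min_share` threshold are hypothetical.

```python
from collections import Counter


def strip_repeated_lines(pages, min_share=0.8):
    """Remove lines that appear on at least `min_share` of the pages.

    `pages` is a list of pages, each a list of text lines.
    """
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return [list(page) for page in pages]

    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page if line.strip()})

    threshold = min_share * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    # Set lookups keep the filtering pass linear in the number of lines.
    return [[line for line in page if line.strip() not in repeated]
            for page in pages]
```

Two caveats: footers that change per page (for example a page number) will not be caught by an exact match, and a genuine data line that happens to repeat on nearly every page would be dropped as well.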

downloadjpg/brooks-research
