Disease database

Creating a historical disease database (19th-20th century) for municipalities in the Netherlands.

Preparation

pip install tqdm polars requests matplotlib plotnine PyQt6 pyarrow

Data extraction

The downloaded Delpher XML files are contained in a zip folder, which takes up a lot of storage space.

The extract_article_data.py script extracts the title and text of each article from the zip folder. It stores the extracted data in a polars dataframe with three columns (file_name, title, and text) and saves it as a parquet file (article_data.parquet), which is much smaller than the zip folder.
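The extraction step can be sketched roughly as follows. This is a minimal illustration, not the actual contents of extract_article_data.py: the function name extract_articles is hypothetical, and the XML layout (a root element with one title child and several p paragraphs) is an assumption about the Delpher article files. It uses only the standard library and returns plain rows.

```python
import zipfile
import xml.etree.ElementTree as ET


def extract_articles(zip_path):
    """Return one dict per XML file in the archive: file_name, title, text."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue  # skip directories and non-XML members
            # Read each member straight from the archive, without unpacking to disk.
            with archive.open(name) as handle:
                try:
                    root = ET.parse(handle).getroot()
                except ET.ParseError:
                    continue  # skip malformed files instead of aborting the run
            # Assumed layout: <text><title>...</title><p>...</p>...</text>
            title = (root.findtext("title") or "").strip()
            paragraphs = [(p.text or "").strip() for p in root.iter("p")]
            rows.append(
                {
                    "file_name": name,
                    "title": title,
                    "text": "\n".join(p for p in paragraphs if p),
                }
            )
    return rows
```

The returned rows can then be turned into a dataframe and saved, for example with pl.DataFrame(rows).write_parquet("article_data.parquet").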

With the extract_meta_data.py script, we extract meta information about both the newspapers and the individual articles. This results in two separate polars dataframes saved in parquet format:

newspaper_meta_data.parquet includes these columns: newspaper_name, newspaper_location, newspaper_date, newspaper_years_digitalised, newspaper_years_issued, newspaper_language, newspaper_temporal, newspaper_publisher, newspaper_spatial, and pdf_link.
article_meta_data.parquet includes these columns: newspaper_id, item_id, item_subject, item_filename, and item_type.

Before you run the following scripts, make sure to specify the correct path to the Delpher zip folder using file_path.

python extract_article_data.py
python extract_meta_data.py

Then, the script combine_and_chunk.py joins these datasets and creates a yearly-chunked series of parquet files in the folder processed_data/combined.
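The join-and-chunk logic can be illustrated with plain Python dictionaries. This is a sketch under stated assumptions, not the code in combine_and_chunk.py: the function name is hypothetical, articles are assumed to match article meta data on file_name = item_filename, the newspaper meta data is assumed to carry a newspaper_id key (it is not in the column list above), and newspaper_date is assumed to start with a four-digit year.

```python
from collections import defaultdict


def combine_and_chunk(articles, article_meta, newspaper_meta):
    """Join the three tables and group the combined rows by publication year."""
    # Index the meta tables by their (assumed) join keys for constant-time lookups.
    meta_by_file = {m["item_filename"]: m for m in article_meta}
    paper_by_id = {n["newspaper_id"]: n for n in newspaper_meta}

    chunks = defaultdict(list)
    for article in articles:
        meta = meta_by_file.get(article["file_name"])
        paper = paper_by_id.get(meta["newspaper_id"]) if meta else None
        if paper is None:
            continue  # inner join: drop articles without matching meta data
        combined = {**paper, **meta, **article}
        # Assumes newspaper_date starts with a four-digit year, e.g. "1866-07-14".
        year = int(paper["newspaper_date"][:4])
        chunks[year].append(combined)
    return dict(chunks)
```

Each year's list of rows could then be written to its own parquet file in processed_data/combined, for example with pl.DataFrame(rows).write_parquet(path). With polars dataframes, the same join would typically be expressed with DataFrame.join.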

Data analysis

The script query.py uses the prepared combined data to search for mentions of diseases and locations in articles. The script produces the plot shown above. It also produces this plot about Leiden:

This plot aligns quite nicely with the Google Ngram Viewer results for "cholera" in the English, German, and French corpora.
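A search for disease and location mentions could look like the sketch below. It is an illustration only and does not reproduce query.py: the function name count_mentions, the year column, and the whole-word, case-insensitive matching rule are all assumptions.

```python
import re
from collections import Counter


def count_mentions(rows, disease, location):
    """Count, per year, the articles that mention both a disease and a location."""
    # \b keeps "cholera" from matching inside longer words; matching ignores case.
    disease_re = re.compile(rf"\b{re.escape(disease)}\b", re.IGNORECASE)
    location_re = re.compile(rf"\b{re.escape(location)}\b", re.IGNORECASE)
    counts = Counter()
    for row in rows:
        text = row["text"]
        if disease_re.search(text) and location_re.search(text):
            counts[row["year"]] += 1
    return dict(sorted(counts.items()))
```

Dividing these counts by the total number of articles per year would give a normalized mention rate, which is easier to compare across years with different numbers of digitised newspapers.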

Contact

This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.

Do you have questions, suggestions, or remarks? File an issue in the issue tracker or feel free to contact the team at odissei-soda.nl
