[ENHANCE] [DATA] Update NICD Hospitalization Data #924

Open
maximeLpt opened this issue Dec 1, 2021 · 13 comments
Labels: bug (Something isn't working), data, enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@maximeLpt

maximeLpt commented Dec 1, 2021

Hi All,
Would it be possible to have an update regarding the following dataset?
covid19za/data/nicd_hospital_surveillance_data.csv

Edit by @vukosim

Resume updating NICD Hospitalisation CSV File

Option 1 - Scraper: extract the daily table and then update the CSV (see the screenshot below).

Option 2 - Human computation: a volunteer updates the numbers every morning.

maximeLpt added the bug label Dec 1, 2021
vukosim added the data and enhancement labels Dec 2, 2021
vukosim changed the title from "[BUG] [DATA]" to "[BUG] [DATA] Update NICD Hospitalization Data" Dec 2, 2021
vukosim changed the title from "[BUG] [DATA] Update NICD Hospitalization Data" to "[ENHANCE] [DATA] Update NICD Hospitalization Data" Dec 2, 2021
@vukosim
Member

vukosim commented Dec 2, 2021

The challenge has been keeping up with the updates of the NICD Hospital Admissions reports. They are available here https://www.nicd.ac.za/diseases-a-z-index/disease-index-covid-19/surveillance-reports/daily-hospital-surveillance-datcov-report/

The second page of the daily reports has this table

image

Now if we can get someone to start doing the backfill (start with 1 December 2021, for example, and work backwards), that would be great @maximeLpt

vukosim added the good first issue label Dec 2, 2021
@maximeLpt
Author

Do you have a scraper for that?

@vukosim
Member

vukosim commented Dec 2, 2021

@maximeLpt no. It was initially filled in by a volunteer, day by day.

@anelda
Contributor

anelda commented Dec 2, 2021

Is there no chance NICD can/will just share the table in Excel or CSV format?

@vukosim
Member

vukosim commented Dec 2, 2021

A friend has made the request, we will see if she gets a response @anelda

vukosim self-assigned this Dec 2, 2021
@HerkulaasCombrink
Collaborator

Hey @anelda @vukosim, let me know what the NICD says. If no luck comes, then we can surely build a wrapper/scraper for the PDFs or find a way to parse the data from a different format?

@sjbeckett

Hi all,

I had a go at scraping the NICD datasets for data since the 27 October 2020 (when the daily reports table has 10 columns). There is still work to do to get some of the columns described in covid19za/data/nicd_hospital_surveillance_data.csv (general, high care, isolation, total health care workers admitted) and in filtering for the many ways in which dates appear in the raw pdf files. Hope that this can be useful as a basis for scraping if there is no luck with NICD:

https://gist.github.com/sjbeckett/f1d3822db7d41d33dfddd01814a64481
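For illustration only, here is a minimal Python sketch of one way to cope with dates that appear in several layouts in the extracted PDF text. It is not taken from the gist. The list of candidate formats and the function name `parse_report_date` are assumptions; the real reports may need other formats.

```python
from datetime import date, datetime
from typing import Optional

# Candidate layouts are assumptions for illustration; extend as real reports require.
CANDIDATE_FORMATS = ("%d %B %Y", "%d %b %Y", "%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def parse_report_date(raw: str) -> Optional[date]:
    """Return the first successful parse of `raw`, or None if no format matches."""
    # Collapse runs of whitespace, which PDF text extraction often introduces.
    cleaned = " ".join(raw.split())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next candidate layout
    return None
```

Returning `None` instead of raising makes it easy to collect the strings that did not match and add formats for them later. Day-first formats are tried before `%Y-%m-%d`, on the assumption that the reports use South African day-month-year ordering.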

@SivuyileNzimeni

I managed to write a fairly decent scraper for the PDFs. Perhaps the script can be enhanced by adding a filter for a particular day, e.g. Report_Date >= Sys.Date() in R, to avoid re-downloading each file. The script can run as a cron job.
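As a rough Python sketch of that filtering idea (the scripts in this thread are in R, so this is a translation of the concept, not the author's code): keep only reports newer than the latest one already saved. Comparing against the last saved date rather than today's date is an assumption made here so that missed days are picked up; the `(date, url)` tuple shape and the function name are hypothetical.

```python
from datetime import date
from typing import Iterable, List, Tuple

Report = Tuple[date, str]  # (report date, PDF URL); a hypothetical shape for illustration


def reports_to_download(listing: Iterable[Report], last_saved: date) -> List[Report]:
    """Keep only reports dated after the most recent one already stored, oldest first."""
    return sorted(r for r in listing if r[0] > last_saved)
```

A scheduled job could call this with the newest date found in the existing CSV and download only what comes back.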

Another script is the table scraper. The results were mixed. This could be due to changes in how the tables are formatted, how the files are generated, etc. Extracting each table from all files can be time-consuming. There is probably a way to convert the tables uniformly into a sound format using another language or technique.
directory containing the scripts

@vukosim
Member

vukosim commented Dec 9, 2021

> Hi all,
>
> I had a go at scraping the NICD datasets for data since the 27 October 2020 (when the daily reports table has 10 columns). There is still work to do to get some of the columns described in covid19za/data/nicd_hospital_surveillance_data.csv (general, high care, isolation, total health care workers admitted) and in filtering for the many ways in which dates appear in the raw pdf files. Hope that this can be useful as a basis for scraping if there is no luck with NICD:
>
> https://gist.github.com/sjbeckett/f1d3822db7d41d33dfddd01814a64481

Thank you. I think this is a good start, and we can then try to start filling in where the scraper could not.

@krokkie
Collaborator

krokkie commented Dec 11, 2021

All

I did some work on the hospitalization data too. Actually, my son did :-).
Not in the data/nicd_hospital_surveillance_data.csv file yet, but in another file with a different format.
See data/covid19za_provincial_raw_hospitalization.csv
The scraper lives in scripts/daily_nicd_datcov.R.
The GitHub workflow runs this script every night. The publication of these files appears to be manual, so I'm not 100% sure what the best time would be.

Remaining todo:

  • make sure the scraper runs stably
  • extract the relevant parts and update the old summary hospitalization file, which is partially done already.

@sjbeckett

> Hi all,
>
> I had a go at scraping the NICD datasets for data since the 27 October 2020 (when the daily reports table has 10 columns). There is still work to do to get some of the columns described in covid19za/data/nicd_hospital_surveillance_data.csv (general, high care, isolation, total health care workers admitted) and in filtering for the many ways in which dates appear in the raw pdf files. Hope that this can be useful as a basis for scraping if there is no luck with NICD:
>
> https://gist.github.com/sjbeckett/f1d3822db7d41d33dfddd01814a64481

Just a note: I updated the above gist to incorporate general and high care patient numbers.
Of note, isolation was no longer reported after 02-09-2020; and admitted healthcare workers were no longer reported after 08-10-2020.
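One way to reflect those cutoffs when assembling rows is to store an explicit missing value once a field stops being reported, so that it cannot be confused with a zero. The Python sketch below is only an illustration: the column names `isolation` and `hcw_admitted` are hypothetical, and the dates are read as day-month-year (2 September 2020 and 8 October 2020), which is an assumption.

```python
from datetime import date
from typing import Dict, Optional

# Last day each field was reported, reading the dates above as DD-MM-YYYY (assumed).
# Column names are hypothetical, not those of the repository's CSV.
LAST_REPORTED: Dict[str, date] = {
    "isolation": date(2020, 9, 2),
    "hcw_admitted": date(2020, 10, 8),
}


def blank_discontinued(row_date: date, row: Dict[str, Optional[int]]) -> Dict[str, Optional[int]]:
    """Return a copy of `row` with discontinued fields set to None after their cutoff."""
    cleaned = dict(row)
    for column, last_day in LAST_REPORTED.items():
        if row_date > last_day:
            cleaned[column] = None  # not reported any more: missing, not zero
    return cleaned
```

The input row is copied rather than modified, so the raw scraped values stay available for debugging.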

@dataprojectswithMJ

Greetings everyone. I took a different approach to this issue and created an API which is free and publicly available. At the moment, the database has data from 01 July 2021 to 28 Dec 2021.

The API has 2 main endpoints:

  1. /all:

    • Gets all data from the database
      Screenshot (23)
  2. /dates:

    • Allows filtering data between 2 specific dates. Province is not required, but you can specify it if you are looking for data for a specific province.

    Without Province
    Screenshot (22)

    With Province
    Screenshot (24)
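A hedged Python sketch of how a client might build a /dates request with an optional province follows. The host comes from the docs link in this comment; the query parameter names `start_date`, `end_date` and `province` are guesses for illustration and should be checked against the API's own documentation. The code only builds the URL and makes no network call.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://covidza-data.deta.dev"  # host taken from the docs link in this comment


def dates_url(start: date, end: date, province: Optional[str] = None) -> str:
    """Build a /dates query URL; the parameter names are assumptions, not the documented API."""
    if start > end:
        raise ValueError("start must not be after end")
    params = {"start_date": start.isoformat(), "end_date": end.isoformat()}
    if province is not None:
        params["province"] = province  # optional filter, omitted when not given
    return f"{BASE_URL}/dates?{urlencode(params)}"
```

For example, `dates_url(date(2021, 7, 1), date(2021, 12, 28))` covers the range the database is said to hold at the moment.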

Getting the PDFs from the NICD website and uploading the data to a cloud database is automated. Annotating the table in each PDF is still manual, because the formatting varies between documents; that is the only manual step so far.

Try the API and I would appreciate some pointers and thoughts:
https://covidza-data.deta.dev/docs

@vukosim
Member

vukosim commented Jan 3, 2022

Happy new year everyone. There will be a bit more action to finalise this in the next 2 weeks. Thank you so much for the work and ideas.
