@weiji14 weiji14 commented Nov 7, 2018

This Pull Request makes our data source list both human readable and machine readable. People wishing to examine the original data sources should be able to do so in a visible document (README.md). Scripts should be able to easily parse the URLs and SHA256 hashes tied to the data from a Comma Separated Values file (data_list.csv) or a YAML file (data_list.yml).

This is also where we get serious about data integrity, explicitly listing all the data files with their SHA256 checksums (required) and their download URLs (if available). Note that only the Pine Island Glacier datasets are missing a public URL.

Links upstream to #7, #8, #20, #21.

TODO:

  • Create data_list.csv listing all the current data files (b55a7c7)
  • Refactor data_prep.ipynb to use the data_list file instead of a dictionary, use pandas, implement sha256 checks (e47fb90)
  • Create data_list.yml file including metadata (e.g. DOIs) for each dataset (115a8ed)
  • Refresh and reformat the 3 README.md files in highres/lowres/misc folders (0b241e0)

Getting serious with data integrity. Explicitly listing all the data files with their SHA256 checksums (required) and their download URL (if available). Note that only the Pine Island Glacier datasets are missing a URL. Links upstream to #7, #8, #20, #21.
@weiji14 weiji14 added the enhancement ✨ New feature or request label Nov 7, 2018
@weiji14 weiji14 added this to the v0.4.0 milestone Nov 7, 2018
@weiji14 weiji14 self-assigned this Nov 7, 2018
Part of #48. Refactor data download code to use pandas table loaded from data_list.csv, and implement sha256 hash checks. Note that we don't actually have a proper url for directly downloading the Pine Island Glacier dataset (#7).
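The hash check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in data_prep.ipynb: the PR uses a pandas table, whereas this sketch swaps in the standard library `csv` module to stay dependency-free. The column names `filename`, `url` and `sha256`, and the helper names `sha256_of_file` and `verify_data_list`, are assumptions made for the example.

```python
import csv
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_data_list(csv_path, data_dir):
    """Return the filenames whose file is missing or whose hash differs.

    Assumes (hypothetically) that the CSV has 'filename' and 'sha256'
    columns. The 'url' column is not needed for the check, so rows with
    an empty URL (e.g. datasets without a public link) are still verified.
    """
    failed = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            path = os.path.join(data_dir, row["filename"])
            if not os.path.isfile(path) or sha256_of_file(path) != row["sha256"]:
                failed.append(row["filename"])
    return failed
```

Calling `verify_data_list("data_list.csv", "lowres")` would return an empty list when every listed file is present and matches its recorded checksum (the folder name here is only an example).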
Relates to #8, #20. Adding in the persistent digital object identifier for the REMA dataset. Point to literature by Noh and Howat 2018, and also the dataset DOI. Also changed the MEaSUREsV2 dataset to point to the DOI link instead of NSIDC.
@weiji14 weiji14 force-pushed the data_list branch 2 times, most recently from 95bd90b to 4632010 on November 9, 2018 03:04
Switch to a format that is both human and machine readable - YAML. See http://yaml.org/spec/1.2/spec.html. Datasets are grouped into blocks, with each block having its own metadata (e.g. dataset and literature DOI links). Aiming to automatically generate README.md files from this YAML file into the highres/lowres/misc folders.
Standardize the README.md files across the highres/lowres/misc folders. Each has the columns Filename, Location, Resolution, Literature Citation and Data Citation. Wrote some Python pandas code to parse the data_list.yml. Note that high resolution datasets have 'nan' as their resolution as they are point datasets (i.e. this was placed in the too-hard basket).
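As a rough sketch of how a README table could be generated from the parsed YAML: the function below takes a list of dataset blocks (the kind of nested structure a YAML loader would return) and emits a Markdown table with the five columns named above. The actual layout of data_list.yml is not shown in this conversation, so the keys `location`, `resolution`, `literature_doi`, `data_doi` and `files` are hypothetical, as is the function name `readme_table`. The commit itself uses pandas; plain Python is used here to keep the example self-contained.

```python
HEADER = ["Filename", "Location", "Resolution", "Literature Citation", "Data Citation"]


def readme_table(datasets):
    """Build a Markdown table with one row per file in each dataset block.

    `datasets` is a list of dicts with hypothetical keys; block-level
    metadata (location, resolution, DOIs) is repeated on every file row.
    A missing resolution is written as 'nan', e.g. for point datasets.
    """
    lines = [
        "| " + " | ".join(HEADER) + " |",
        "|" + "|".join(["---"] * len(HEADER)) + "|",
    ]
    for block in datasets:
        for entry in block["files"]:
            row = [
                entry["filename"],
                block.get("location", ""),
                str(block.get("resolution", "nan")),
                block.get("literature_doi", ""),
                block.get("data_doi", ""),
            ]
            lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Placeholder data only; these are not real filenames or DOIs.
example = [
    {
        "location": "Somewhere",
        "literature_doi": "10.0000/placeholder-paper",
        "data_doi": "10.0000/placeholder-data",
        "files": [{"filename": "example_points.txt"}],
    }
]
print(readme_table(example))
```

Repeating the block-level metadata on every row keeps each line of the table readable on its own, at the cost of some duplication in the generated README.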
@weiji14 weiji14 changed the title from "WIP Refactor data download part with sha256 hash checks" to "Make data sources readable and incorporate sha256 hash checks" Nov 9, 2018
@weiji14 weiji14 merged commit 0b241e0 into master Nov 9, 2018
@weiji14 weiji14 deleted the data_list branch November 9, 2018 05:01
weiji14 added a commit that referenced this pull request Nov 15, 2018
Did not realize that my local copy of bed_WGS84_grid.txt was corrupted, so changing the hash. Also remove extra whitespace in the data_list.yml file. Patches #48.