Make data sources readable and incorporate sha256 hash checks #48

weiji14 · 2018-11-07T03:41:23Z

The work in this Pull Request ensures that our data source list is both human readable and machine readable. People wishing to examine the original data sources should be able to do so through a visible document (README.md). Scripts should be able to easily parse the URL links and SHA256 hashes tied to the data via something like a Comma Separated Values file (data_list.csv) or a YAML file (data_list.yml).

This is also where we get serious about data integrity, by explicitly listing all the data files with their SHA256 checksums (required) and their download URLs (if available). Note that only the Pine Island Glacier datasets are missing a public URL.

Links upstream to #7, #8, #20, #21.

TODO:

- ~~Create data_list.csv listing all the current data files~~ (b55a7c7)
- Refactor data_prep.ipynb to use the data_list file instead of a dictionary, use pandas, implement sha256 checks (e47fb90)
- Create data_list.yml file including metadata (e.g. DOIs) for each dataset (115a8ed)
- Refresh and reformat the 3 README.md files in highres/lowres/misc folders (0b241e0)

Getting serious about data integrity. Explicitly list all the data files with their SHA256 checksums (required) and their download URLs (if available). Note that only the Pine Island Glacier datasets are missing a URL. Links upstream to #7, #8, #20, #21.

Part of #48. Refactor the data download code to use a pandas table loaded from data_list.csv, and implement sha256 hash checks. Note that we don't actually have a proper URL for directly downloading the Pine Island Glacier dataset (#7).
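As a rough illustration of the hash check described above, here is a minimal sketch. It is not the code from data_prep.ipynb: the column names (`filename`, `url`, `sha256`) and the helper names are assumptions, and it uses the standard library `csv` module instead of pandas to stay dependency-free.

```python
import csv
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_data_list(csv_path, data_dir):
    """Map each listed filename to True if the file exists and its hash matches."""
    results = {}
    with open(csv_path, newline="") as f:
        # Assumed (hypothetical) columns: filename, url, sha256
        for row in csv.DictReader(f):
            path = os.path.join(data_dir, row["filename"])
            results[row["filename"]] = (
                os.path.isfile(path) and sha256_of_file(path) == row["sha256"]
            )
    return results


# Small self-contained demo using throwaway files instead of real datasets
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.txt"), "wb") as f:
        f.write(b"hello")
    with open(os.path.join(tmp, "bad.txt"), "wb") as f:
        f.write(b"corrupted")  # contents differ from what the list expects

    csv_path = os.path.join(tmp, "data_list.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "url", "sha256"])
        writer.writerow(["good.txt", "", hashlib.sha256(b"hello").hexdigest()])
        writer.writerow(["bad.txt", "", hashlib.sha256(b"original").hexdigest()])
        writer.writerow(["absent.txt", "", "0" * 64])  # listed but not on disk

    results = check_data_list(csv_path, tmp)
```

A download script could skip files whose entry is `True` and re-download or flag the rest. Reading in chunks keeps memory use flat for large raster files.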

Relates to #8, #20. Add the persistent digital object identifier (DOI) for the REMA dataset, pointing to the literature by Noh and Howat 2018 and also to the dataset DOI. Also change the MEaSUREsV2 dataset to point to the DOI link instead of NSIDC.

Switch to a format that is both human and machine readable - YAML. See http://yaml.org/spec/1.2/spec.html. Datasets are grouped into blocks, with each block having its own metadata (e.g. dataset and literature DOI links). Aiming to automatically generate README.md files from this YAML file into the highres/lowres/misc folders.
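To make the block layout concrete, here is a minimal sketch of what one parsed block might look like in Python, and how it could be flattened into one record per file. The dictionary stands in for the output of a YAML parser. All key names (`citekey`, `folder`, `doi`, `files`) and all values are hypothetical placeholders, not the actual schema or contents of data_list.yml.

```python
# Hypothetical structure mirroring what a YAML parser might return:
# a list of blocks, each holding shared metadata plus its own files.
blocks = [
    {
        "citekey": "EXAMPLE2018",  # placeholder, not a real citation
        "folder": "lowres",
        "doi": {
            "dataset": "https://doi.org/EXAMPLE-DATA",      # placeholder
            "literature": "https://doi.org/EXAMPLE-PAPER",  # placeholder
        },
        "files": [
            {"filename": "example_a.tif", "sha256": "0" * 64},
            {"filename": "example_b.tif", "sha256": "1" * 64},
        ],
    },
]


def flatten_blocks(blocks):
    """Return one flat record per file, copying block metadata onto it."""
    records = []
    for block in blocks:
        shared = {
            "citekey": block["citekey"],
            "folder": block["folder"],
            "dataset_doi": block["doi"]["dataset"],
            "literature_doi": block["doi"]["literature"],
        }
        for entry in block["files"]:
            records.append({**shared, **entry})
    return records


records = flatten_blocks(blocks)
```

Keeping the DOIs at block level means they are written once per dataset rather than once per file, while the flat records remain convenient for building tables.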

Standardize the README.md files across the highres/lowres/misc folders. Each has the columns Filename, Location, Resolution, Literature Citation and Data Citation. Wrote some Python pandas code to parse the data_list.yml. Note that high resolution datasets have 'nan' as their resolution because they are point datasets (i.e. this was placed in the too-hard basket).
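The README tables could be produced along these lines. This is a sketch only: it uses plain Python rather than the pandas code from the commit, the function name and row values are made up for illustration, and missing values are printed as `nan` simply to mimic how pandas displays them.

```python
COLUMNS = ["Filename", "Location", "Resolution",
           "Literature Citation", "Data Citation"]


def to_markdown_table(rows, columns=COLUMNS):
    """Render a list of row dicts as a GitHub-flavored Markdown table."""
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        # A missing column is shown as 'nan', e.g. Resolution for point data
        cells = [str(row.get(col, "nan")) for col in columns]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Placeholder rows; none of these values come from the real data list
rows = [
    {"Filename": "example_grid.tif", "Location": "Antarctica",
     "Resolution": "1000m", "Literature Citation": "Example paper",
     "Data Citation": "Example DOI"},
    {"Filename": "example_points.xyz", "Location": "Example Glacier",
     "Literature Citation": "Example paper", "Data Citation": "Example DOI"},
]

table = to_markdown_table(rows)
```

Writing `table` to a README.md in each of the highres/lowres/misc folders would keep the human-readable documents in sync with the machine-readable list.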

Did not realize that my local copy of bed_WGS84_grid.txt was corrupted, so changing the hash. Also remove extra whitespace in the data_list.yml file. Patches #48.

weiji14 added the enhancement ✨ New feature or request label Nov 7, 2018

weiji14 added this to the v0.4.0 milestone Nov 7, 2018

weiji14 self-assigned this Nov 7, 2018

weiji14 added 2 commits November 8, 2018 15:42

weiji14 force-pushed the data_list branch 2 times, most recently from 95bd90b to 4632010 Compare November 9, 2018 03:04

weiji14 added 2 commits November 9, 2018 17:13

weiji14 force-pushed the data_list branch from 4632010 to 0b241e0 Compare November 9, 2018 04:37

weiji14 changed the title ~~WIP Refactor data download part with sha256 hash checks~~ Make data sources readable and incorporate sha256 hash checks Nov 9, 2018

weiji14 merged commit 0b241e0 into master Nov 9, 2018

weiji14 deleted the data_list branch November 9, 2018 05:01
