ohw22-proj-kerchunk

Repository for prototyping kerchunking of open AWS datasets.

Collaborators

Paul Branson
Sean Harkins
Chuck Daniels
Aimee Barciauskas
Alex Mandel
Anthony Lukach

Background

So you found an open data bucket with a heap of data you want to analyse. Maybe it is dense grids of numerical model data at global scale in NetCDF format, or a stack of daily satellite earth observation TIFF files. If only you could open them all as if they were one dataset... Well, maybe you can! The Kerchunk library builds on top of the fsspec library and provides an interface via Zarr to create an Xarray dataset overlay that can span many files stored in object storage.
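To make the "overlay" idea concrete, here is a minimal sketch of what a kerchunk reference set looks like: Zarr-style keys that either hold inline metadata or point at a byte range inside an existing file. The bucket name, file names, offsets and lengths are made up for illustration, and `resolve_chunk` is a hypothetical helper, not part of kerchunk.

```python
import json

# Hypothetical reference set in the kerchunk "version 1" layout.
# Bucket, file names, offsets and lengths are invented for illustration.
refs = {
    "version": 1,
    "refs": {
        # Zarr metadata is stored inline as JSON strings.
        ".zgroup": json.dumps({"zarr_format": 2}),
        "sst/.zarray": json.dumps({
            "shape": [2, 4], "chunks": [1, 4], "dtype": "<f4",
            "compressor": None, "fill_value": None,
            "filters": None, "order": "C", "zarr_format": 2,
        }),
        # Each chunk key points at [url, byte offset, byte length]
        # inside an original file that is left untouched.
        "sst/0.0": ["s3://example-bucket/sst_day1.nc", 4096, 16],
        "sst/1.0": ["s3://example-bucket/sst_day2.nc", 4096, 16],
    },
}


def resolve_chunk(reference_set, key):
    """Return (url, offset, length) for a chunk key, or None for inline entries.

    Assumes the three-element [url, offset, length] form for chunk entries.
    """
    entry = reference_set["refs"][key]
    if isinstance(entry, str):
        return None  # inline metadata or inline data, nothing to fetch
    url, offset, length = entry
    return url, offset, length


print(resolve_chunk(refs, "sst/1.0"))
```

A reader such as Zarr never sees the NetCDF files directly: it asks for `sst/1.0` and fsspec turns that into a ranged read against the second file.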

Analysing a large stack of dense array data stored in open data buckets with traditional libraries can be hard, slow and sometimes not even possible. Whilst in an ideal world, datasets would be published in an Analysis Ready format, this is frequently not the case and not universally possible due to the variety of access patterns that may preclude a notionally ideal chunking layout.

Given that these datasets are large and owned by someone else, and that many researchers have limited organisational capacity to rechunk and mirror them, solutions that enhance the ability to analyse published datasets 'as-is' are valuable. Recent examples here and here of such an approach are spurring a flurry of activity, with people 'kerchunk'-ing available datasets, as evidenced by the burgeoning kerchunk issues list!

But once a dataset has been kerchunk-ed, others could reuse that index. Indexes typically compress considerably into small (<100 MB) files that could be shared for others to reuse or fed into a Pangeo-Forge recipe.
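As a rough illustration of why an index compresses well, the sketch below builds a synthetic reference set and gzips its JSON form. Everything here is an assumption: the URLs and byte ranges are invented, `synthetic_index` is a hypothetical helper, and gzip simply stands in for whichever compression or storage format is actually used for sharing.

```python
import gzip
import json


def synthetic_index(n_files, chunks_per_file=8):
    """Build a fake reference set with repetitive URLs and regular offsets."""
    refs = {}
    for i in range(n_files):
        url = f"s3://example-bucket/sst/file_{i:05d}.nc"  # hypothetical path
        for j in range(chunks_per_file):
            # [url, byte offset, byte length], as in a kerchunk reference entry
            refs[f"sst/{i}.{j}"] = [url, 4096 + j * 65536, 65536]
    return {"version": 1, "refs": refs}


raw = json.dumps(synthetic_index(1000)).encode("utf-8")
packed = gzip.compress(raw)

# Sizes in bytes before and after compression.
print(len(raw), len(packed))
```

The long, nearly identical URLs repeated for every chunk are what make the JSON so compressible; real indexes will differ in how much they shrink.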

Goals

Build an understanding of how kerchunk works (Done)
Identify some priority open data datasets to kerchunk (Done)
Test out a few file storage options for kerchunk indexes (Didn't get to)
Discuss options for sharing a kerchunk index (Didn't get to)
Create a pangeo-forge recipe. See pangeo-forge/staged-recipes#173 (In Progress)

Progress towards the goals was tracked using GitHub issues: https://github.com/oceanhackweek/ohw22-proj-kerchunk/issues

Datasets

Himawari 8 Sea Surface Temperature (https://registry.opendata.aws/noaa-himawari/)

Workflow

Identify datasets
Examine dataset file format and internal chunking
Generate kerchunk index
Analyse dataset with GCM-Filters
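The index-generation step above can be sketched roughly as follows. This is one possible outline, not the code used in the project notebooks: it assumes NetCDF4/HDF5 files in a public S3 bucket, anonymous access, and dimension names `time`, `lat` and `lon`, all of which need checking against the real dataset. `with_protocol`, `build_single_reference`, `combine_references` and `open_reference_dataset` are hypothetical helper names; the GCM-Filters analysis step is left out.

```python
def with_protocol(paths, protocol="s3"):
    """Prefix bare 'bucket/key' paths (as returned by an fsspec glob) with a protocol."""
    return [p if "://" in p else f"{protocol}://{p}" for p in paths]


def build_single_reference(url, storage_options=None):
    """Scan one NetCDF4/HDF5 file and return its kerchunk reference dict."""
    import fsspec
    from kerchunk.hdf import SingleHdf5ToZarr

    storage_options = storage_options or {"anon": True}  # assumed public bucket
    with fsspec.open(url, mode="rb", **storage_options) as f:
        # Small chunks are inlined into the index instead of referenced.
        return SingleHdf5ToZarr(f, url, inline_threshold=300).translate()


def combine_references(single_refs, concat_dim="time", identical_dims=("lat", "lon")):
    """Merge per-file references into one index along an assumed time dimension."""
    from kerchunk.combine import MultiZarrToZarr

    mzz = MultiZarrToZarr(
        list(single_refs),
        remote_protocol="s3",
        remote_options={"anon": True},
        concat_dims=[concat_dim],
        identical_dims=list(identical_dims),
    )
    return mzz.translate()


def open_reference_dataset(combined):
    """Open a combined reference dict lazily as an Xarray dataset via Zarr."""
    import fsspec
    import xarray as xr

    fs = fsspec.filesystem(
        "reference",
        fo=combined,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    return xr.open_dataset(
        fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
    )
```

Usage would look something like `open_reference_dataset(combine_references(build_single_reference(u) for u in urls))`, where `urls` comes from globbing the bucket and passing the result through `with_protocol`. Exact keyword arguments can differ between kerchunk, Zarr and Xarray versions, so treat the calls as a starting point.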
