Repository for prototyping kerchunking of open AWS datasets.
- Paul Branson
- Sean Harkins
- Chuck Daniels
- Aimee Barciauskas
- Alex Mandel
- Anthony Lukach
So you found an open data bucket with a heap of data you want to analyse. Maybe it is dense grids of numerical model data at global scale in NetCDF format, or a stack of daily satellite earth observation TIFF files. If only you could open them all as if they were one dataset... Well, maybe you can! The Kerchunk library builds on top of the fsspec library and provides an interface via Zarr to create an Xarray dataset overlay that can span many files stored in object storage.
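To make the idea concrete, here is a minimal, standard-library-only sketch of what a reference index boils down to: Zarr-style keys that point at byte ranges inside existing files. It does not use kerchunk itself. The local file, the `sst` variable name and the `read_key` helper are all assumptions made for illustration, and the layout only loosely follows kerchunk's version 1 JSON format.

```python
import json
import os
import tempfile

# Stand-in for a remote NetCDF file: an 8-byte header followed by two
# 4-byte uncompressed chunks. A real file would live in object storage.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "fake_sst.bin")  # hypothetical file name
with open(data_path, "wb") as f:
    f.write(b"HEADER--" + bytes([1, 2, 3, 4]) + bytes([5, 6, 7, 8]))

# Zarr v2 array metadata for an 8-element uint8 array split into 2 chunks.
zarray = {
    "zarr_format": 2,
    "shape": [8],
    "chunks": [4],
    "dtype": "|u1",
    "compressor": None,
    "filters": None,
    "fill_value": 0,
    "order": "C",
}

# The index: metadata is stored inline as JSON strings, while chunk keys
# map to [url, offset, length] triples pointing into the original file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "sst/.zarray": json.dumps(zarray),
        "sst/0": [data_path, 8, 4],
        "sst/1": [data_path, 12, 4],
    },
}


def read_key(refs, key):
    """Resolve one key: inline metadata or a byte range in the source file."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        return entry.encode("utf-8")
    url, offset, length = entry
    with open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Read both chunks without ever touching the header bytes.
values = list(read_key(refs, "sst/0") + read_key(refs, "sst/1"))
print(values)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

In practice fsspec's reference filesystem plays the role of `read_key`, issuing range requests against object storage, and Zarr and Xarray sit on top of it.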
Analysing a large stack of dense array data stored in open data buckets with traditional libraries can be hard, slow and sometimes not even possible. Whilst in an ideal world datasets would be published in an Analysis Ready format, this is frequently not the case. Nor is it universally possible, because different access patterns call for different chunking layouts, so no single layout is ideal for every use.
These datasets are large and owned by someone else, and many researchers have limited organisational capacity to rechunk and mirror them, so solutions that make it easier to analyse published datasets 'as-is' are valuable. Recent examples of such an approach are spurring a flurry of people 'kerchunk'-ing available datasets, as evidenced by the burgeoning kerchunk issues list!
But once a dataset has been kerchunk-ed, others could use that index. Indexes typically compress considerably into small (<100 MB) files that could be shared for others to reuse or fed into a Pangeo-Forge recipe.
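As a rough illustration of why these indexes compress well, the sketch below builds a synthetic reference set and gzips it using only the standard library. The bucket name, file names, offsets and chunk sizes are all made up, and real indexes are less regular, so the ratio here is optimistic rather than a measurement.

```python
import gzip
import json


def make_refs(n_files=200, chunks_per_file=50):
    """Build a synthetic reference set with one entry per chunk."""
    refs = {".zgroup": json.dumps({"zarr_format": 2})}
    for i in range(n_files):
        # Hypothetical object URL; every chunk in a file repeats it.
        url = f"s3://example-bucket/sst/file_{i:04d}.nc"
        for j in range(chunks_per_file):
            # [url, offset, length] with made-up, evenly spaced offsets.
            refs[f"sst/{i}.{j}"] = [url, 4096 + j * 65536, 65536]
    return {"version": 1, "refs": refs}


raw = json.dumps(make_refs()).encode("utf-8")
packed = gzip.compress(raw)

# Round-trip to confirm that nothing is lost by compressing the index.
restored = json.loads(gzip.decompress(packed))

print(f"raw: {len(raw)} bytes, gzip: {len(packed)} bytes")
```

The repeated URLs and key prefixes are what make the JSON so compressible, which is why a shared index can stay small even when it spans many files.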
- Build an understanding of how kerchunk works (Done)
- Identify some priority open data datasets to kerchunk (Done)
- Test out a few file storage options for kerchunk indexes (Didn't get to)
- Discuss options for sharing a kerchunk index (Didn't get to)
- Create a pangeo-forge recipe. See pangeo-forge/staged-recipes#173 (In Progress)
Progress towards the goals was tracked using GitHub issues: https://github.com/oceanhackweek/ohw22-proj-kerchunk/issues
Himawari 8 Sea Surface Temperature (https://registry.opendata.aws/noaa-himawari/)
- Identify datasets
- Examine dataset file format and internal chunking
- Generate kerchunk index
- Analyse dataset with GCM-Filters