Let XarrayZarrRecipe use fsspec references for opening netCDF inputs #218
Conversation
Specifically, h5py - it seems that these are attributes that h5netcdf is swallowing to apply CDF conventions. The code could clean them out, too, although they probably don't do any harm.
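If the recipe did want to clean them out, a sketch along these lines could work. This is an assumption-laden illustration: the attribute names below are the low-level HDF5 bookkeeping attributes h5netcdf is known to hide, but the exact set may vary between versions, and the helper name is hypothetical.

# Hypothetical cleanup helper for an xarray.Dataset read via references.
HIDDEN_ATTRS = {
    "REFERENCE_LIST", "CLASS", "DIMENSION_LIST", "NAME",
    "_Netcdf4Dimid", "_Netcdf4Coordinates", "_NCProperties",
}

def drop_h5py_attrs(ds):
    # Strip the bookkeeping attributes from every variable and from the
    # dataset itself, returning the dataset for chaining.
    for var in ds.variables.values():
        for key in HIDDEN_ATTRS & set(var.attrs):
            del var.attrs[key]
    for key in HIDDEN_ATTRS & set(ds.attrs):
        del ds.attrs[key]
    return ds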
Well it seems pretty simple on a quick read!
Am I right in thinking that the input to reference-maker is always single files in this model? It may be efficient to pre-combine sets of filenames (if the combination logic is sound, of course). I would also argue, if practical, that a reference set for the whole aggregated dataset should always be created as a side-product of copying to zarr. If a future recipe wants to make a new chunking or re-copy the data (which hasn't changed), then reusing that artefact would save a lot of CPU time, as well as, of course, allow direct access to the original bytes.
(Be warned: opening many reference files with open_mfdataset can still use a lot of memory when it is set to check the coordinates or attributes of the whole dataset; I don't suppose this is any different for input via virtual zarr versus any other.)
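For concreteness, a hedged sketch of that pre-combination step using MultiZarrToZarr from kerchunk (the name fsspec-reference-maker was later released under); the protocol, concat dimension, and the pre-existing per_file_refs list are all assumptions:

import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr

# per_file_refs: one reference dict per input netCDF file, assumed to
# already exist (e.g. produced by create_hdf5_reference).
mzz = MultiZarrToZarr(
    per_file_refs,
    remote_protocol="s3",   # hypothetical: wherever the inputs live
    concat_dims=["time"],   # hypothetical concatenation dimension
)
combined_refs = mzz.translate()  # one reference set for the whole dataset

# The combined artefact can then be reused for direct access later:
fs = fsspec.filesystem("reference", fo=combined_refs, remote_protocol="s3")
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)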
import numpy as np
import xarray as xr
import zarr

from ..chunk_grid import ChunkGrid
from ..patterns import CombineOp, DimIndex, FilePattern, Index
from ..reference import create_hdf5_reference, unstrip_protocol
Note that _unstrip_protocol is now available in fsspec.utils. A more thorough one for chained FSs would be good, but not necessary for the work here.
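For reference, a minimal round trip with the fsspec version, assuming a recent fsspec release; the bucket and file names are hypothetical:

import fsspec
from fsspec.utils import _unstrip_protocol  # available in recent fsspec

fs = fsspec.filesystem("s3", anon=True)
path = fs._strip_protocol("s3://some-bucket/file.nc")  # -> "some-bucket/file.nc"
url = _unstrip_protocol(path, fs)                      # -> "s3://some-bucket/file.nc"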
ref_data = create_hdf5_reference(fp, url, fname)

ref_fname = _input_reference_fname(input_key)
metadata_cache[ref_fname] = ref_data
Maybe this is a recipe detail, but it would be good to figure out where we can store these reference files for the future, perhaps together with a hash/uid of the file they were created from.
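One possible shape for that, sketched with a hypothetical helper that keys the cached reference file on a digest of the source URL and its size; none of these names are part of this PR:

import hashlib
import json

import fsspec

def store_reference(ref_data, source_url, cache_url):
    # Hypothetical helper: derive a uid from the source URL and its size,
    # so a changed input naturally invalidates the cached references.
    fs, path = fsspec.core.url_to_fs(source_url)
    size = fs.info(path).get("size")
    uid = hashlib.sha256(f"{source_url}:{size}".encode()).hexdigest()[:16]
    target = f"{cache_url}/{uid}.json"
    with fsspec.open(target, "w") as out:
        json.dump(ref_data, out)
    return target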
        secrets=file_pattern.query_string_secrets,
        **file_pattern.fsspec_open_kwargs,
    ) as f:
        with dask.config.set(scheduler="single-threaded"):  # make sure we don't use a scheduler
Why do we not want a scheduler? Because we are dealing with local files?
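For context, a standalone illustration of what that setting does (not this PR's code): any compute inside the block runs synchronously on the calling thread rather than on a thread pool, which is the usual way to avoid nested parallelism inside an already-parallel task.

import dask
import dask.array as da

arr = da.ones((1000, 1000), chunks=(100, 100))
with dask.config.set(scheduler="single-threaded"):
    total = arr.sum().compute()  # executes on the calling thread, no pool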
Is there anything I can do to help this PR?
Thanks for the ping, Martin! I too really want to move this forward. I guess I am stuck on this problem:
Your response:
is helpful but not sufficient to move forward. We need to make a decision about how to handle these extra attributes. Data providers care very much about these metadata attributes. The dataset attributes often conform to specific conventions like CF. So the ideal behavior would be if the dataset that is produced by fsspec-reference-maker is identical in every detail to the one read from the HDF5 file. I would like to see this addressed in fsspec-reference-maker, e.g. by replicating the h5netcdf logic. If that fix can be made, then the tests here will just pass without any special casing.
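In test form, the ideal behavior described above might look roughly like this; the URL and reference filename are hypothetical, and assert_identical fails on any attribute mismatch, which is exactly the property wanted here:

import fsspec
import xarray as xr

url = "s3://some-bucket/file.nc"  # hypothetical input

with fsspec.open(url) as f:
    # Baseline: what h5netcdf produces.
    expected = xr.open_dataset(f, engine="h5netcdf")

    # Candidate: the same file opened through its references.
    ref_fs = fsspec.filesystem(
        "reference", fo="file_reference.json", remote_protocol="s3"
    )
    actual = xr.open_dataset(
        ref_fs.get_mapper(""), engine="zarr",
        backend_kwargs={"consolidated": False},
    )

    xr.testing.assert_identical(expected, actual)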
Awesome work in fsspec/kerchunk#89! Any chance you can make an fsspec-reference-maker release so we can update our dependencies here?
released as 0.0.4
We are only failing due to the memory-usage test from #220, so I'm going to merge.
Potentially closes #177 by allowing us to bypass h5py when reading netCDF inputs from cloud storage.
After #213 this is pretty simple.
The tests should fail like this:
It looks like fsspec-reference-maker is giving more metadata attributes than the h5netcdf library.
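Given the two datasets opened both ways, a small hypothetical helper makes that discrepancy easy to inspect:

def extra_attrs(ds_ref, ds_h5):
    # Which attributes does the reference-based read expose that the
    # h5netcdf read does not?
    diff = {}
    for name in ds_ref.variables:
        extra = set(ds_ref[name].attrs) - set(ds_h5[name].attrs)
        if extra:
            diff[name] = sorted(extra)
    return diff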