
Let XarrayZarrRecipe use fsspec references for opening netCDF inputs #218

Merged (3 commits) on Oct 15, 2021

Conversation

@rabernat (Contributor) commented on Oct 2, 2021:

Potentially closes #177 by allowing us to bypass h5py when reading netCDF inputs from cloud storage.

After #213 this is pretty simple.
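For context, the read path this enables looks roughly like the sketch below. This is illustrative, not the recipe's actual code: the URL is made up, and the names follow fsspec-reference-maker 0.0.x (SingleHdf5ToZarr) plus plain fsspec/xarray.

    import fsspec
    import xarray as xr
    from fsspec_reference_maker.hdf import SingleHdf5ToZarr

    url = "s3://some-bucket/data.nc"  # hypothetical input file

    # scan the HDF5 file once, extracting chunk byte-ranges and metadata
    with fsspec.open(url, mode="rb", anon=True) as f:
        references = SingleHdf5ToZarr(f, url).translate()

    # open the netCDF file as a virtual zarr store: reads now go through
    # fsspec byte-range requests instead of h5py
    mapper = fsspec.get_mapper(
        "reference://",
        fo=references,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    ds = xr.open_zarr(mapper, consolidated=False)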

The tests currently fail like this:

recipe-netcdf_local_file_pattern_sequential-netcdf_local_paths_sequential_2d-execute_recipe_manual] FAILED [  6%]

=================================================== FAILURES ===================================================
_ test_recipe_with_references[execute_recipe_no_prefect-netCDFtoZarr_recipe-netcdf_local_file_pattern_sequential-netcdf_local_paths_sequential_2d-execute_recipe_manual] _

recipe_fixture = (<class 'pangeo_forge_recipes.recipes.xarray_zarr.XarrayZarrRecipe'>, <FilePattern {'time': 5}>, {'input_cache': Cache...t 0x1801146a0>, root_path='/private/var/folders/n8/63q49ms55wxcj_gfbtykwp5r0000gn/T/pytest-of-rpa/pytest-45/target11'))
execute_recipe = <function execute_recipe_manual.<locals>.execute at 0x180e78ee0>

    @pytest.mark.parametrize("recipe_fixture", all_recipes)
    def test_recipe_with_references(recipe_fixture, execute_recipe):
        """Same as above, but use fsspec references for opening the files."""
    
        RecipeClass, file_pattern, kwargs, ds_expected, target = recipe_fixture
        rec = RecipeClass(file_pattern, open_input_with_fsspec_reference=True, **kwargs)
        execute_recipe(rec)
        ds_actual = xr.open_zarr(target.get_mapper()).load()
>       xr.testing.assert_identical(ds_actual, ds_expected)
E       AssertionError: Left and right Dataset objects are not identical
E       
E       Differing coordinates:
E       L * time     (time) datetime64[ns] 2010-01-01 2010-01-02 ... 2010-01-10
E           NAME: time
E           _Netcdf4Dimid: 0
E       R * time     (time) datetime64[ns] 2010-01-01 2010-01-02 ... 2010-01-10
E       L * lon      (lon) float64 5.0 15.0 25.0 35.0 45.0 ... 325.0 335.0 345.0 355.0
E           NAME: lon
E           _Netcdf4Dimid: 2
E           long_name: longitude
E           units: degrees_east
E       R * lon      (lon) float64 5.0 15.0 25.0 35.0 45.0 ... 325.0 335.0 345.0 355.0
E           units: degrees_east
E           long_name: longitude
E       L * lat      (lat) float64 5.0 15.0 25.0 35.0 45.0 ... 145.0 155.0 165.0 175.0
E           NAME: lat
E           _Netcdf4Dimid: 1
E           long_name: latitude
E           units: degrees_north
E       R * lat      (lat) float64 5.0 15.0 25.0 35.0 45.0 ... 145.0 155.0 165.0 175.0
E           units: degrees_north
E           long_name: latitude
E       Differing data variables:
E       L   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
E           _Netcdf4Coordinates: [0, 1, 2]
E           _Netcdf4Dimid: 0
E           long_name: Fantastic Foo
E       R   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
E           long_name: Fantastic Foo
E       L   bar      (time, lat, lon) float64 9.0 4.0 3.0 2.0 2.0 ... 3.0 3.0 9.0 5.0
E           _Netcdf4Coordinates: [0, 1, 2]
E           _Netcdf4Dimid: 0
E           long_name: Beautiful Bar
E       R   bar      (time, lat, lon) int64 9 4 3 2 2 8 0 0 4 8 ... 9 8 4 4 6 0 3 3 9 5
E           long_name: Beautiful Bar
E       Attributes only on the left object:
E           _NCProperties: version=2,netcdf=4.8.0,hdf5=1.10.6

tests/recipe_tests/test_XarrayZarrRecipe.py:143: AssertionError

It looks like fsspec-reference-maker is giving more metadata attributes than the h5netcdf library.

@rabernat requested a review from @martindurant on Oct 2, 2021.
@martindurant (Contributor):

> It looks like fsspec-reference-maker is giving more metadata attributes than the h5netcdf library.

Specifically, h5py: it seems these are attributes that h5netcdf swallows in order to apply netCDF conventions. The code could clean them out too, although they probably don't do any harm.
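If the code did clean them out on the read side, it could look something like this sketch (the attribute names come from the test failure above; drop_hdf5_internal_attrs is a hypothetical helper, not part of this PR):

    import xarray as xr

    # attributes h5netcdf normally hides (names taken from the failure above)
    HIDDEN_ATTRS = {"NAME", "_Netcdf4Dimid", "_Netcdf4Coordinates", "_NCProperties"}

    def drop_hdf5_internal_attrs(ds: xr.Dataset) -> xr.Dataset:
        ds = ds.copy()
        ds.attrs = {k: v for k, v in ds.attrs.items() if k not in HIDDEN_ATTRS}
        for var in ds.variables.values():
            var.attrs = {k: v for k, v in var.attrs.items() if k not in HIDDEN_ATTRS}
        return ds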

@martindurant (Contributor) left a review:
Well it seems pretty simple on a quick read!

Am I right in thinking that, in this model, the input to reference-maker is always a single file? It may be more efficient to pre-combine sets of files (if the combination logic is sound, of course). I would also argue, where practical, that a reference set for the whole aggregated dataset should always be created as a side product of copying to zarr: if a future recipe wants a new chunking, or needs to re-copy data which hasn't changed, reusing that artefact would save a lot of CPU time, and of course it would allow direct access to the original bytes.

(Be warned: opening many reference files with open_mfdataset can still use a lot of memory when it is set to check the coordinates or attributes of every constituent dataset; I don't suppose this is any different for virtual-zarr input than for any other. See the sketch below.)
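For what it's worth, the memory cost of those checks can be avoided with standard xarray arguments. A sketch, where per_file_references is a hypothetical list of per-input reference dicts and concat_dim="time" is assumed to match the test pattern above:

    import fsspec
    import xarray as xr

    datasets = [
        xr.open_zarr(
            fsspec.get_mapper("reference://", fo=refs, remote_protocol="s3"),
            consolidated=False,
        )
        for refs in per_file_references  # hypothetical: one reference dict per input
    ]
    # coords="minimal" / compat="override" skip the expensive per-dataset
    # coordinate and attribute comparisons
    ds = xr.combine_nested(
        datasets, concat_dim="time", coords="minimal", compat="override"
    )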

import numpy as np
import xarray as xr
import zarr

from ..chunk_grid import ChunkGrid
from ..patterns import CombineOp, DimIndex, FilePattern, Index
from ..reference import create_hdf5_reference, unstrip_protocol
@martindurant commented on this block:
Note that _unstrip_protocol is now available in fsspec.utils. A more thorough one for chained FSs would be good, but not necessary for the work here.
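For reference, that helper works roughly like this; it is a private utility, so the exact location and signature may change:

    import fsspec
    from fsspec.utils import _unstrip_protocol

    fs = fsspec.filesystem("s3", anon=True)
    _unstrip_protocol("bucket/path/file.nc", fs)  # -> "s3://bucket/path/file.nc"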

ref_data = create_hdf5_reference(fp, url, fname)

ref_fname = _input_reference_fname(input_key)
metadata_cache[ref_fname] = ref_data
@martindurant commented on this block:

Maybe this is a recipe detail, but it would be good to figure out where we can store these reference files for the future, perhaps together with a hash/uid of the file they were created from.
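One possible shape for that keying, sketched below; nothing like this exists in the PR, and the identity fields are an assumption:

    import hashlib
    import json

    # hypothetical: derive a stable cache key from the source file's identity
    def reference_cache_key(url: str, size: int, mtime: float) -> str:
        ident = json.dumps([url, size, mtime]).encode()
        return hashlib.sha256(ident).hexdigest() + ".json"

A content hash would be more robust, but requires reading the whole file; URL plus size and modification time is the usual cheap proxy.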

secrets=file_pattern.query_string_secrets,
**file_pattern.fsspec_open_kwargs,
) as f:
with dask.config.set(scheduler="single-threaded"): # make sure we don't use a scheduler
@martindurant commented on this block:
Why do we not want a scheduler? Because we are dealing with local files?

@martindurant (Contributor):

Is there anything I can do to help this PR?

@rabernat (Author):

> Is there anything I can do to help this PR?

Thanks for the ping, Martin! I too really want to move this forward. I guess I am stuck on this problem:

> fsspec-reference-maker is giving more metadata attributes than the h5netcdf library.

Your response:

> it seems these are attributes that h5netcdf swallows in order to apply netCDF conventions. The code could clean them out too, although they probably don't do any harm.

is helpful but not sufficient to move forward: we need to decide how to handle these extra attributes. Data providers care very much about this metadata, and dataset attributes often conform to specific conventions like CF. So the ideal behavior would be for the dataset produced by fsspec-reference-maker to be identical in every detail to the one read from the HDF5 file via h5netcdf.

I would like to see this addressed in fsspec-reference-maker itself, e.g. by replicating the h5netcdf logic. If that fix can be made, the tests here will pass without any special-casing.

@rabernat (Author):

Awesome work in fsspec/kerchunk#89!

Any chance you can make an fsspec-reference-maker release so we can update our dependencies here?

@martindurant (Contributor):

Released as 0.0.4.

@rabernat (Author):

We are only failing due to the memory-usage test from #220, so I'm going to merge.

Merging this pull request may close: Hanging during store_chunk loop