
Let XarrayZarrRecipe use fsspec references for opening netCDF inputs #218

Merged (3 commits) on Oct 15, 2021

Conversation

@rabernat (Contributor) commented on Oct 2, 2021:

Potentially closes #177 by allowing us to bypass h5py when reading netCDF inputs from cloud storage.

After #213 this is pretty simple.
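For context, the read path this enables looks roughly like the sketch below. This is illustrative, not the recipe's actual code: the URL is made up, and the names follow fsspec-reference-maker 0.0.x (SingleHdf5ToZarr) plus plain fsspec/xarray.

    import fsspec
    import xarray as xr
    from fsspec_reference_maker.hdf import SingleHdf5ToZarr

    url = "s3://some-bucket/data.nc"  # hypothetical input file

    # scan the HDF5 file once, extracting chunk byte-ranges and metadata
    with fsspec.open(url, mode="rb", anon=True) as f:
        references = SingleHdf5ToZarr(f, url).translate()

    # open the netCDF file as a virtual zarr store: reads now go through
    # fsspec byte-range requests instead of h5py
    mapper = fsspec.get_mapper(
        "reference://",
        fo=references,
        remote_protocol="s3",
        remote_options={"anon": True},
    )
    ds = xr.open_zarr(mapper, consolidated=False)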

The tests currently fail like this:

recipe-netcdf_local_file_pattern_sequential-netcdf_local_paths_sequential_2d-execute_recipe_manual] FAILED [  6%]

=================================================== FAILURES ===================================================
_ test_recipe_with_references[execute_recipe_no_prefect-netCDFtoZarr_recipe-netcdf_local_file_pattern_sequential-netcdf_local_paths_sequential_2d-execute_recipe_manual] _

recipe_fixture = (<class 'pangeo_forge_recipes.recipes.xarray_zarr.XarrayZarrRecipe'>, <FilePattern {'time': 5}>, {'input_cache': Cache...t 0x1801146a0>, root_path='/private/var/folders/n8/63q49ms55wxcj_gfbtykwp5r0000gn/T/pytest-of-rpa/pytest-45/target11'))
execute_recipe = <function execute_recipe_manual.<locals>.execute at 0x180e78ee0>

    @pytest.mark.parametrize("recipe_fixture", all_recipes)
    def test_recipe_with_references(recipe_fixture, execute_recipe):
        """Same as above, but use fsspec references for opening the files."""
    
        RecipeClass, file_pattern, kwargs, ds_expected, target = recipe_fixture
        rec = RecipeClass(file_pattern, open_input_with_fsspec_reference=True, **kwargs)
        execute_recipe(rec)
        ds_actual = xr.open_zarr(target.get_mapper()).load()
>       xr.testing.assert_identical(ds_actual, ds_expected)
E       AssertionError: Left and right Dataset objects are not identical
E       
E       Differing coordinates:
E       L * time     (time) datetime64[ns] 2010-01-01 2010-01-02 ... 2010-01-10
E           NAME: time
E           _Netcdf4Dimid: 0
E       R * time     (time) datetime64[ns] 2010-01-01 2010-01-02 ... 2010-01-10
E       L * lon      (lon) float64 5.0 15.0 25.0 35.0 45.0 ... 325.0 335.0 345.0 355.0
E           NAME: lon
E           _Netcdf4Dimid: 2
E           long_name: longitude
E           units: degrees_east
E       R * lon      (lon) float64 5.0 15.0 25.0 35.0 45.0 ... 325.0 335.0 345.0 355.0
E           units: degrees_east
E           long_name: longitude
E       L * lat      (lat) float64 5.0 15.0 25.0 35.0 45.0 ... 145.0 155.0 165.0 175.0
E           NAME: lat
E           _Netcdf4Dimid: 1
E           long_name: latitude
E           units: degrees_north
E       R * lat      (lat) float64 5.0 15.0 25.0 35.0 45.0 ... 145.0 155.0 165.0 175.0
E           units: degrees_north
E           long_name: latitude
E       Differing data variables:
E       L   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
E           _Netcdf4Coordinates: [0, 1, 2]
E           _Netcdf4Dimid: 0
E           long_name: Fantastic Foo
E       R   foo      (time, lat, lon) float64 0.417 0.7203 0.0001144 ... 0.1179 0.3748
E           long_name: Fantastic Foo
E       L   bar      (time, lat, lon) float64 9.0 4.0 3.0 2.0 2.0 ... 3.0 3.0 9.0 5.0
E           _Netcdf4Coordinates: [0, 1, 2]
E           _Netcdf4Dimid: 0
E           long_name: Beautiful Bar
E       R   bar      (time, lat, lon) int64 9 4 3 2 2 8 0 0 4 8 ... 9 8 4 4 6 0 3 3 9 5
E           long_name: Beautiful Bar
E       Attributes only on the left object:
E           _NCProperties: version=2,netcdf=4.8.0,hdf5=1.10.6

tests/recipe_tests/test_XarrayZarrRecipe.py:143: AssertionError

It looks like fsspec-reference-maker is giving more metadata attributes than the h5netcdf library.

@rabernat requested a review from @martindurant on Oct 2, 2021.
@martindurant (Contributor):

> It looks like fsspec-reference-maker is giving more metadata attributes than the h5netcdf library.

Specifically, h5py: it seems these are attributes that h5netcdf swallows in order to apply netCDF conventions. The code could clean them out too, although they probably don't do any harm.
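If the code did clean them out on the read side, it could look something like this sketch (the attribute names come from the test failure above; drop_hdf5_internal_attrs is a hypothetical helper, not part of this PR):

    import xarray as xr

    # attributes h5netcdf normally hides (names taken from the failure above)
    HIDDEN_ATTRS = {"NAME", "_Netcdf4Dimid", "_Netcdf4Coordinates", "_NCProperties"}

    def drop_hdf5_internal_attrs(ds: xr.Dataset) -> xr.Dataset:
        ds = ds.copy()
        ds.attrs = {k: v for k, v in ds.attrs.items() if k not in HIDDEN_ATTRS}
        for var in ds.variables.values():
            var.attrs = {k: v for k, v in var.attrs.items() if k not in HIDDEN_ATTRS}
        return ds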

@martindurant (Contributor) left a review:
Well it seems pretty simple on a quick read!

Am I right in thinking that, in this model, the input to reference-maker is always a single file? It may be more efficient to pre-combine sets of files (if the combination logic is sound, of course). I would also argue, where practical, that a reference set for the whole aggregated dataset should always be created as a side product of copying to zarr: if a future recipe wants a new chunking, or needs to re-copy data which hasn't changed, reusing that artefact would save a lot of CPU time, and of course it would allow direct access to the original bytes.

(Be warned: opening many reference files with open_mfdataset can still use a lot of memory when it is set to check the coordinates or attributes of every constituent dataset; I don't suppose this is any different for virtual-zarr input than for any other. See the sketch below.)
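For what it's worth, the memory cost of those checks can be avoided with standard xarray arguments. A sketch, where per_file_references is a hypothetical list of per-input reference dicts and concat_dim="time" is assumed to match the test pattern above:

    import fsspec
    import xarray as xr

    datasets = [
        xr.open_zarr(
            fsspec.get_mapper("reference://", fo=refs, remote_protocol="s3"),
            consolidated=False,
        )
        for refs in per_file_references  # hypothetical: one reference dict per input
    ]
    # coords="minimal" / compat="override" skip the expensive per-dataset
    # coordinate and attribute comparisons
    ds = xr.combine_nested(
        datasets, concat_dim="time", coords="minimal", compat="override"
    )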

import numpy as np
import xarray as xr
import zarr

from ..chunk_grid import ChunkGrid
from ..patterns import CombineOp, DimIndex, FilePattern, Index
from ..reference import create_hdf5_reference, unstrip_protocol
@martindurant commented on this block:
Note that _unstrip_protocol is now available in fsspec.utils. A more thorough one for chained FSs would be good, but not necessary for the work here.
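For reference, that helper works roughly like this; it is a private utility, so the exact location and signature may change:

    import fsspec
    from fsspec.utils import _unstrip_protocol

    fs = fsspec.filesystem("s3", anon=True)
    _unstrip_protocol("bucket/path/file.nc", fs)  # -> "s3://bucket/path/file.nc"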

ref_data = create_hdf5_reference(fp, url, fname)

ref_fname = _input_reference_fname(input_key)
metadata_cache[ref_fname] = ref_data
@martindurant commented on this block:

Maybe this is a recipe detail, but it would be good to figure out where we can store these reference files for the future, perhaps together with a hash/uid of the file they were created from.
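One possible shape for that keying, sketched below; nothing like this exists in the PR, and the identity fields are an assumption:

    import hashlib
    import json

    # hypothetical: derive a stable cache key from the source file's identity
    def reference_cache_key(url: str, size: int, mtime: float) -> str:
        ident = json.dumps([url, size, mtime]).encode()
        return hashlib.sha256(ident).hexdigest() + ".json"

A content hash would be more robust, but requires reading the whole file; URL plus size and modification time is the usual cheap proxy.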

secrets=file_pattern.query_string_secrets,
**file_pattern.fsspec_open_kwargs,
) as f:
with dask.config.set(scheduler="single-threaded"): # make sure we don't use a scheduler
@martindurant commented on this block:
Why do we not want a scheduler? Because we are dealing with local files?

@martindurant (Contributor):

Is there anything I can do to help this PR?

@rabernat (Author):

> Is there anything I can do to help this PR?

Thanks for the ping, Martin! I too really want to move this forward. I guess I am stuck on this problem:

> fsspec-reference-maker is giving more metadata attributes than the h5netcdf library.

Your response:

> it seems these are attributes that h5netcdf swallows in order to apply netCDF conventions. The code could clean them out too, although they probably don't do any harm.

is helpful but not sufficient to move forward: we need to decide how to handle these extra attributes. Data providers care very much about this metadata, and dataset attributes often conform to specific conventions like CF. So the ideal behavior would be for the dataset produced by fsspec-reference-maker to be identical in every detail to the one read from the HDF5 file via h5netcdf.

I would like to see this addressed in fsspec-reference-maker itself, e.g. by replicating the h5netcdf logic. If that fix can be made, the tests here will pass without any special-casing.

@rabernat (Author):

Awesome work in fsspec/kerchunk#89!

Any chance you can make an fsspec-reference-maker release so we can update our dependencies here?

@martindurant (Contributor):

Released as 0.0.4.

@rabernat (Author):

We are only failing due to the memory-usage test from #220, so I'm going to merge.

Merging this pull request may close: Hanging during store_chunk loop