
@eshort0401

Addresses issue #8749 by implementing default dimensions when reading zarr stores with missing metadata. With this PR, if dimension names are missing, xarray will try to build a Dataset from a zarr store using default dimension names: dim_0, dim_1, etc. Note we can only use default dimensions if every variable in the store has a consistent shape, as discussed by @TomNicholas and @etienneschalk in #8749.
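The "consistent shape" requirement can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual code in the PR; `assign_default_dims` is a hypothetical helper that assigns default names positionally and rejects stores where two variables disagree on the size of the same axis.

```python
def assign_default_dims(shapes):
    """Assign default dimension names (dim_0, dim_1, ...) positionally.

    `shapes` maps variable name -> shape tuple. Raises ValueError if two
    variables disagree on the size of the same axis index, since then a
    single shared set of default dimensions cannot be built.
    """
    sizes = {}  # default dim name -> inferred size
    for name, shape in shapes.items():
        for axis, size in enumerate(shape):
            dim = f"dim_{axis}"
            if sizes.setdefault(dim, size) != size:
                raise ValueError(
                    f"variable {name!r} has size {size} along axis {axis}, "
                    f"but {dim} was already inferred to have size {sizes[dim]}"
                )
    return {
        name: tuple(f"dim_{i}" for i in range(len(shape)))
        for name, shape in shapes.items()
    }
```

For the example below, `assign_default_dims({"a": (3, 18), "b": (3,)})` succeeds because `b`'s single axis matches `a`'s first axis, so `b` can share dim_0.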

Motivating Example

Extending @etienneschalk's example to both the zarr 2 and zarr 3 specifications, consider

import xarray as xr
import numpy as np
import json
from pathlib import Path
import shutil
import glob

# Create example dataset
da_a = xr.DataArray(np.arange(3 * 18).reshape(3, 18), dims=["label", "z"])
da_b = xr.DataArray([1, 2, 3], dims="label")
ds = xr.Dataset({"a": da_a, "b": da_b})
print(f"Original Dataset\n----------------\n{ds}\n")

# Save to zarr
ds_path = "./ds.zarr"
kwargs = {"consolidated": True, "zarr_format": 3} # Change these to check other cases
ds.to_zarr(ds_path, mode="w", **kwargs)

# Now simulate loading stored zarr without dimension name metadata

# Create functions for stripping dimension metadata from stored zarr
def strip_zarr_3(ds_path, stripped_ds_path):
    """Create a copy of a zarr 3 with dimension_names metadata removed."""    
    shutil.rmtree(stripped_ds_path, ignore_errors=True)
    shutil.copytree(ds_path, stripped_ds_path, dirs_exist_ok=True)
    # Get all the zarr.json metadata files. 
    metadata_files = glob.glob(f"{stripped_ds_path}/**/zarr.json", recursive=True)
    # Iterate through and remove all "dimension_names" entries
    for file in metadata_files:
        with open(file, "r") as f:
            metadata = json.load(f)
        metadata.pop("dimension_names", None)
        con_metadata = metadata.get("consolidated_metadata", None)
        if con_metadata:
            for k in con_metadata["metadata"].keys():
                con_metadata["metadata"][k].pop("dimension_names", None)
                
        with open(file, "w") as f:
            json.dump(metadata, f, indent=2)

def strip_zarr_2(ds_path, stripped_ds_path):
    """Create a copy of a zarr 2 with _ARRAY_DIMENSIONS metadata removed."""    
    # Get all the .zattrs metadata files. 
    # Note .zattrs are optional in zarr 2 
    # https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html#attributes
    shutil.rmtree(stripped_ds_path, ignore_errors=True)
    shutil.copytree(ds_path, stripped_ds_path, dirs_exist_ok=True)
    zattrs_files = glob.glob(f"{stripped_ds_path}/**/.zattrs", recursive=True)
    # Iterate through and remove all "_ARRAY_DIMENSIONS" entries
    for file in zattrs_files:
        with open(file, "r") as f:
            metadata = json.load(f)
        metadata.pop("_ARRAY_DIMENSIONS", None)
        with open(file, "w") as f:
            json.dump(metadata, f, indent=2)
    zmetadata_file = Path(stripped_ds_path) / ".zmetadata"
    if zmetadata_file.exists():
        with open(zmetadata_file, "r") as f:
            metadata = json.load(f)
        for k in metadata["metadata"].keys():
            metadata["metadata"][k].pop("_ARRAY_DIMENSIONS", None)
        with open(zmetadata_file, "w") as f:
            json.dump(metadata, f, indent=2)

# Strip dimension name metadata from the stored zarr
stripped_ds_path = "./stripped_ds.zarr"
if kwargs["zarr_format"] == 3:
    strip_zarr_3(ds_path, stripped_ds_path)
else:
    strip_zarr_2(ds_path, stripped_ds_path)

# Now load the stripped zarr; default dimension names are created. 
loaded_ds = xr.open_zarr(stripped_ds_path, **kwargs).compute()
print(f"Stripped Dataset\n----------------\n{loaded_ds}\n")

With this PR, the code above no longer raises an error, but instead prints

Original Dataset
----------------
<xarray.Dataset> Size: 456B
Dimensions:  (label: 3, z: 18)
Dimensions without coordinates: label, z
Data variables:
    a        (label, z) int64 432B 0 1 2 3 4 5 6 7 8 ... 46 47 48 49 50 51 52 53
    b        (label) int64 24B 1 2 3

Stripped Dataset
----------------
<xarray.Dataset> Size: 456B
Dimensions:  (dim_0: 3, dim_1: 18)
Dimensions without coordinates: dim_0, dim_1
Data variables:
    a        (dim_0, dim_1) int64 432B 0 1 2 3 4 5 6 7 ... 47 48 49 50 51 52 53
    b        (dim_0) int64 24B 1 2 3

General Notes

It appears xarray considers at least three zarr conventions.

  1. xarray-flavoured zarr 2, with the optional .zattrs (https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html#attributes) used to store dimension names under _ARRAY_DIMENSIONS.
  2. zarr 3, with dimension names stored in the optional dimension_names metadata attribute (https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#dimension-names).
  3. NetCDF zarr (https://docs.unidata.ucar.edu/netcdf/NUG/nczarr_head.html), which stores the dimension names in dim_refs.

The _get_zarr_dims_and_attrs function tries to get the dimension names by checking all three of these conventions. Perhaps the convention should be handled more explicitly somehow?
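The lookup order across the three conventions could be sketched like this. This is an illustrative sketch, not the actual `_get_zarr_dims_and_attrs` implementation; the metadata key names for the NCZarr case (`_NCZARR_ARRAY` holding `dimrefs`/`dim_refs`) are assumptions based on the notes above, and `array_metadata` is assumed to be the array's merged metadata/attributes dict.

```python
def infer_dimension_names(array_metadata, ndim):
    """Sketch: look up dimension names under each convention in turn,
    falling back to default names when none is found."""
    # 1. zarr 3: top-level "dimension_names" entry in zarr.json
    names = array_metadata.get("dimension_names")
    if names is not None:
        return list(names)
    # 2. xarray-flavoured zarr 2: "_ARRAY_DIMENSIONS" in .zattrs
    names = array_metadata.get("_ARRAY_DIMENSIONS")
    if names is not None:
        return list(names)
    # 3. NCZarr: dimension references inside the "_NCZARR_ARRAY" attribute
    #    (key names assumed here; NCZarr stores dims as paths like "/label")
    nczarr = array_metadata.get("_NCZARR_ARRAY", {})
    refs = nczarr.get("dimrefs") or nczarr.get("dim_refs")
    if refs is not None:
        return [ref.split("/")[-1] for ref in refs]
    # Nothing found: fall back to default names
    return [f"dim_{i}" for i in range(ndim)]
```

Making the convention explicit (e.g. detecting it once per store rather than per array) might be one way to handle this more cleanly.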

@github-actions bot added the io, topic-backends, topic-zarr (Related to zarr storage library), and topic-NamedArray (Lightweight version of Variable) labels on Dec 12, 2025
@TomNicholas
Member

Thanks for having a look at this! See my comment here #8749 (comment).

The _get_zarr_dims_and_attrs function tries to get the dimension names by checking all three of these conventions. Perhaps the convention should be handled more explicitly somehow?

This behaviour should be described in https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html, and if it's not we should improve that docs page.

@eshort0401
Author

Thanks heaps for the review @TomNicholas!

From #8749 (comment)

I have changed my mind - I don't think that trying to auto-infer some default dimension names makes sense for Zarr.

After thinking about this more I agree with you. I had interpreted @etienneschalk's example (#8749 (comment))

xr.Dataset({"xda_1": xr.DataArray([1]), "xda_2": xr.DataArray([2])})

as suggesting that Dataset infers dimension names, but of course it doesn't!
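(For the record, the default names in that example come from the DataArray constructor, which fills in dim_0, dim_1, ... when `dims` is omitted; Dataset just reuses them:)

```python
import xarray as xr

# When `dims` is omitted, DataArray (not Dataset) supplies default names
da = xr.DataArray([1])
print(da.dims)  # ("dim_0",)
```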

From #8749 (comment)

So actually I think the only correct thing to do here is either

  1. raise an error. We can improve the error message (a PR for that would be welcome), but we shouldn't be trying to auto-infer names for the dimensions.

and from above

This behaviour should be described in https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html, and if it's not we should improve that docs page.

Ok I'll close this PR and have another look at the error messages and the https://docs.xarray.dev/en/stable/internals/zarr-encoding-spec.html page as you suggest! Thanks again for your review, and your patience as I learn the depths of zarr and xarray!

@eshort0401 closed this Dec 13, 2025
@TomNicholas
Member

Thanks! I'm glad you agree.
