Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to write to sharded Zarr V3 because shards not in valid_encodings #9947

Open
jacobbieker opened this issue Jan 13, 2025 · 1 comment · May be fixed by #9948
Open

Unable to write to sharded Zarr V3 because shards not in valid_encodings #9947

jacobbieker opened this issue Jan 13, 2025 · 1 comment · May be fixed by #9948
Labels
bug topic-zarr Related to zarr storage library

Comments

@jacobbieker
Copy link

jacobbieker commented Jan 13, 2025

What is your issue?

Hi,

I'm trying to work with creating sharded Zarr V3 datasets with Xarray. Now that Zarr-Python V3 is out, and Xarray supports it, I've tried making a few. I've previously brought this up here in the discussions. There seems to be an issue though. With the latest version of Xarray and Zarr, the below fails to write to Zarr with the error ValueError: unexpected encoding parameters for zarr backend: ['shards'].

import xarray as xr
import numpy as np

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.random.rand(16, 16)),
        "b": (("x", "y"), np.random.rand(16, 16)),
    },
    coords={"x": list(range(16)), "y": list(range(16))},
)
ds = ds.chunk({"x": 4, "y": 4})
print(ds)
encoding = {
        v: {"shards": (16, 16)}
        for v in ds.data_vars}
print(encoding)
ds.to_zarr("test.zarr", mode="w", encoding=encoding, zarr_format=3, compute=True)
ds = xr.open_zarr("test.zarr")
print(ds)

There does seem to be a fairly straightforward fix though, of adding shards to here:

valid_encodings = {
"chunks",
"compressor", # TODO: delete when min zarr >=3
"compressors",
"filters",
"serializer",
"cache_metadata",
"write_empty_chunks",
}

Once that is added, the above code snippet writes the sharded Zarr successfully, and can be read successfully back.

Digging into it a bit more, even with the change in #9948, the below will fail on checksum when writing with larger shard and chunk sizes.

import xarray as xr
import numpy as np
import zarr

shard_size = 1000

ds = xr.Dataset(
    {
        "a": (("time", "x", "y"), np.random.rand(10, shard_size, shard_size)),
        "b": (("time", "x", "y"), np.random.rand(10, shard_size, shard_size)),
    },
    coords={"time": list(range(10)), "x": list(range(shard_size)), "y": list(range(shard_size))},
)
ds = ds.chunk({"time": 1, "x": 100, "y": 100})
encoding = {
        v: {"compressors": zarr.codecs.BloscCodec(cname='zstd', clevel=9, shuffle=zarr.codecs.BloscShuffle.bitshuffle), "shards": (1, shard_size, shard_size)}
        for v in ds.data_vars}
print(ds)
ds.to_zarr("test2.zarr", mode="w", encoding=encoding, zarr_format=3, compute=True)
ds = xr.open_zarr("test2.zarr")
print(ds)

Traceback:

Traceback (most recent call last):
  File "/Users/jacob/Development/planetary-datasets/min_shard.py", line 48, in <module>
    ds.to_zarr("test2.zarr", mode="w", encoding=encoding, zarr_format=3, compute=True)
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/xarray/core/dataset.py", line 2622, in to_zarr
    return to_zarr(  # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/xarray/backends/api.py", line 2217, in to_zarr
    writes = writer.sync(
             ^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/xarray/backends/common.py", line 358, in sync
    delayed_store = chunkmanager.store(
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/xarray/namedarray/daskmanager.py", line 247, in store
    return store(
           ^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/array/core.py", line 1282, in store
    compute_as_if_collection(Array, store_dsk, map_keys, **kwargs)
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/base.py", line 397, in compute_as_if_collection
    return schedule(dsk2, keys, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/threaded.py", line 91, in get
    results = get_async(
              ^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/local.py", line 516, in get_async
    raise_exception(exc, tb)
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/local.py", line 324, in reraise
    raise exc
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/local.py", line 229, in execute_task
    result = task(data)
             ^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/_task_spec.py", line 651, in __call__
    return self.func(*new_argspec)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/array/core.py", line 4575, in store_chunk
    return load_store_chunk(x, out, index, lock, return_stored, False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/dask/array/core.py", line 4557, in load_store_chunk
    out[index] = x
    ~~~^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/array.py", line 2464, in __setitem__
    self.set_orthogonal_selection(pure_selection, value, fields=fields)
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/_compat.py", line 43, in inner_f
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/array.py", line 2920, in set_orthogonal_selection
    return sync(
           ^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/sync.py", line 142, in sync
    raise return_result
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/array.py", line 1354, in _set_selection
    await self.codec_pipeline.write(
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 468, in write
    await concurrent_map(
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/common.py", line 68, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/common.py", line 66, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 344, in write_batch
    await self.encode_partial_batch(
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 229, in encode_partial_batch
    await self.array_bytes_codec.encode_partial(batch_info)
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/abc/codec.py", line 235, in encode_partial
    await concurrent_map(
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/common.py", line 68, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/common.py", line 66, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/codecs/sharding.py", line 582, in _encode_partial_single
    await self._load_full_shard_maybe(
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/codecs/sharding.py", line 730, in _load_full_shard_maybe
    await _ShardReader.from_bytes(shard_bytes, self, chunks_per_shard)
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/codecs/sharding.py", line 194, in from_bytes
    obj.index = await codec._decode_shard_index(shard_index_bytes, chunks_per_shard)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/codecs/sharding.py", line 632, in _decode_shard_index
    await get_pipeline_class()
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 435, in decode
    output.extend(await self.decode_batch(batch_info))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/codec_pipeline.py", line 172, in decode_batch
    chunk_bytes_batch = await bb_codec.decode(
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/abc/codec.py", line 129, in decode
    return await _batching_helper(self._decode_single, chunks_and_specs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/abc/codec.py", line 407, in _batching_helper
    return await concurrent_map(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/common.py", line 68, in concurrent_map
    return await asyncio.gather(*[asyncio.ensure_future(run(item)) for item in items])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/core/common.py", line 66, in run
    return await func(*item)
           ^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/abc/codec.py", line 420, in wrap
    return await func(chunk, chunk_spec)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacob/.conda/envs/pc/lib/python3.12/site-packages/zarr/codecs/crc32c_.py", line 46, in _decode_single
    raise ValueError(
ValueError: Stored and computed checksum do not match. Stored: b'\x7f\x7f(o'. Computed: b'\xb0\xec\xbe6'.
@jacobbieker jacobbieker added the needs triage Issue that has not been reviewed by xarray team member label Jan 13, 2025
Copy link

welcome bot commented Jan 13, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas TomNicholas added bug topic-zarr Related to zarr storage library and removed needs triage Issue that has not been reviewed by xarray team member labels Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants