This repository was archived by the owner on Nov 12, 2025. It is now read-only.
Understanding index for shards #6
Closed
Description
Thanks for building zarrita! We've been using zarrita to write sharded V3 data as a test for new parsing capabilities for V3 data in zarr-js (cc @freeman-lab). However, we're having some difficulties understanding the structure of the index in the output from zarrita.
As an example, here is a sharded dataset:
import zarrita
import numpy as np
store = zarrita.LocalStore('testdata')
data = np.arange(0, 128 * 128, dtype='int32').reshape((128, 128))
shard_shape = (64,64)
chunk_shape=(32,32)
a = await zarrita.Array.create_async(
    store / 'zarrita-sharding',
    shape=data.shape,
    dtype='int32',
    chunk_shape=shard_shape,
    chunk_key_encoding=('v2', '.'),
    codecs=[
        zarrita.codecs.sharding_codec(
            chunk_shape=chunk_shape,
            codecs=[zarrita.codecs.blosc_codec(typesize=data.dtype.itemsize)]
        ),
    ],
)
a[:, :] = data
Based on https://zarr.dev/zeps/draft/ZEP0002.html#binary-shard-format, we'd expect the index to be the last 64 bytes of the file. However, reading those bytes gives unexpected results. Are you able to offer guidance for interpreting the shard index?
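For concreteness, here is a minimal sketch of the index layout ZEP0002 describes, exercised on a synthetic shard (the chunk payloads and sizes below are made up for illustration): each index entry is an (offset, nbytes) pair of little-endian uint64s, with all-ones marking an absent chunk.

```python
import struct

# Build a synthetic shard: two chunk payloads followed by a 4-entry index
# of little-endian uint64 (offset, nbytes) pairs, per ZEP0002.
chunk0 = b"\x01" * 100
chunk1 = b"\x02" * 50
EMPTY = 2**64 - 1  # all-ones marker for a chunk that is not present
index = struct.pack(
    "<8Q",
    0, len(chunk0),            # chunk 0: offset, nbytes
    len(chunk0), len(chunk1),  # chunk 1
    EMPTY, EMPTY,              # chunk 2: absent
    EMPTY, EMPTY,              # chunk 3: absent
)
shard = chunk0 + chunk1 + index

# Parse: the index is the trailing 16 * nchunks_per_shard bytes.
nchunks_per_shard = 4
raw = shard[-16 * nchunks_per_shard:]
entries = [struct.unpack_from("<QQ", raw, i * 16) for i in range(nchunks_per_shard)]
print(entries)  # four (offset, nbytes) tuples; the last two are (EMPTY, EMPTY)
```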
import os
from functools import reduce
shard = 'testdata/zarrita-sharding/0.0'
nbytes = os.path.getsize(shard)
nchunks_per_shard = int(reduce(lambda x, y: x*y, list(shard_shape)) / reduce(lambda x, y: x*y, list(chunk_shape)))
header_bytes=int(nchunks_per_shard)*16
remainder = nbytes - header_bytes
chunks = {}
with open(shard, 'rb') as f:
    _ = f.read(remainder)
    for ind in range(nchunks_per_shard):
        chunks[ind] = {}
        chunks[ind]['offset'] = int.from_bytes(f.read(8), byteorder='little')
        chunks[ind]['nbytes'] = int.from_bytes(f.read(8), byteorder='little')
print(chunks)
which prints:
{0: {'offset': 8499740278784, 'nbytes': 8499740278784}, 1: {'offset': 8972186681344, 'nbytes': 17471926960128}, 2: {'offset': 8650064134144, 'nbytes': 26121991094272}, 3: {'offset': 9174050144256, 'nbytes': 13244620951116578816}}
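Those values clearly cannot be byte offsets into a shard file this small. A quick heuristic check makes that concrete (`index_looks_valid` and the 5 KB file size are hypothetical stand-ins for illustration, not part of zarrita):

```python
EMPTY = 2**64 - 1  # all-ones marker for an absent chunk

def index_looks_valid(entries, file_size):
    # Heuristic: every non-empty (offset, nbytes) pair must lie within the file.
    return all(off == EMPTY or off + n <= file_size for off, n in entries)

# (offset, nbytes) pairs parsed from the zarrita shard above; the real file
# is on the order of kilobytes, nowhere near these terabyte-scale offsets.
entries = [
    (8499740278784, 8499740278784),
    (8972186681344, 17471926960128),
    (8650064134144, 26121991094272),
    (9174050144256, 13244620951116578816),
]
print(index_looks_valid(entries, file_size=5000))  # False
```

That the check fails for every entry suggests the trailing 64 bytes are not being interpreted at the right position or in the right layout.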
I've copied the equivalent code using Zarr-Python below in case it's helpful for understanding the issue.
import os
from functools import reduce
os.environ["ZARR_V3_EXPERIMENTAL_API"] = "1"
os.environ["ZARR_V3_SHARDING"] = "1"
import numpy as np
import zarr
from zarr._storage.v3 import DirectoryStoreV3
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer
shape=(128,128)
data = np.arange(0, 128 * 128, dtype='int32').reshape(shape)
chunk_shape=(32,32)
chunks_per_shard=(2,2)
path = "testdata/zarr-python"
sharded_store = DirectoryStoreV3(path)
sharding_transformer = ShardingStorageTransformer("indexed", chunks_per_shard=chunks_per_shard)
arr_sh = zarr.create(
    shape,
    chunks=chunk_shape,
    dtype=data.dtype,
    store=sharded_store,
    storage_transformers=[sharding_transformer],
)
arr_sh[:] = data
shard = "testdata/zarr-python/data/root/c1/1"
nbytes = os.path.getsize(shard)
header_bytes=reduce(lambda x, y: x*y, list(chunks_per_shard))*16
remainder = nbytes - header_bytes
nchunks_per_shard = reduce(lambda x, y: x*y, list(chunks_per_shard))
chunks = {}
with open(shard, 'rb') as f:
    _ = f.read(remainder)
    for ind in range(nchunks_per_shard):
        chunks[ind] = {}
        chunks[ind]['offset'] = int.from_bytes(f.read(8), byteorder='little')
        chunks[ind]['nbytes'] = int.from_bytes(f.read(8), byteorder='little')
print(chunks)
which prints:
{0: {'offset': 1068, 'nbytes': 356}, 1: {'offset': 356, 'nbytes': 356}, 2: {'offset': 712, 'nbytes': 356}, 3: {'offset': 0, 'nbytes': 356}}
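These entries look plausible, and a small cross-check (a verification sketch, not from the original code) confirms they tile the chunk region contiguously from byte 0, with the 64-byte index following:

```python
# (offset, nbytes) pairs printed by the zarr-python example above.
entries = {0: (1068, 356), 1: (356, 356), 2: (712, 356), 3: (0, 356)}

# Sorted by offset, the chunks should be back-to-back starting at byte 0.
spans = sorted(entries.values())
pos = 0
for off, n in spans:
    assert off == pos, "gap or overlap between chunks"
    pos += n
print(pos)  # 1424 bytes of chunk data, then the 64-byte index
```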