This repository was archived by the owner on Nov 12, 2025. It is now read-only.

Understanding index for shards #6

@maxrjones

Description


Thanks for building zarrita! We've been using zarrita to write sharded V3 data as a test of new V3 parsing capabilities in zarr-js (cc @freeman-lab). However, we're having some difficulty understanding the structure of the index in the output from zarrita.

As an example, here is a sharded dataset:

import zarrita
import numpy as np

store = zarrita.LocalStore('testdata')

data = np.arange(0, 128 * 128, dtype='int32').reshape((128, 128))

shard_shape = (64, 64)
chunk_shape = (32, 32)

a = await zarrita.Array.create_async(
    store / 'zarrita-sharding',
    shape=data.shape,
    dtype='int32',
    chunk_shape=shard_shape,
    chunk_key_encoding=('v2', '.'),
    codecs=[
        zarrita.codecs.sharding_codec(
            chunk_shape=chunk_shape,
            codecs=[zarrita.codecs.blosc_codec(typesize=data.dtype.itemsize)]
        ),
    ],
)

a[:, :] = data

Based on https://zarr.dev/zeps/draft/ZEP0002.html#binary-shard-format, we'd expect the index to be the last 64 bytes of the file. However, reading those bytes gives unexpected results. Are you able to offer guidance for interpreting the shard index?
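For reference, here is the arithmetic behind the 64-byte figure (this is just what the shapes above imply under the ZEP0002 draft layout, not anything read from zarrita's output):

```python
import math

shard_shape = (64, 64)
chunk_shape = (32, 32)

# Each shard holds (64*64) / (32*32) = 4 chunks, and each index entry is
# 16 bytes (one uint64 offset plus one uint64 length).
nchunks_per_shard = math.prod(shard_shape) // math.prod(chunk_shape)
index_size = nchunks_per_shard * 16
print(nchunks_per_shard, index_size)  # 4 64
```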

import os
from functools import reduce

shard = 'testdata/zarrita-sharding/0.0'
nbytes = os.path.getsize(shard)
# number of chunks per shard = product(shard_shape) / product(chunk_shape)
nchunks_per_shard = reduce(lambda x, y: x * y, shard_shape) // reduce(lambda x, y: x * y, chunk_shape)
header_bytes = nchunks_per_shard * 16  # one uint64 offset + one uint64 length per chunk
remainder = nbytes - header_bytes

chunks = {}
with open(shard, 'rb') as f:
    f.seek(remainder)  # skip the chunk payloads; the index should be the trailing bytes
    for ind in range(nchunks_per_shard):
        chunks[ind] = {}
        chunks[ind]['offset'] = int.from_bytes(f.read(8), byteorder='little')
        chunks[ind]['nbytes'] = int.from_bytes(f.read(8), byteorder='little')
        
print(chunks)
{0: {'offset': 8499740278784, 'nbytes': 8499740278784}, 1: {'offset': 8972186681344, 'nbytes': 17471926960128}, 2: {'offset': 8650064134144, 'nbytes': 26121991094272}, 3: {'offset': 9174050144256, 'nbytes': 13244620951116578816}}

I've copied the equivalent code using Zarr-Python below in case it's helpful for understanding the issue.

import os
from functools import reduce


os.environ["ZARR_V3_EXPERIMENTAL_API"] = "1"
os.environ["ZARR_V3_SHARDING"] = "1"

import numpy as np
import zarr
from zarr._storage.v3 import DirectoryStoreV3
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer

shape = (128, 128)
data = np.arange(0, 128 * 128, dtype='int32').reshape(shape)
chunk_shape = (32, 32)
chunks_per_shard = (2, 2)

path = "testdata/zarr-python"
sharded_store = DirectoryStoreV3(path)
sharding_transformer = ShardingStorageTransformer("indexed", chunks_per_shard=chunks_per_shard)

arr_sh = zarr.create(
    shape,
    chunks=chunk_shape,
    dtype=data.dtype,
    store=sharded_store,
    storage_transformers=[sharding_transformer],
)

arr_sh[:] = data

shard = "testdata/zarr-python/data/root/c1/1"
nbytes = os.path.getsize(shard)
nchunks_per_shard = reduce(lambda x, y: x * y, chunks_per_shard)
header_bytes = nchunks_per_shard * 16
remainder = nbytes - header_bytes

chunks = {}
with open(shard, 'rb') as f:
    f.seek(remainder)  # the index is the trailing 64 bytes
    for ind in range(nchunks_per_shard):
        chunks[ind] = {}
        chunks[ind]['offset'] = int.from_bytes(f.read(8), byteorder='little')
        chunks[ind]['nbytes'] = int.from_bytes(f.read(8), byteorder='little')
        
print(chunks)
{0: {'offset': 1068, 'nbytes': 356}, 1: {'offset': 356, 'nbytes': 356}, 2: {'offset': 712, 'nbytes': 356}, 3: {'offset': 0, 'nbytes': 356}}
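Unlike the zarrita output, this index decodes to four contiguous 356-byte chunks. A quick consistency check on the printed values (the dict below is copied verbatim from the output above):

```python
chunks = {
    0: {'offset': 1068, 'nbytes': 356},
    1: {'offset': 356, 'nbytes': 356},
    2: {'offset': 712, 'nbytes': 356},
    3: {'offset': 0, 'nbytes': 356},
}

# Sorted by offset, each chunk should start where the previous one ends,
# i.e. the payload region before the index is contiguous.
entries = sorted(chunks.values(), key=lambda c: c['offset'])
for prev, cur in zip(entries, entries[1:]):
    assert prev['offset'] + prev['nbytes'] == cur['offset']

payload_size = entries[-1]['offset'] + entries[-1]['nbytes']
print(payload_size)  # 1424 bytes of chunk data ahead of the 64-byte index
```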
