This repository was archived by the owner on Nov 12, 2025. It is now read-only.

Understanding index for shards #6

@maxrjones

Description


Thanks for building zarrita! We've been using zarrita to write sharded V3 data as a test of new V3 parsing capabilities in zarr-js (cc @freeman-lab). However, we're having some difficulty understanding the structure of the index in the output from zarrita.

As an example, here is a sharded dataset:

import zarrita
import numpy as np

store = zarrita.LocalStore('testdata')

data = np.arange(0, 128 * 128, dtype='int32').reshape((128, 128))

shard_shape = (64, 64)
chunk_shape = (32, 32)

a = await zarrita.Array.create_async(
    store / 'zarrita-sharding',
    shape=data.shape,
    dtype='int32',
    chunk_shape=shard_shape,
    chunk_key_encoding=('v2', '.'),
    codecs=[
        zarrita.codecs.sharding_codec(
            chunk_shape=chunk_shape,
            codecs=[zarrita.codecs.blosc_codec(typesize=data.dtype.itemsize)]
        ),
    ],
)

a[:, :] = data

Based on https://zarr.dev/zeps/draft/ZEP0002.html#binary-shard-format, we'd expect the index to be the last 64 bytes of the file. However, reading those bytes gives unexpected results. Are you able to offer guidance for interpreting the shard index?
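For reference, here is the arithmetic behind the 64-byte figure (this is just what the shapes above imply under the ZEP0002 draft layout, not anything read from zarrita's output):

```python
import math

shard_shape = (64, 64)
chunk_shape = (32, 32)

# Each shard holds (64*64) / (32*32) = 4 chunks, and each index entry is
# 16 bytes (one uint64 offset plus one uint64 length).
nchunks_per_shard = math.prod(shard_shape) // math.prod(chunk_shape)
index_size = nchunks_per_shard * 16
print(nchunks_per_shard, index_size)  # 4 64
```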

import os
from functools import reduce

shard = 'testdata/zarrita-sharding/0.0'
nbytes = os.path.getsize(shard)
# number of chunks per shard = product(shard_shape) / product(chunk_shape)
nchunks_per_shard = reduce(lambda x, y: x * y, shard_shape) // reduce(lambda x, y: x * y, chunk_shape)
header_bytes = nchunks_per_shard * 16  # one uint64 offset + one uint64 length per chunk
remainder = nbytes - header_bytes

chunks = {}
with open(shard, 'rb') as f:
    f.seek(remainder)  # skip the chunk payloads; the index should be the trailing bytes
    for ind in range(nchunks_per_shard):
        chunks[ind] = {}
        chunks[ind]['offset'] = int.from_bytes(f.read(8), byteorder='little')
        chunks[ind]['nbytes'] = int.from_bytes(f.read(8), byteorder='little')
        
print(chunks)
{0: {'offset': 8499740278784, 'nbytes': 8499740278784}, 1: {'offset': 8972186681344, 'nbytes': 17471926960128}, 2: {'offset': 8650064134144, 'nbytes': 26121991094272}, 3: {'offset': 9174050144256, 'nbytes': 13244620951116578816}}

I've copied the equivalent code using Zarr-Python below in case it's helpful for understanding the issue.

import os
from functools import reduce


os.environ["ZARR_V3_EXPERIMENTAL_API"] = "1"
os.environ["ZARR_V3_SHARDING"] = "1"

import numpy as np
import zarr
from zarr._storage.v3 import DirectoryStoreV3
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer

shape = (128, 128)
data = np.arange(0, 128 * 128, dtype='int32').reshape(shape)
chunk_shape = (32, 32)
chunks_per_shard = (2, 2)

path = "testdata/zarr-python"
sharded_store = DirectoryStoreV3(path)
sharding_transformer = ShardingStorageTransformer("indexed", chunks_per_shard=chunks_per_shard)

arr_sh = zarr.create(
    shape,
    chunks=chunk_shape,
    dtype=data.dtype,
    store=sharded_store,
    storage_transformers=[sharding_transformer],
)

arr_sh[:] = data

shard = "testdata/zarr-python/data/root/c1/1"
nbytes = os.path.getsize(shard)
nchunks_per_shard = reduce(lambda x, y: x * y, chunks_per_shard)
header_bytes = nchunks_per_shard * 16
remainder = nbytes - header_bytes

chunks = {}
with open(shard, 'rb') as f:
    f.seek(remainder)  # the index is the trailing 64 bytes
    for ind in range(nchunks_per_shard):
        chunks[ind] = {}
        chunks[ind]['offset'] = int.from_bytes(f.read(8), byteorder='little')
        chunks[ind]['nbytes'] = int.from_bytes(f.read(8), byteorder='little')
        
print(chunks)
{0: {'offset': 1068, 'nbytes': 356}, 1: {'offset': 356, 'nbytes': 356}, 2: {'offset': 712, 'nbytes': 356}, 3: {'offset': 0, 'nbytes': 356}}
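Unlike the zarrita output, this index decodes to four contiguous 356-byte chunks. A quick consistency check on the printed values (the dict below is copied verbatim from the output above):

```python
chunks = {
    0: {'offset': 1068, 'nbytes': 356},
    1: {'offset': 356, 'nbytes': 356},
    2: {'offset': 712, 'nbytes': 356},
    3: {'offset': 0, 'nbytes': 356},
}

# Sorted by offset, each chunk should start where the previous one ends,
# i.e. the payload region before the index is contiguous.
entries = sorted(chunks.values(), key=lambda c: c['offset'])
for prev, cur in zip(entries, entries[1:]):
    assert prev['offset'] + prev['nbytes'] == cur['offset']

payload_size = entries[-1]['offset'] + entries[-1]['nbytes']
print(payload_size)  # 1424 bytes of chunk data ahead of the 64-byte index
```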
