Conversation

@d-v-b
Contributor

@d-v-b d-v-b commented Oct 14, 2025

Opening this PR with a draft of a rectilinear chunk grid spec

Member

@LDeakin LDeakin left a comment

Looks good to me. Do we want to include a recommendation that implementations SHOULD use run-length encoding where appropriate when saving metadata?

Implementation here: zarrs/zarrs#284
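
To illustrate the run-length encoding suggestion above, here is a minimal sketch in Python. The `[size, count]` pair layout and the helper names `rle_encode` and `rle_decode` are assumptions made for illustration only; they are not taken from the draft spec or from the zarrs implementation.

```python
from itertools import groupby


def rle_encode(sizes):
    """Collapse runs of equal chunk sizes into [size, count] pairs (hypothetical layout)."""
    return [[size, len(list(run))] for size, run in groupby(sizes)]


def rle_decode(pairs):
    """Expand [size, count] pairs back into an explicit list of chunk sizes."""
    return [size for size, count in pairs for _ in range(count)]


# Chunk sizes along one dimension of a rectilinear grid.
sizes = [10, 10, 10, 10, 5, 5, 7]
encoded = rle_encode(sizes)
decoded = rle_decode(encoded)
```

Long runs of identical chunk sizes, which are common when only a few chunks along a dimension are irregular, shrink to a single pair each, while the decoded list is identical to the original.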

@d-v-b d-v-b marked this pull request as ready for review October 18, 2025 18:47
@d-v-b
Contributor Author

d-v-b commented Oct 18, 2025

This is ready for review. I would like to include a small, complete array that demonstrates this chunk grid, and a JSON schema for the metadata.

Member

@normanrz normanrz left a comment

Thanks, @d-v-b!
This PR fulfills all requirements to be merged and is a great addition to zarr-extensions.
A JSON schema and examples would be great. Do you wish to add these in this PR or in a later one?

@d-v-b
Contributor Author

d-v-b commented Oct 20, 2025

If @jbms, @LDeakin, and @manzt are all OK with how this looks now, then I'm happy to merge and add the schema + example data in a subsequent PR. But I am also wondering if we want to handle the origin of the chunk grid here, while this PR is still open. See #30.

@normanrz
Member

I leave that to you. Let me know when you want this PR merged.

@jbms
Contributor

jbms commented Oct 20, 2025

If we want to support negative chunk indices (in order to allow expansion in the negative direction) then we need to be able to specify the sizes of the negative chunks also.

For a regular grid it is sufficient to have a single grid_origin parameter that specifies the start of chunk 0 in array index space.

But for the rectilinear grid we need to specify both the position in array index space and the position in chunk index space of the start of the chunk size list.
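
A rough sketch of why two origins are needed for the rectilinear case: the chunk size list has to be anchored both in array index space and in chunk index space before an array index can be mapped to a chunk. The parameter names `array_origin` and `chunk_origin` and the function `locate` are hypothetical, chosen only for this illustration; nothing here is part of the draft spec.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, chunk_sizes, array_origin=0, chunk_origin=0):
    """Return (chunk_index, offset_within_chunk) for an array index.

    array_origin: array index at which chunk_sizes[0] begins (hypothetical).
    chunk_origin: chunk index assigned to chunk_sizes[0] (hypothetical).
    """
    # Cumulative chunk boundaries measured from array_origin, e.g. [0, 4, 6, 9].
    edges = [0, *accumulate(chunk_sizes)]
    rel = index - array_origin
    if not 0 <= rel < edges[-1]:
        raise IndexError(f"index {index} is outside the listed chunks")
    # Position of the last boundary that is <= rel.
    k = bisect_right(edges, rel) - 1
    return chunk_origin + k, rel - edges[k]


# Three chunks of sizes 4, 2, 3. The list starts at array index -6 and its
# first entry is chunk -2, so chunk 0 begins at array index 0.
chunk_sizes = [4, 2, 3]
first = locate(-6, chunk_sizes, array_origin=-6, chunk_origin=-2)
middle = locate(-1, chunk_sizes, array_origin=-6, chunk_origin=-2)
zero = locate(0, chunk_sizes, array_origin=-6, chunk_origin=-2)
```

With a regular grid the chunk size is constant, so a single origin fixes every chunk boundary. Here, shifting `chunk_origin` alone would relabel which sizes belong to negative chunk indices, which is why both values have to be stated.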

@jbms
Contributor

jbms commented Oct 20, 2025

Another thing that has been suggested in the past is to allow the logical size of a chunk to differ from the physical size, i.e. allow chunks to be stored with unused padding at both the start and end.

For every chunk, you need to specify the physical size, logical size, and offset of the start of the logical chunk within the physical chunk.

This would allow you to insert and remove elements in the middle of a dimension without having to re-encode chunks, just rename them. And by using kerchunk or icechunk or OCDBT the renaming could be done "virtually".

To avoid even having to rename you could allow an arbitrary virtual to physical chunk index map, where you remap the chunk index also.
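
The padded-chunk idea can be sketched as follows. The `PaddedChunk` class and its field names are hypothetical and only model the three per-chunk values mentioned above (physical size, logical size, offset) along one dimension; this is not a proposed metadata layout.

```python
from dataclasses import dataclass


@dataclass
class PaddedChunk:
    physical_size: int  # number of elements actually stored
    logical_size: int   # number of elements visible in the array
    offset: int         # start of the logical region inside the stored chunk

    def __post_init__(self):
        if not 0 <= self.offset <= self.physical_size - self.logical_size:
            raise ValueError("logical region must fit inside the physical chunk")


def logical_starts(chunks):
    """Array index at which each chunk's logical region begins."""
    starts, position = [], 0
    for chunk in chunks:
        starts.append(position)
        position += chunk.logical_size
    return starts


chunks = [PaddedChunk(8, 8, 0), PaddedChunk(8, 8, 0), PaddedChunk(8, 8, 0)]
before = logical_starts(chunks)

# Remove the first two elements of the middle chunk by editing metadata only:
# the stored chunk keeps all 8 elements, but the first two become padding.
chunks[1].offset += 2
chunks[1].logical_size -= 2
after = logical_starts(chunks)
```

In this sketch the removal shifts the array positions of all later chunks without touching any stored chunk, which is the property the comment above is after.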

@d-v-b
Contributor Author

d-v-b commented Oct 20, 2025

Speaking for Zarr Python, these would require some work to implement -- in particular, we will need to introduce a new array indexing API to work with negative indices that aren't referenced to the end of an array. So for that reason I'm inclined to keep this chunk grid simple for now, provided we are confident we can safely extend it in the future.

I think using additional keys in the configuration + defining new variants of the "kind" field to overload the meaning of chunk_shapes should cover the additional flexibility. Defining a totally new chunk grid could also work, but I think as long as the changes generalize this chunk grid, we should aim to update this spec to a new version with changes that will be breaking for naive implementations.

@jbms
Contributor

jbms commented Oct 20, 2025

The kind should arguably be per-dimension rather than apply to all dimensions. Perhaps the kind could be indicated by the type of the json value instead.

@d-v-b
Contributor Author

d-v-b commented Oct 20, 2025

> The kind should arguably be per-dimension rather than apply to all dimensions.

Our intention for "kind" was that it defines the semantics for the chunk_shapes field. E.g., if "kind" was set to "reference", then chunk_shapes might be a path or URI, and thus per-dimension metadata would not be meaningful without resolving that reference. We could use the type of the chunk_shapes field to express this, but I think an explicit "kind" field gives us more flexibility here.

@jbms
Contributor

jbms commented Oct 20, 2025

> > The kind should arguably be per-dimension rather than apply to all dimensions.
>
> Our intention for "kind" was that it defines the semantics for the chunk_shapes field. E.g., if "kind" was set to "reference", then chunk_shapes might be a path or URI, and thus per-dimension metadata would not be meaningful without resolving that reference. We could use the type of the chunk_shapes field to express this, but I think an explicit "kind" field gives us more flexibility here.

What I meant is that you may want the chunk sizes for one dimension to be stored in a separate 1-d array, but another dimension might have uniform chunking or sizes specified inline in the metadata.
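
One way to picture the per-dimension idea is to dispatch on the JSON type of each dimension's entry, as in the sketch below. The mapping of types to meanings (integer for uniform chunking, list for inline sizes, string for a reference to a separate 1-d array), the `resolve_dimension` function, and the `load_reference` callback are all assumptions for illustration and are not defined by the draft spec.

```python
def resolve_dimension(spec, extent, load_reference=None):
    """Return explicit chunk sizes for one dimension (hypothetical scheme).

    int  -> uniform chunk size; the final chunk is clipped to the extent
    list -> chunk sizes given inline in the metadata
    str  -> reference resolved by the caller-supplied load_reference
    """
    if isinstance(spec, bool):
        raise TypeError("booleans are not a valid chunk specification")
    if isinstance(spec, int):
        if spec <= 0:
            raise ValueError("uniform chunk size must be positive")
        full, rest = divmod(extent, spec)
        sizes = [spec] * full + ([rest] if rest else [])
    elif isinstance(spec, list):
        sizes = list(spec)
    elif isinstance(spec, str):
        if load_reference is None:
            raise ValueError("a loader is required to resolve a reference")
        sizes = list(load_reference(spec))
    else:
        raise TypeError(f"unsupported chunk specification: {spec!r}")
    # Assumption of this sketch: the sizes must tile the dimension exactly.
    if sum(sizes) != extent:
        raise ValueError("chunk sizes do not add up to the dimension extent")
    return sizes


# A 3-d array whose dimensions mix all three styles.
shape = (10, 10, 6)
specs = [4, [3, 3, 4], "chunk_sizes/z"]
fake_store = {"chunk_sizes/z": [1, 2, 3]}  # stands in for a separate 1-d array
grid = [resolve_dimension(s, n, fake_store.__getitem__) for s, n in zip(specs, shape)]
```

A single grid-wide "kind" field would instead force all three dimensions into the same representation, which is the trade-off under discussion here.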

4 participants