Core protocol v3.0 - chunk grids #22

Merged
alimanfoo merged 2 commits into zarr-developers:core-protocol-v3.0-dev from alimanfoo:core-protocol-v3.0-grids
May 21, 2019

@alimanfoo
Member

This PR adds a section introducing chunk grids and defining regular chunk grids.

@alimanfoo
Member Author

alimanfoo commented May 7, 2019

Straw man for discussion. Note that I'm tentatively suggesting that the core protocol stick to defining regular chunk grids aligned to the array origin; protocol extensions can, however, define other grid types.

For example, rectilinear grids where chunks may have different shapes would be addressed via a protocol extension.

Similarly, grids where chunks may have negative indices and the origin of the array may occur anywhere in any chunk (needed to allow arrays to "grow" in the "negative" direction) would be a protocol extension.

Again, just thinking of ways to keep the core protocol simple and minimal but allow for flexibility and development of additional features via protocol extensions.
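As an illustration of what "regular chunk grid aligned to the array origin" implies, here is a minimal sketch (not text from the PR). The names `chunk_index_of` and `grid_shape` are hypothetical. It assumes every chunk has the same shape, that chunk (0, 0, ...) starts at the array origin, and that negative coordinates are out of scope for the core protocol.

```python
from typing import Tuple


def chunk_index_of(element: Tuple[int, ...], chunk_shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Grid index of the chunk holding `element` in a regular, origin-aligned grid."""
    if len(element) != len(chunk_shape):
        raise ValueError("element and chunk_shape must have the same number of dimensions")
    if any(e < 0 for e in element):
        # Assumption: negative coordinates would need a protocol extension.
        raise ValueError("negative coordinates are not covered by this sketch")
    # Floor division maps each coordinate to the chunk that contains it.
    return tuple(e // c for e, c in zip(element, chunk_shape))


def grid_shape(array_shape: Tuple[int, ...], chunk_shape: Tuple[int, ...]) -> Tuple[int, ...]:
    """Number of chunks along each dimension (ceiling division); edge chunks may overhang."""
    return tuple((s + c - 1) // c for s, c in zip(array_shape, chunk_shape))


print(chunk_index_of((5, 12), (4, 10)))  # (1, 1)
print(grid_shape((10, 25), (4, 10)))     # (3, 3)
```

A rectilinear grid could not use the single floor division above, which is one reason it fits more naturally in an extension.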

@alimanfoo alimanfoo requested a review from a team May 7, 2019 22:01
Member

@joshmoore left a comment

Comments from a first reading.

-----------

A chunk grid defines a set of chunks which contain the elements of an
array. The chunks of a grid form a tessellation of the array space,
Member

Do you foresee a simpler grid definition that doesn't include the relationship to an array?

Member Author

I don't think so, I was thinking that the regular grid was the simplest. I think a grid has to be defined in relation to the array, i.e., how does the grid cover the array space. Although I may have misunderstood the question?

Member

Trying to map between what you have here and the "absolute minimal" storage layer proposed by @axtimwalde where for a given grid location one gets back nothing more than a byte stream a la #8 (comment)

Member Author

Ah OK. IIUC what @axtimwalde was saying was that there are some useful functionalities that could be provided without needing to know anything about the grid layout. For example, if you wanted to copy data from one store to another, or recode chunks using a different compressor. But I think the core protocol needs to define the full picture, in the sense that chunks always belong to an array, and that array will have some grid layout that defines what the chunks contain.
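The grid-agnostic operations mentioned here can be sketched without any knowledge of chunk layout. This is a minimal sketch, assuming a store behaves like a mapping from string keys to bytes. `copy_store` and `recode_chunks` are hypothetical names, and `zlib`/`lzma` merely stand in for whatever compressors an array actually uses.

```python
import lzma
import zlib
from typing import Callable, Iterable, Mapping, MutableMapping


def copy_store(src: Mapping[str, bytes], dst: MutableMapping[str, bytes]) -> None:
    """Copy every key/value pair verbatim; no grid layout is consulted."""
    for key, value in src.items():
        dst[key] = value


def recode_chunks(
    store: MutableMapping[str, bytes],
    chunk_keys: Iterable[str],
    decode: Callable[[bytes], bytes] = zlib.decompress,
    encode: Callable[[bytes], bytes] = lzma.compress,
) -> None:
    """Re-encode the given chunks in place with a different compressor."""
    for key in chunk_keys:
        store[key] = encode(decode(store[key]))


src = {"/foo/bar/0.0": zlib.compress(b"chunk bytes")}
dst: dict = {}
copy_store(src, dst)
recode_chunks(dst, list(dst))
```

The caller still has to know which keys hold chunks and how they were encoded, which is part of why the core protocol describes the full picture.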

hyperrectangles but do not all share the same shape.

A grid type also defines rules for constructing a unique key for each
chunk, which is a string of ASCII characters that can be used to save
Member

This is perhaps for a different PR, but I could see specifying a "tuple of strings of ASCII characters in the set [...]". See below.


The key for chunk with grid index (i, j, k, ...) is formed by
concatenating the ASCII string representation of each index, joined
together via the period (".") character. For example, in a 3
Member

Related to the "tuple of strings" from above, I would propose either leaving the joining operation to the backend or going so far as to suggest "/" as the default. My understanding is that cloud storage doesn't suffer under use of "/" but local storage does suffer under use of ".".

Member Author

Good point. In general I imagine that you could have a scheme where there is a prefix (default ""), a separator (default "."), and a suffix (default ""), which could be overridden. I.e., you could allow these to be configured on a per-array basis in the array metadata. I was wondering if that's something we should include in the core protocol, or could be a protocol extension.

Member Author

I've just tentatively pushed an edit which allows for an array to have configurable prefix, suffix and separator for chunk keys, which would be a mechanism to allow e.g. use of "/" as chunk key separator. Happy to row back on this if anyone feels that should be a protocol extension.
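A minimal sketch of the configurable scheme described in this thread, with a prefix (default ""), separator (default ".") and suffix (default ""). The function name `chunk_key` is hypothetical, and the way the array path is joined to the rest of the key with "/" is an assumption inferred from the "/foo/bar/1.23.45" example quoted later in this review.

```python
from typing import Sequence


def chunk_key(
    array_path: str,
    index: Sequence[int],
    prefix: str = "",
    separator: str = ".",
    suffix: str = "",
) -> str:
    """Build a chunk key from a grid index (hypothetical helper, not spec text)."""
    # ASCII decimal representation of each index, joined by the separator.
    joined = separator.join(str(i) for i in index)
    # Assumption: the array path and the chunk part are joined with "/".
    return f"{array_path}/{prefix}{joined}{suffix}"


print(chunk_key("/foo/bar", (1, 23, 45)))                 # /foo/bar/1.23.45
print(chunk_key("/foo/bar", (1, 23, 45), separator="/"))  # /foo/bar/1/23/45
```

With the separator overridden to "/", a file-system store would nest chunks in directories rather than placing every chunk file in one directory.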

For example, in a 3 dimensional array at path "/foo/bar" configured
with default values for the chunk key prefix, suffix and separator,
the key for the chunk at grid index (1, 23, 45) is the string
"/foo/bar/1.23.45".
Member Author

@axtimwalde, @constantinpape, @funkey, do you want to add a configuration option to allow the chunk indices to be given in reverse order, as in n5? Or is it OK to fix on using a single ordering as in zarr v2?
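For concreteness, a reverse-order option might look like the sketch below. The `reverse` flag and the function name `joined_indices` are hypothetical; neither is defined by this PR or by any existing implementation.

```python
from typing import Sequence


def joined_indices(index: Sequence[int], reverse: bool = False, separator: str = ".") -> str:
    """Join chunk grid indices, optionally in reversed dimension order."""
    ordered = list(reversed(index)) if reverse else list(index)
    return separator.join(str(i) for i in ordered)


print(joined_indices((1, 23, 45)))                # 1.23.45
print(joined_indices((1, 23, 45), reverse=True))  # 45.23.1
```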

@alimanfoo alimanfoo force-pushed the core-protocol-v3.0-grids branch from 7a68931 to 98c0c60 Compare May 9, 2019 16:43
@alimanfoo alimanfoo changed the title WIP: Core protocol v3.0 - chunk grids Core protocol v3.0 - chunk grids May 14, 2019
@alimanfoo
Member Author

In the interests of having content together in one place, I'd like to merge this PR into the core-protocol-v3.0-dev branch. We can still discuss, revise and revisit anything after merge. I'll merge tomorrow if no objections.

@alimanfoo alimanfoo merged commit f153d6c into zarr-developers:core-protocol-v3.0-dev May 21, 2019
@alimanfoo alimanfoo deleted the core-protocol-v3.0-grids branch May 21, 2019 11:46