Core protocol v3.0 - conceptual model #17
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,152 @@ | ||
| Zarr core protocol version 3.0 | ||
| ============================== | ||
|
|
||
|
|
||
| Conceptual model | ||
| ---------------- | ||
|
|
||
| A Zarr *hierarchy* is a tree structure, where each node in the tree is | ||
| either a *group* or an *array*. Group nodes may have children | ||
| but array nodes may not. | ||
|
|
||
| Each node in a hierarchy has a *name* which is a string of ASCII | ||
| characters with some additional constraints. Two sibling nodes cannot | ||
| have the same name. The root node does not have a name. | ||
|
|
||
| Each node in a hierarchy has a *path* which uniquely identifies that | ||
| node and defines its location within the hierarchy. The path is formed | ||
| by the "/" character followed by the names of all non-root ancestor | ||
| nodes and the name of the node itself, ordered from the root downwards | ||
| and separated by "/". For example, the path "/foo/bar" identifies a node | ||
| named "bar", whose parent is named "foo", whose parent is the root of | ||
| the hierarchy. The string "/" identifies the root node. | ||
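
A minimal sketch of the path rule described above, in Python. The function name `node_path` and the convention of passing `name=None` for the root are hypothetical and chosen only for illustration; constraints on node names are not checked because this section does not define them yet.

```python
def node_path(ancestor_names, name=None):
    """Build a hierarchy path.

    ancestor_names: names of the non-root ancestors, ordered from the
        root downwards (the root itself has no name, so it is omitted).
    name: the node's own name, or None for the root node (an
        illustrative convention, not part of the protocol text).
    """
    if name is None:
        if ancestor_names:
            raise ValueError("the root node has no ancestors")
        return "/"
    # Leading "/" followed by all names joined with "/".
    return "/" + "/".join([*ancestor_names, name])


print(node_path(["foo"], "bar"))  # /foo/bar
print(node_path([]))              # /
```
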
|
|
||
| An array has a fixed number of zero or more *dimensions*. Each dimension has an | ||
| integer length. The core protocol only considers the case where the | ||
| lengths of all dimensions are finite. However, protocol extensions may | ||
| be defined which allow a dimension to have infinite or variable | ||
| length. | ||
|
|
||
| The *shape* of an array is the tuple of dimension lengths. For | ||
| example, if an array has 2 dimensions, where the length of the first | ||
| dimension is 100 and the length of the second dimension is 20, then | ||
| the shape of the array is (100, 20). | ||
|
|
||
| An array contains zero or more *elements*. Each element can be | ||
| identified by a tuple of coordinates, one for each dimension of the | ||
| array. If all dimensions of an array have finite length, then the | ||
| number of elements in the array is given by the product of the | ||
| dimension lengths. An array element may be empty, or it may have a | ||
| value. | ||
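
As an illustration of shape, coordinates and element count, the sketch below computes the number of elements as the product of the dimension lengths and enumerates coordinate tuples. Zero-based coordinates are an assumption made here; the text above does not fix an origin. The helper names are hypothetical.

```python
import itertools
import math


def num_elements(shape):
    """Number of elements: the product of the dimension lengths."""
    return math.prod(shape)


def coordinates(shape):
    """Yield one coordinate tuple per element (zero-based, by assumption)."""
    return itertools.product(*(range(n) for n in shape))


shape = (100, 20)
print(num_elements(shape))               # 2000
print(next(iter(coordinates(shape))))    # (0, 0)
```

Under the usual empty-product convention, `num_elements(())` evaluates to 1, so a zero-dimensional array would hold a single element addressed by the empty tuple.
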
|
|
||
| An array is associated with a *data type*. A data type defines the set | ||
| of possible values that the array may contain, and a binary | ||
| representation (i.e., sequence of bytes) for each possible value. For | ||
| example, the little-endian 32-bit signed integer data type defines | ||
| binary representations for all integers in the range −2,147,483,648 to | ||
| 2,147,483,647. The core protocol only considers a limited set of data | ||
| types, but protocol extensions may define other data types. | ||
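
To make the data type example concrete, here is a small sketch using Python's standard `struct` module, whose `"<i"` format is a little-endian 4-byte signed integer (two's complement, which is an assumption here since the text above does not name the signed representation). The function names are hypothetical.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def encode_int32_le(value: int) -> bytes:
    """Return the 4-byte little-endian representation of value."""
    if not INT32_MIN <= value <= INT32_MAX:
        raise ValueError("value is outside the 32-bit signed integer range")
    return struct.pack("<i", value)


def decode_int32_le(data: bytes) -> int:
    """Recover the integer from its 4-byte little-endian representation."""
    return struct.unpack("<i", data)[0]


print(encode_int32_le(1))          # b'\x01\x00\x00\x00'
print(encode_int32_le(INT32_MIN))  # b'\x00\x00\x00\x80'
```
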
|
|
||
| An array is divided into a set of *chunks*, where each chunk is a | ||
| hyperrectangle defined by a tuple of intervals, one for each dimension | ||
| of the array. The shape of a chunk is the tuple of interval lengths, | ||
| and the size of a chunk (i.e., number of elements contained within the | ||
| chunk) is the product of its interval lengths. | ||
|
|
||
| The chunks of an array are organised into a *grid*. The core protocol | ||
| only considers the case where all chunks have the same shape and the | ||
| chunks form a regular grid. However, protocol extensions may define | ||
| other grid types such as rectilinear grids. | ||
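
A rough sketch of a regular chunk grid follows. It assumes the grid starts at coordinate zero in every dimension and that chunks at the upper edge may extend past the array boundary; neither detail is specified in the text above, and the helper names are hypothetical.

```python
import math


def chunk_size(chunk_shape):
    """Number of elements in a chunk: product of its interval lengths."""
    return math.prod(chunk_shape)


def grid_shape(array_shape, chunk_shape):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple(-(-n // c) for n, c in zip(array_shape, chunk_shape))


def locate(coords, chunk_shape):
    """Map element coordinates to (chunk index, offset within that chunk)."""
    chunk_index = tuple(i // c for i, c in zip(coords, chunk_shape))
    offset = tuple(i % c for i, c in zip(coords, chunk_shape))
    return chunk_index, offset


print(chunk_size((10, 8)))             # 80
print(grid_shape((100, 20), (10, 8)))  # (10, 3)
print(locate((37, 19), (10, 8)))       # ((3, 2), (7, 3))
```
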
|
|
||
| An array is associated with a *memory layout* which defines how to | ||
| construct a binary representation of a single chunk by organising the | ||
| binary values within the chunk into a single contiguous sequence of | ||
| bytes. The core protocol defines two types of memory layout based on | ||
| "C" (row-major) and "F" (column-major) ordering of values, but | ||
| protocol extensions may define other memory layouts. | ||
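
The difference between the two orderings can be sketched by computing where an element lands in the flattened chunk. This is only an illustration under the conventional meaning of row-major and column-major order; the function name `flat_index` is hypothetical, and the byte offset would be this index multiplied by the size in bytes of one value.

```python
def flat_index(offset, chunk_shape, order="C"):
    """Position of an element within the flattened chunk.

    offset: zero-based coordinates of the element inside the chunk.
    order: "C" (last dimension varies fastest) or
           "F" (first dimension varies fastest).
    """
    dims = list(range(len(chunk_shape)))
    if order == "C":
        dims.reverse()
    elif order != "F":
        raise ValueError("order must be 'C' or 'F'")
    index, stride = 0, 1
    for d in dims:  # visit dimensions from fastest to slowest varying
        index += offset[d] * stride
        stride *= chunk_shape[d]
    return index


print(flat_index((1, 0), (2, 3), order="C"))  # 3
print(flat_index((1, 0), (2, 3), order="F"))  # 1
```
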
|
|
||
| An array is associated with an *encoding pipeline*, which is a | ||
| sequence of zero or more *codecs* that transforms the binary | ||
| representation of a chunk in some way. For example, an encoding | ||
| pipeline might include a checksum codec to ensure data integrity, and | ||
| a compression codec to reduce data size. All codecs implement a common | ||
| *codec interface* which provides a pair of operations, one to perform | ||
| the transformation (encode), the other to reverse the transformation | ||
| (decode). | ||
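
One possible shape for an encoding pipeline is sketched below. The class and function names are hypothetical, the codec interface itself is still a TODO in this document, and applying decoders in reverse order is an assumption made so that decoding undoes encoding. The compression codec uses the standard `zlib` module, and the checksum codec appends a CRC-32 of the data.

```python
import struct
import zlib


class ZlibCodec:
    """Compression codec: reduces data size."""

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class Crc32Codec:
    """Checksum codec: appends a 4-byte CRC-32 and verifies it on decode."""

    def encode(self, data: bytes) -> bytes:
        return data + struct.pack("<I", zlib.crc32(data))

    def decode(self, data: bytes) -> bytes:
        payload, stored = data[:-4], struct.unpack("<I", data[-4:])[0]
        if zlib.crc32(payload) != stored:
            raise ValueError("checksum mismatch")
        return payload


def encode_chunk(pipeline, data: bytes) -> bytes:
    """Apply each codec's encode operation in pipeline order."""
    for codec in pipeline:
        data = codec.encode(data)
    return data


def decode_chunk(pipeline, data: bytes) -> bytes:
    """Apply each codec's decode operation in reverse pipeline order."""
    for codec in reversed(pipeline):
        data = codec.decode(data)
    return data


pipeline = [ZlibCodec(), Crc32Codec()]
raw = bytes(1000)  # binary representation of a chunk (all zero bytes)
encoded = encode_chunk(pipeline, raw)
print(decode_chunk(pipeline, encoded) == raw)  # True
```

An empty pipeline (zero codecs) leaves the chunk bytes unchanged in both directions.
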
|
|
||
| Each node in a hierarchy is represented by a *metadata document*, | ||
|
I suggest adding: An empty metadata document is equivalent to no metadata document and means that there is no metadata associated with the node. This is only possible for trivial group nodes.
Member
Author
Interesting, might need to unpack and discuss a little. FWIW in zarr v2 the presence of a metadata document indicates the existence of a node. E.g., if the key |
||
| which is a machine-readable document containing essential processing | ||
| information about the node. For example, an array metadata document | ||
| will specify the number of dimensions, length of each dimension, data | ||
| type, chunk shape, memory layout and encoding pipeline for that array. | ||
|
|
||
| Each node in a hierarchy may have an *attributes document*, which is a | ||
| machine-readable document containing information that may be useful to | ||
| users of the data but is not essential to the basic processing of the | ||
| node. | ||
|
|
||
| The metadata, attributes and encoded chunk data for all nodes in a | ||
| hierarchy are held in a *store*. To enable a variety of different | ||
| store types to be used, the core protocol defines a simple *store | ||
| interface* which is a common set of operations that a store must | ||
| provide. | ||
|
|
||
|
|
||
| Node names | ||
| ---------- | ||
|
|
||
| TODO define constraints on node names | ||
|
Member
Author
N.B., I don't intend to address this TODO or any below in this PR, just adding them as placeholders for a possible structure for other sections. |
||
|
|
||
|
|
||
| Data types | ||
| ---------- | ||
|
|
||
| TODO define core data types | ||
|
|
||
| Regular chunk grids | ||
| ------------------- | ||
|
|
||
| TODO define regular chunk grids, including how to form a key for each chunk in a grid | ||
|
|
||
|
|
||
| Memory layouts | ||
| -------------- | ||
|
|
||
| TODO define "C" and "F" memory layouts | ||
|
|
||
| Codec interface | ||
| --------------- | ||
|
|
||
| TODO define the codec interface | ||
|
|
||
|
|
||
| Array metadata | ||
| -------------- | ||
|
|
||
| TODO define the structure and content of array metadata documents | ||
|
|
||
|
|
||
| Group metadata | ||
| -------------- | ||
|
|
||
| TODO define the structure and content of group metadata documents | ||
|
|
||
|
|
||
| User attributes | ||
| --------------- | ||
|
|
||
| TODO define attributes documents | ||
|
|
||
|
|
||
| Store interface | ||
| --------------- | ||
|
|
||
| TODO define the store interface | ||
|
|
||
|
|
||
| Storage protocol | ||
| ---------------- | ||
|
|
||
| TODO define how high level operations like creating a group or array | ||
| translate into low level key/value operations on the store interface | ||
|
|
||
Are there any limitations on the characters in that string?