Merged
149 changes: 149 additions & 0 deletions docs/protocol/core/v3.0.rst
@@ -1,3 +1,152 @@
Zarr core protocol version 3.0
==============================


Conceptual model
----------------

A Zarr *hierarchy* is a tree structure, where each node in the tree is
either a *group* or an *array*. Group nodes may have children
but array nodes may not.

Each node in a hierarchy has a *name* which is a string of ASCII
characters with some additional constraints. Two sibling nodes cannot
Member

Are there any limitations on the characters in that string?

have the same name. The root node does not have a name.

Each node in a hierarchy has a *path* which uniquely identifies that
node and defines its location within the hierarchy. The path is formed
by joining together the "/" character, followed by the names of all
ancestor nodes separated by the "/" character, followed by the name of
the node itself. For example, the path "/foo/bar" identifies a node
named "bar", whose parent is named "foo", whose parent is the root of
the hierarchy. The string "/" identifies the root node.
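The path rule above can be sketched in a few lines of Python. The function name ``node_path`` is hypothetical and is not part of the protocol; it only illustrates the joining rule, and it does not validate node names, since the constraints on names are not yet defined.

```python
def node_path(names):
    """Build a hierarchy path from a sequence of node names.

    `names` lists the names from the first node below the root down to
    the node itself. An empty sequence denotes the root node.
    """
    # A leading "/" followed by the names joined with "/".
    return "/" + "/".join(names)


print(node_path(["foo", "bar"]))  # path of node "bar" inside group "foo"
print(node_path([]))              # path of the root node
```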

An array has a fixed number of zero or more *dimensions*. Each dimension has an
integer length. The core protocol only considers the case where the
lengths of all dimensions are finite. However, protocol extensions may
be defined which allow a dimension to have infinite or variable
length.

The *shape* of an array is the tuple of dimension lengths. For
example, if an array has 2 dimensions, where the length of the first
dimension is 100 and the length of the second dimension is 20, then
the shape of the array is (100, 20).

An array contains zero or more *elements*. Each element can be
identified by a tuple of coordinates, one for each dimension of the
array. If all dimensions of an array have finite length, then the
number of elements in the array is given by the product of the
dimension lengths. An array element may be empty, or it may have a
value.
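As an illustration of shape, element count and coordinates, here is a minimal Python sketch. The helper names are hypothetical. It assumes finite dimension lengths, and it treats a zero-dimensional array as having exactly one element (the empty product), which is the NumPy convention rather than something stated in this document.

```python
import math
from itertools import product


def num_elements(shape):
    """Number of elements: the product of the dimension lengths."""
    return math.prod(shape)


def coordinates(shape):
    """Iterate over every coordinate tuple of an array with this shape."""
    return product(*(range(length) for length in shape))


shape = (100, 20)
print(num_elements(shape))            # product of 100 and 20
print(next(iter(coordinates(shape)))) # first coordinate tuple
```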

An array is associated with a *data type*. A data type defines the set
of possible values that the array may contain, and a binary
representation (i.e., sequence of bytes) for each possible value. For
example, the little-endian 32-bit signed integer data type defines
binary representations for all integers in the range −2,147,483,648 to
2,147,483,647. The core protocol only considers a limited set of data
types, but protocol extensions may define other data types.
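To make the idea of a binary representation concrete, the sketch below encodes and decodes values of the little-endian 32-bit signed integer type using Python's standard ``struct`` module. The function names are hypothetical; how the core protocol will identify data types is not yet defined here.

```python
import struct


def encode_int32_le(value):
    """Return the 4-byte little-endian representation of a signed 32-bit integer."""
    # "<i" means little-endian, 4-byte signed integer.
    # struct.error is raised if the value is outside the representable range.
    return struct.pack("<i", value)


def decode_int32_le(data):
    """Recover the integer from its 4-byte little-endian representation."""
    return struct.unpack("<i", data)[0]


print(encode_int32_le(1).hex())
print(decode_int32_le(encode_int32_le(-42)))
```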

An array is divided into a set of *chunks*, where each chunk is a
hyperrectangle defined by a tuple of intervals, one for each dimension
of the array. The shape of a chunk is the tuple of interval lengths,
and the size of a chunk (i.e., number of elements contained within the
chunk) is the product of its interval lengths.

The chunks of an array are organised into a *grid*. The core protocol
only considers the case where all chunks have the same shape and the
chunks form a regular grid. However, protocol extensions may define
other grid types such as rectilinear grids.
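A regular grid can be sketched as follows. This is only an illustration under stated assumptions, because regular chunk grids are still a TODO in this document: the grid is assumed to start at coordinate 0 in every dimension, and chunks on the trailing edge are assumed to be allowed to extend past the array bounds (as in Zarr v2). All function names are hypothetical.

```python
import math


def chunk_size(chunk_shape):
    """Number of elements in a chunk: the product of its interval lengths."""
    return math.prod(chunk_shape)


def grid_shape(array_shape, chunk_shape):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple((n + c - 1) // c for n, c in zip(array_shape, chunk_shape))


def chunk_index(coords, chunk_shape):
    """Grid index of the chunk that contains the element at `coords`."""
    return tuple(i // c for i, c in zip(coords, chunk_shape))


print(grid_shape((100, 20), (10, 5)))
print(chunk_index((57, 13), (10, 5)))
```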

An array is associated with a *memory layout* which defines how to
construct a binary representation of a single chunk by organising the
binary values within the chunk into a single contiguous sequence of
bytes. The core protocol defines two types of memory layout based on
"C" (row-major) and "F" (column-major) ordering of values, but
protocol extensions may define other memory layouts.
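The difference between the two orderings can be shown by computing where an element lands in the flattened chunk. The sketch below is hypothetical helper code, not a definition from the protocol; offsets are counted in elements, so a byte offset would additionally multiply by the size of one value of the data type.

```python
def strides(shape, order="C"):
    """Per-dimension step, in elements, for "C" or "F" ordering."""
    dims = range(len(shape))
    if order == "C":
        dims = reversed(dims)  # last dimension varies fastest
    elif order != "F":
        raise ValueError("order must be 'C' or 'F'")
    result = [0] * len(shape)
    step = 1
    for d in dims:
        result[d] = step
        step *= shape[d]
    return tuple(result)


def flat_offset(coords, shape, order="C"):
    """Position of the element at `coords` in the flattened chunk."""
    return sum(i * s for i, s in zip(coords, strides(shape, order)))


chunk_shape = (2, 3)
print(flat_offset((1, 0), chunk_shape, "C"))  # row-major position
print(flat_offset((1, 0), chunk_shape, "F"))  # column-major position
```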

An array is associated with an *encoding pipeline*, which is a
sequence of zero or more *codecs*, each of which transforms the binary
representation of a chunk in some way. For example, an encoding
pipeline might include a checksum codec to ensure data integrity, and
a compression codec to reduce data size. All codecs implement a common
*codec interface* which provides a pair of operations, one to perform
the transformation (encode), the other to reverse the transformation
(decode).
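A minimal sketch of such a pipeline is shown below. The codec interface itself is still a TODO in this document, so the class names, the ``encode``/``decode`` method signatures, and the choice of zlib and CRC-32 are all assumptions made for illustration. The sketch also assumes that decoding applies the codecs in reverse order.

```python
import struct
import zlib


class ZlibCodec:
    """Hypothetical compression codec built on the standard zlib module."""

    def encode(self, data):
        return zlib.compress(data)

    def decode(self, data):
        return zlib.decompress(data)


class Crc32Codec:
    """Hypothetical checksum codec: appends a 4-byte little-endian CRC-32."""

    def encode(self, data):
        return data + struct.pack("<I", zlib.crc32(data))

    def decode(self, data):
        payload, stored = data[:-4], struct.unpack("<I", data[-4:])[0]
        if zlib.crc32(payload) != stored:
            raise ValueError("checksum mismatch")
        return payload


def encode_chunk(pipeline, data):
    """Apply each codec's encode operation in pipeline order."""
    for codec in pipeline:
        data = codec.encode(data)
    return data


def decode_chunk(pipeline, data):
    """Undo the pipeline by applying decode operations in reverse order."""
    for codec in reversed(pipeline):
        data = codec.decode(data)
    return data


pipeline = [ZlibCodec(), Crc32Codec()]
raw = bytes(1000)  # a chunk of 1000 zero bytes
stored = encode_chunk(pipeline, raw)
print(len(raw), len(stored))
print(decode_chunk(pipeline, stored) == raw)
```

An empty pipeline leaves the chunk bytes unchanged, which matches the "zero or more codecs" wording above.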

Each node in a hierarchy is represented by a *metadata document*,

I suggest adding: An empty metadata document is equivalent to having no metadata document and means that there is no metadata associated with the node. This is only possible for trivial group nodes.

Member Author

Interesting, might need to unpack and discuss a little.

FWIW in zarr v2 the presence of a metadata document indicates the existence of a node. E.g., if the key /foo/bar/.zgroup exists in the store then that implies a group exists in the hierarchy at logical path /foo/bar. Similarly if a key exists in the store at /foo/bar/baz/.zarray then that implies an array exists in the hierarchy at logical path /foo/bar/baz. So you can actually construct the hierarchy just by knowing which keys are present in the store, without even retrieving or reading the metadata documents. That sounds slightly different from what you're suggesting here?

which is a machine-readable document containing essential processing
information about the node. For example, an array metadata document
will specify the number of dimensions, length of each dimension, data
type, chunk shape, memory layout and encoding pipeline for that array.
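For illustration only, an array metadata document carrying the items listed above might look like the following when serialised as JSON. The structure of metadata documents is still a TODO in this document, so every key name and value format here is hypothetical, including the NumPy-style ``"<i4"`` type string and the use of JSON itself. The number of dimensions is implied by the length of ``shape``.

```python
import json

# Hypothetical array metadata; none of these keys are defined by the protocol.
array_metadata = {
    "shape": [100, 20],              # length of each dimension
    "data_type": "<i4",              # assumed identifier for little-endian int32
    "chunk_shape": [10, 5],          # regular grid, all chunks the same shape
    "memory_layout": "C",            # row-major ordering within a chunk
    "encoding_pipeline": ["zlib"],   # assumed codec identifiers, in encode order
}

document = json.dumps(array_metadata, indent=2)
print(document)

# A machine-readable document can be parsed back without loss.
decoded = json.loads(document)
```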

Each node in a hierarchy may have an *attributes document*, which is a
machine-readable document containing information that may be useful to
users of the data but is not essential to the basic processing of the
node.

The metadata, attributes and encoded chunk data for all nodes in a
hierarchy are held in a *store*. To enable a variety of different
store types to be used, the core protocol defines a simple *store
interface* which is a common set of operations that a store must
provide.
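One way to picture a store interface is as a small key/value abstraction. The sketch below is an assumption, since the store interface is still a TODO in this document: the class name, the method names, and the example key are hypothetical and are not taken from the protocol.

```python
class MemoryStore:
    """Hypothetical in-memory store mapping string keys to bytes values."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        """Return the value stored under `key`; raises KeyError if absent."""
        return self._data[key]

    def set(self, key, value):
        """Store `value` (a bytes-like object) under `key`."""
        self._data[key] = bytes(value)

    def delete(self, key):
        """Remove `key`; raises KeyError if absent."""
        del self._data[key]

    def list_keys(self, prefix=""):
        """Return all keys starting with `prefix`, in sorted order."""
        return sorted(k for k in self._data if k.startswith(prefix))


store = MemoryStore()
store.set("example/chunk-0", b"\x00\x01")  # key layout is illustrative only
print(store.list_keys("example/"))
```

Any backend that offers the same operations (a directory on disk, an object store, a zip file) could in principle stand in for the in-memory dictionary.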


Node names
----------

TODO define constraints on node names
Member Author

N.B., I don't intend to address this TODO or any below in this PR, just adding them as placeholders for a possible structure for other sections.



Data types
----------

TODO define core data types

Regular chunk grids
-------------------

TODO define regular chunk grids, including how to form a key for each chunk in a grid


Memory layouts
--------------

TODO define "C" and "F" memory layouts

Codec interface
---------------

TODO define the codec interface


Array metadata
--------------

TODO define the structure and content of array metadata documents


Group metadata
--------------

TODO define the structure and content of group metadata documents


User attributes
---------------

TODO define attributes documents


Store interface
---------------

TODO define the store interface


Storage protocol
----------------

TODO define how high level operations like creating a group or array
translate into low level key/value operations on the store interface