[v3] Structured dtype support #2134
The v3 spec lists a finite set of dtypes, and structured dtypes are not explicitly in the list. To be spec compliant and support structured dtypes, I see two options:
|
@d-v-b @jhamman Has there been any movement on this? I see there is a comment about this in the code base: zarr-python/tests/test_indexing.py, line 161 (at 2fe12a7).
This would be something of a regression for |
@ilan-gold - no progress, and I agree it's a regression. I think we should support it for v2 arrays, but until there is an extension dtype for this in v3, we'll likely need to leave it out. |
That would be very fine with me. We'll need to "break" things for the v3 format anyway, so I don't have strong opinions there. |
I do think we can assume that the v3 spec process will ultimately provide a clear framework for adding new dtypes, so it might be useful to think about basic questions like how these types should be named / parametrized in JSON. |
we would probably start with the v2 conventions, and then consider how to ensure that a v3 version of that isn't too numpy-specific. |
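For reference, the v2 convention stores a structured dtype in the array metadata using numpy's "descr" form, a list of [name, format] pairs; a minimal sketch:

```python
import numpy as np

# zarr v2 serializes this dtype to JSON as [["cdp", "<u4"], ["offset", "<i2"]].
dt = np.dtype([("cdp", "<u4"), ("offset", "<i2")])
print(dt.descr)  # [('cdp', '<u4'), ('offset', '<i2')]
```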
Hi, we use Zarr's structured array support extensively on v2, and I want to implement structured array support for v3 so we can move to it. Also, tensorstore may already have structs working with v3; can @jbms chime in? Am I right that there is no clear-cut framework for adding types? I would like to start that conversation. My recommendation would be to use C-style definitions for the structured data types, which would be somewhat universal and would also conform to numpy semantics. We already have a data model for defining structs in the next-generation version of our MDIO library. Ours is quite verbose, but we chose to do it that way for human readability. For instance, to define an array of structs (a structured/record array in numpy language) with four fields, we do the following, and also name the dimensions. There are some parallels to the ZOM proposals:

```json
{
  "name": "headers",
  "dataType": {
    "fields": [
      { "name": "cdp", "format": "uint32" },
      { "name": "offset", "format": "int16" },
      { "name": "cdp-x", "format": "float64" },
      { "name": "cdp-y", "format": "float64" }
    ]
  },
  "dimensions": ["inline", "crossline"]
}
```

What would be the cleanest interface for implementing structured type support? I believe it would have to be more concise when written to the Zarr JSON metadata. The numpy dtype definition is actually not that bad; I'm talking about this one:
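A minimal sketch of that dict-form numpy dtype specification (standard numpy; the field names here are placeholders):

```python
import numpy as np

# numpy's dict form of a structured dtype: parallel lists of field names
# and formats, plus optional byte offsets and an overall itemsize.
dt = np.dtype({
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16,
})
```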
Offsets, titles, and itemsize are optional. We could omit the titles, but the others transfer to C structs nicely. Unambiguous C numeric types:
For instance, a void type of 16 bytes without fully populated fields can be defined as below and can be cast to numpy or C structs:

```json
{
  "names": ["f1", "f2", "f3"],
  "formats": ["int8", "double", "bool"],
  "offsets": [0, 4, 12],
  "itemsize": 16
}
```

Note the padding after the int8 and after the bool. Endianness is handled nicely by the bytes codec, so we don't have to specify it here. The array metadata could then look something like this (note that the fill value becomes an array):

```json
{
  "shape": [10, 10],
  "data_type": {
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16
  },
  "chunk_grid": {
    "name": "regular",
    "configuration": { "chunk_shape": [10, 10] }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": { "separator": "/" }
  },
  "fill_value": [0, 0.0, false],
  "codecs": [
    { "name": "bytes", "configuration": { "endian": "little" } },
    { "name": "zstd", "configuration": { "level": 0, "checksum": false } }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}
```
|
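To make the mapping concrete, here is a minimal sketch (my own illustration, not an agreed interface) of how such a "data_type" object and the array-valued fill value could be turned into their numpy equivalents:

```python
import numpy as np

def dtype_from_json(data_type: dict) -> np.dtype:
    # numpy accepts exactly this dict form (names/formats/offsets/itemsize),
    # so the proposed metadata maps onto a numpy dtype directly.
    return np.dtype({
        "names": data_type["names"],
        "formats": data_type["formats"],
        "offsets": data_type["offsets"],
        "itemsize": data_type["itemsize"],
    })

dt = dtype_from_json({
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16,
})

# The JSON fill value [0, 0.0, false] becomes a structured scalar.
fill = np.array(tuple([0, 0.0, False]), dtype=dt)[()]
print(dt.itemsize)  # 16
```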
Tensorstore supports structured data types for zarr v2 but not v3. I can imagine that structured data types are convenient for some use cases but they also introduce a number of complications and limitations:
An alternative is the xarray-style representation where there is a separate array per field. |
Good questions! Here are some more comments based on your thoughts:
Separate arrays for fields are great for many of our workloads, but for a major one they significantly slow down batch processing. Our HPC workflow loads TB+-scale arrays of structs into memory, performs operations on them (usually 50-100 fields), and writes them back in batch. Separate arrays become problematic due to the number of I/O operations on disk. |
Hm, looking at it again, it seems like the bytes codec could also define the type configuration? In that case, what would the dtype metadata be, and when would the casting be applied in Zarr Python? |
Right, not supported in numpy, but having a collection of related fields, some of which are scalars and some of which are strings, is fairly common. Any array of variable-length strings can be seen as a ragged array, but I'm not sure I see the relevance of multiple fields to this discussion.
Can you explain what you mean when you say that numpy has a unified way to define SoA or AoS? I was not aware of that.
Separate logical zarr arrays do not necessarily mean separate I/O operations on disk. For example, with icechunk or OCDBT (implemented by tensorstore), corresponding chunks of multiple arrays could be stored consecutively within the same file such that they can be read with a single I/O operation. |
The main distinction is that with the bytes codec the only configuration option would be the endianness of all fields, presumably with an interleaved encoding, while a separate struct codec would permit specifying additional options, such as per-field codecs for a non-interleaved encoding. |
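Purely as an illustration of that distinction (hypothetical; neither the codec name nor its configuration keys exist in any spec today), a per-field struct codec might be configured like this:

```python
# Hypothetical sketch only: a "struct" codec with per-field codec chains,
# enabling a non-interleaved (column-wise) encoding within each chunk.
hypothetical_struct_codec = {
    "name": "struct",
    "configuration": {
        "fields": [
            {"name": "f1", "codecs": [{"name": "bytes"}]},
            {
                "name": "f2",
                "codecs": [
                    {"name": "bytes", "configuration": {"endian": "little"}},
                    {"name": "zstd", "configuration": {"level": 3}},
                ],
            },
        ]
    },
}
```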
e.g. AoS:

```python
>>> import numpy as np
>>> aos_dtype = np.dtype([
...     ("field1", np.int32),
...     ("field2", np.float64),
... ])
>>> np.zeros(5, dtype=aos_dtype)
array([(0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.)],
      dtype=[('field1', '<i4'), ('field2', '<f8')])
```

This serializes as 60 bytes, as expected (array length of 5, with a 12-byte struct). SoA: it is slightly more hacky, but we can make the fields into sized fields and then create a zero-dimensional (singleton) "struct of arrays". It is hacky, but I think it can be read in as void in C and cast to a struct. The dtype description is exactly the same as above, just with field sizes:
dtype=[('field1', '<i4'), ('field2', '<f8')]) This serializes as 60 bytes as expected (array length of 5, with 12-byte length struct). SoA: It is slightly more hacky but we can make the field(s) into a sized field(s), and then we can create a shape=0 (singleton) "struct of arrays". It is hacky, but I think that it can be read in as void in C and cast to a struct. The dtype description is exactly the same as above with field size. >>> import numpy as np
>>> soa_dtype = np.dtype([
>>> ("field1", np.int32, 2),
>>> ('field2', np.float64, 3)]
>>> )
>>> arr = np.asarray(([1, 2], [3.14, 42., 69.]), dtype=soa_dtype)
>>> arr
array(([1, 2], [ 3.14, 42. , 69. ]),
dtype=[('field1', '<i4', (2,)), ('field2', '<f8', (3,))]) This serializes as 32-bytes as expected. It is also a single-element with |
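A quick sanity check of the byte counts quoted above (a minimal sketch, re-declaring the same dtypes):

```python
import numpy as np

aos_dtype = np.dtype([("field1", np.int32), ("field2", np.float64)])
soa_dtype = np.dtype([("field1", np.int32, 2), ("field2", np.float64, 3)])

# AoS: 5 elements x 12-byte packed struct = 60 bytes.
assert len(np.zeros(5, dtype=aos_dtype).tobytes()) == 60

# SoA: one zero-dimensional element, 2*4 + 3*8 = 32 bytes.
soa = np.asarray(([1, 2], [3.14, 42.0, 69.0]), dtype=soa_dtype)
assert soa.shape == () and len(soa.tobytes()) == 32
```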
IIRC, a mix of fixed-length strings and numbers is OK. It's just the variable-length ones they do not support.
This is very interesting; can you elaborate on that and provide an example? |
With OCDBT it is just the default way that it works, and I think the same is true for icechunk. Here is a tensorstore OCDBT example:

```python
# /// script
# requires-python = ">=3.13"
# dependencies = [
#   "tensorstore",
# ]
# ///
import tensorstore as ts

ctx = ts.Context()
ocdbt_spec = {"driver": "ocdbt", "base": "file:///tmp/example/"}

# All 80 arrays share one OCDBT key-value store under /tmp/example/.
with ts.Transaction(atomic=True) as txn:
    variables_futures = [
        ts.open(
            {
                "driver": "zarr3",
                "kvstore": {**ocdbt_spec, "path": f"var{i}/"},
            },
            dtype=ts.uint32,
            shape=[100, 100],
            context=ctx,
            transaction=txn,
            open=True,
            create=True,
        )
        for i in range(80)
    ]
    variables = [v.result() for v in variables_futures]

# Writing all arrays in one atomic transaction lets OCDBT store the
# corresponding chunks of the different arrays together.
with ts.Transaction(atomic=True) as txn:
    for i, v in enumerate(variables):
        v.with_transaction(txn)[...] = i
```
|
Thanks @jbms. To continue progress on this, I need to fully understand where this extension would go. Looking at the v3 spec, it sounds like we would want a data type extension. Is the right thing to do here:
Do we need a ZEP, or do we need to do any other work before we can do this? |
Adding to the specs is a slow road; instead, I recommend starting with an implementation (but build the implementation so that it could be formally specified in a language-agnostic way). Then, once we have something in people's hands, we can write up the spec. Have a look at #2750 and see how structured dtypes would fit in with the framework proposed there. |
The steering council recognizes that the extension process in Zarr 3 is not working as hoped. We are working to fix the extension process and make it more accessible. @joshmoore and @normanrz are pushing hard on this, with the goals of:
For now, I would suggest starting with the implementation, figuring out what the extension spec should look like, and then coming back to the extension spec question in a few weeks once we have this framework in place. |
Should v2 arrays created with structured dtypes work with v3? I found that selecting a field name from a record (e.g. |
Zarr-Python 2 supported structured dtypes. I don't think this has been discussed much, if at all, in the context of version 3.
This once worked:
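Presumably something along these lines with zarr-python 2.x (a reconstruction for illustration, not the original snippet):

```python
import numpy as np
import zarr  # zarr-python 2.x

dt = np.dtype([("a", "<i4"), ("b", "<f8")])
z = zarr.zeros(10, dtype=dt)
z[0] = (1, 2.5)
z["a"]  # selecting a single field from a structured array worked in v2
```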