
[v3] Structured dtype support #2134

Open · jhamman opened this issue Aug 28, 2024 · 19 comments

@jhamman
Member

jhamman commented Aug 28, 2024

Zarr-Python 2 supported structured dtypes. I don't think this has been discussed much, if at all, in the context of version 3.

This once worked:

import numpy as np
import zarr

a = np.array(
    [(b"aaa", 1, 4.2), (b"bbb", 2, 8.4), (b"ccc", 3, 12.6)],
    dtype=[("foo", "S3"), ("bar", "i4"), ("baz", "f8")],
)
za = zarr.array(a, chunks=2, fill_value=None)
za[:]  # returns the structured array back under zarr-python 2
@jhamman jhamman added the V3 label Aug 28, 2024
@jhamman jhamman added this to the After 3.0.0 milestone Aug 28, 2024
@d-v-b
Contributor

d-v-b commented Aug 28, 2024

The v3 spec lists a finite set of dtypes and structured dtypes are not explicitly in the list. To be spec compliant and support structured dtypes, I see two options:

  • express a structured dtype via the raw bits r* dtype. This seems unsatisfactory, because two structured dtypes might have the same raw bits representation but very different semantics (e.g., a pair of float64s vs a pair of int64s will look the same as raw bits; see the sketch after this list)
  • define a spec extension for structured dtypes. I don't think this would be a technical challenge, but it would be important to get consensus across implementations about the JSON serialization for the dtype. We would not want to overfit to numpy conventions. This seems like a better option, but it requires some social logistics.
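
A minimal numpy sketch of that ambiguity: two structured dtypes with identical 16-byte layouts become indistinguishable once viewed as opaque raw bits.

import numpy as np

# Two semantically different structured dtypes with the same 16-byte layout.
floats = np.zeros(3, dtype=[("a", "<f8"), ("b", "<f8")])
ints = np.zeros(3, dtype=[("a", "<i8"), ("b", "<i8")])

# Viewed as an opaque 16-byte void type (the analogue of an r128-style
# raw-bits dtype), the arrays are byte-identical; the field semantics
# are lost.
assert floats.view("V16").tobytes() == ints.view("V16").tobytes()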

@ilan-gold
Contributor

@d-v-b @jhamman Has there been any movement on this? I see there is a comment in the code base about this:

# todo: uncomment the structured array tests when we can make them pass,

This would be something of a regression for anndata as a project if we couldn't support this (or we'd have to go with option 1 and investigate the current status with readers in other languages to see if they could be made to support this functionality).

@jhamman
Member Author

jhamman commented Dec 5, 2024

@ilan-gold - no progress, and I agree it's a regression. I think we should support it for v2 arrays, but until there is an extension dtype for this in v3, we'll likely need to leave it out.

@ilan-gold
Contributor

That would be very fine with me. We'll need to "break" things for the v3 format anyway, so I don't have strong opinions there.

@d-v-b
Contributor

d-v-b commented Dec 5, 2024

but until there is an extension dtype for this in v3

I do think we can assume that the v3 spec process will ultimately provide a clear framework for adding new dtypes, so it might be useful to think about basic questions like how these types should be named / parametrized in JSON.

@d-v-b
Contributor

d-v-b commented Dec 5, 2024

We would probably start with the v2 conventions, and then consider how to ensure that a v3 version of that isn't too numpy-specific.
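
For reference, the v2 conventions encode a structured dtype in .zarray as nested [name, dtype] pairs, mirroring numpy's dtype.descr; a quick sketch using the example from the top of this issue:

import numpy as np

# numpy's dtype.descr is the nested-list form that zarr v2 stores in
# ".zarray" for structured dtypes.
dt = np.dtype([("foo", "S3"), ("bar", "<i4"), ("baz", "<f8")])
print(dt.descr)  # [('foo', '|S3'), ('bar', '<i4'), ('baz', '<f8')]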

@tasansal
Contributor

tasansal commented Jan 25, 2025

Hi, we extensively use Zarr's structured array support on v2. I want to implement structured array support for v3 so we can move to it.

Also, tensorstore may have the structs working with v3 already. Can @jbms chime in?

It sounds like there is no clear-cut framework for adding types yet; I would like to start that conversation. My recommendation would be to use C-style definitions for the structured data types, which would be somewhat universal and also conform to numpy semantics.

We already have a data model in our next-generation version of MDIO library for defining structs.
See here: https://mdio-python.readthedocs.io/en/v1/data_models/data_types.html#structured-type

Ours is quite verbose, but we chose to do it that way for human readability. For instance, to define an array of structs (a structured/record array in numpy terms) with four fields, we do the following and also name the dimensions. There are some parallels to the ZOM proposals:

{
  "name": "headers",
  "dataType": {
    "fields": [
      { "name": "cdp", "format": "uint32" },
      { "name": "offset", "format": "int16" },
      { "name": "cdp-x", "format": "float64" },
      { "name": "cdp-y", "format": "float64" }
    ]
  },
  "dimensions": ["inline", "crossline"]
}

What would be the cleanest interface for implementing structured type support? I believe it has to be more concise when written to the Zarr JSON metadata. The numpy dtype definition is actually not that bad. I'm talking about this one:

{'names': ..., 'formats': ..., 'offsets': ..., 'titles': ..., 'itemsize': ...}

Offsets, titles, and itemsize are optional. We could omit the titles, but the others transfer to C structs nicely.

Unambiguous C numeric types:

  • int8_t / uint8_t for 8-bit signed/unsigned integers (we could omit the _t)
  • int16_t / uint16_t for 16-bit signed/unsigned integers (we could omit the _t)
  • int32_t / uint32_t for 32-bit signed/unsigned integers (we could omit the _t)
  • int64_t / uint64_t for 64-bit signed/unsigned integers (we could omit the _t)
  • float (single precision, ~6-7 decimal digits, usually 32-bit)
  • double (double precision, ~15-16 decimal digits, usually 64-bit)
  • long double (extended precision, platform-dependent, 80-bit or 128-bit)
  • bool, an alias for _Bool defined in <stdbool.h>, making it more readable and idiomatic
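
As a rough sketch (the struct name and fields are illustrative, borrowed from the MDIO example above), these C-style names map one-to-one onto Python's ctypes, which is one way an implementation could sanity-check a field layout:

import ctypes

# Illustrative only: the C-style field types expressed as a ctypes.Structure.
# Note that ctypes applies native alignment rules, so explicit offsets and
# itemsize (as in the JSON form discussed below) may still be needed for
# exact layout control.
class Header(ctypes.Structure):
    _fields_ = [
        ("cdp", ctypes.c_uint32),
        ("offset", ctypes.c_int16),
        ("cdp_x", ctypes.c_double),
        ("cdp_y", ctypes.c_double),
    ]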

For instance, a void type of 16 bytes without fully populated fields can be defined as below and can be cast to numpy or C structs:

{
  "names": ["f1", "f2", "f3"],
  "formats": ["int8", "double", "bool"],
  "offsets": [0, 4, 12],
  "itemsize": 16
}

Note the padding after the int8 (to align the double at offset 4) and after the bool (to reach the 16-byte itemsize). Endianness is handled nicely by the bytes codec, so we don't have to specify it here.
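
As a quick check, that definition maps directly onto numpy's dict dtype form (a sketch; numpy inserts the implied padding):

import numpy as np

# The definition above, fed straight into numpy's dict dtype form.
dt = np.dtype({
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16,
})
assert dt.itemsize == 16  # padding at bytes 1-3 and 13-15 is implied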

Array metadata can be something like the following (note that the fill value becomes an array):

{
  "shape": [
    10,
    10
  ],
  "data_type": {
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16
  },
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        10,
        10
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": [0, 0.0, false],
  "codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "zstd",
      "configuration": {
        "level": 0,
        "checksum": false
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}
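
One detail a sketch can make concrete (this mapping is an assumption, not specified anywhere): an implementation would need to turn the array-valued fill_value back into a structured scalar, e.g.:

import numpy as np

# Hypothetical decoding of the metadata's fill_value into a numpy structured
# scalar, assuming the dtype defined above.
dt = np.dtype({
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16,
})
fill_value = [0, 0.0, False]  # as read from the JSON metadata
fill = np.array(tuple(fill_value), dtype=dt)[()]  # 0-d array -> np.void scalar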

@jbms

jbms commented Jan 26, 2025

Tensorstore supports structured data types for zarr v2 but not v3.

I can imagine that structured data types are convenient for some use cases but they also introduce a number of complications and limitations:

  • Are fields stored interleaved (array of structs, as in numpy) or as struct of arrays? Struct of arrays potentially offers better compression and support for reading/decoding just a subset of the fields.

  • Are variable length strings (likely to be added to zarr v3) supported as fields?

  • How do you specify the encoding of fields? Is only the bytes codec supported, or is there a new struct codec that lets you specify the encoding of each field separately (only really makes sense for struct of arrays representation)?

An alternative is the xarray-style representation where there is a separate array per field.
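
For concreteness, a sketch of that alternative using the zarr-python v2-style API from the top of this issue (field names are illustrative):

import zarr

# One zarr array per field instead of a single structured array.
root = zarr.group()
root.create_dataset("cdp", shape=(100,), chunks=(50,), dtype="uint32")
root.create_dataset("offset", shape=(100,), chunks=(50,), dtype="int16")
root.create_dataset("cdp_x", shape=(100,), chunks=(50,), dtype="float64")
root.create_dataset("cdp_y", shape=(100,), chunks=(50,), dtype="float64")

# Reading "a struct" then means reading the same index from each field array.
record = {name: root[name][0] for name in ("cdp", "offset", "cdp_x", "cdp_y")}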

@tasansal
Contributor

tasansal commented Jan 26, 2025

Good questions! Here are some more comments based on your thoughts:

  • Regarding AoS or SoA, numpy supports both, so we could still support both in Zarr.
  • Using a variable-length string as a field can get very complicated; I don't know how we could approach that. Numpy 2+ supports it for regular arrays, but struct arrays don't allow the new StringDType. How common are these? I haven't encountered this before. It sounds like awkward (jagged) arrays.
  • Regarding encoding, having a struct codec makes the most sense to me. If we make it a bytes codec the consumer will have to know the layout, no? We should make it self-describing, IMO. Numpy has a unified way to define SoA or AoS while forming their dtype. We could take hints from that.

Separate arrays for fields are great for many of our workloads, but for a major one they significantly slow down batch processing. Our HPC workloads load TB+ scale arrays of structs into memory, perform operations on them (usually 50-100 fields), and write them back in batch. It becomes problematic due to the number of I/O operations on disk.

@tasansal
Contributor

tasansal commented Jan 26, 2025

Hm, looking at it again, it seems like the bytes codec could also define the type configuration? In that case, what would the dtype metadata be, and when would the casting be applied in Zarr-Python?

@jbms

jbms commented Jan 26, 2025

Good questions! Here are some more comments based on your thoughts:

  • Regarding AoS or SoA, numpy supports both, so we could still support both in Zarr.
  • Using a variable-length string as a field can get very complicated; I don't know how we could approach that. Numpy 2+ supports it for regular arrays, but struct arrays don't allow the new StringDType. How common are these? I haven't encountered this before. It sounds like awkward (jagged) arrays.

Right, it's not supported in numpy, but having a collection of related fields, some of which are scalars and some of which are strings, is fairly common. Any array of variable-length strings can be seen as a ragged array, but I'm not sure I see the relevance of ragged arrays to this discussion of multiple fields.

  • Regarding encoding, having a struct codec makes the most sense to me. If we make it a bytes codec the consumer will have to know the layout, no? We should make it self-describing, IMO. Numpy has a unified way to define SoA or AoS while forming their dtype. We could take hints from that.

Can you explain what you mean when you say that numpy has a unified way to define SoA or AoS? I was not aware of that.

Separate arrays for fields are great for many of our workloads, but for a major one they significantly slow down batch processing. Our HPC workloads load TB+ scale arrays of structs into memory, perform operations on them (usually 50-100 fields), and write them back in batch. It becomes problematic due to the number of I/O operations on disk.

Separate logical zarr arrays do not necessarily mean separate I/O operations on disk. For example, with icechunk or OCDBT (implemented by tensorstore), corresponding chunks of multiple arrays could be stored consecutively within the same file such that they can be read with a single I/O operation.

@jbms

jbms commented Jan 26, 2025

Hm, looking at it again, it seems like the bytes codec could also define the type configuration? In that case, what would the dtype metadata be, and when would the casting be applied in Zarr-Python?

The main distinction is that with the bytes codec the only configuration option would be the endianness of all fields, presumably with an interleaved encoding, while a separate struct codec would permit specifying additional options like per-field codecs for a non-interleaved encoding.
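
To make the distinction concrete, a hypothetical struct codec entry might look something like the following (every name and key here is invented for illustration; no such codec is specified anywhere):

# Hypothetical "struct" codec configuration with per-field codecs, enabling
# a non-interleaved (struct-of-arrays) encoding.
struct_codec = {
    "name": "struct",
    "configuration": {
        "layout": "struct-of-arrays",
        "fields": [
            {"name": "f1", "codecs": [
                {"name": "bytes", "configuration": {"endian": "little"}},
            ]},
            {"name": "f2", "codecs": [
                {"name": "bytes", "configuration": {"endian": "little"}},
                {"name": "zstd", "configuration": {"level": 3, "checksum": False}},
            ]},
        ],
    },
}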

@tasansal
Contributor

tasansal commented Jan 26, 2025

Can you explain what you mean when you say that numpy has a unified way to define SoA or AoS? I was not aware of that.

e.g. AoS:

>>> import numpy as np
>>> aos_dtype = np.dtype([
...     ("field1", np.int32),
...     ("field2", np.float64),
... ])
>>> np.zeros(5, dtype=aos_dtype)
array([(0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.)],
      dtype=[('field1', '<i4'), ('field2', '<f8')])

This serializes as 60 bytes, as expected (array length of 5 with a 12-byte struct).

SoA:

It is slightly more hacky, but we can give each field a shape, and then create a shape=() (scalar) "struct of arrays". I think it can be read in as void in C and cast to a struct. The dtype description is exactly the same as above, plus the field shapes.

>>> import numpy as np
>>> soa_dtype = np.dtype([
...     ("field1", np.int32, 2),
...     ("field2", np.float64, 3),
... ])
>>> arr = np.asarray(([1, 2], [3.14, 42., 69.]), dtype=soa_dtype)
>>> arr
array(([1, 2], [ 3.14, 42.  , 69.  ]),
      dtype=[('field1', '<i4', (2,)), ('field2', '<f8', (3,))])

This serializes as 32 bytes, as expected (2 × int32 + 3 × float64). It is also a single element with shape=().
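
The byte count and shape can be checked directly:

>>> arr.nbytes
32
>>> arr.shape
()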

@tasansal
Contributor

Right, it's not supported in numpy, but having a collection of related fields, some of which are scalars and some of which are strings, is fairly common. Any array of variable-length strings can be seen as a ragged array, but I'm not sure I see the relevance of ragged arrays to this discussion of multiple fields.

IIRC, a mix of fixed-length strings and numbers is OK. It's just variable-length strings they do not support.

Separate logical zarr arrays do not necessarily mean separate I/O operations on disk. For example, with icechunk or OCDBT (implemented by tensorstore), corresponding chunks of multiple arrays could be stored consecutively within the same file such that they can be read with a single I/O operation.

This is very interesting, can you elaborate more on that and provide an example?

@jbms

jbms commented Jan 28, 2025

Right, it's not supported in numpy, but having a collection of related fields, some of which are scalars and some of which are strings, is fairly common. Any array of variable-length strings can be seen as a ragged array, but I'm not sure I see the relevance of ragged arrays to this discussion of multiple fields.

IIRC, a mix of fixed-length strings and numbers is OK. It's just variable-length strings they do not support.

Separate logical zarr arrays do not necessarily mean separate I/O operations on disk. For example, with icechunk or OCDBT (implemented by tensorstore), corresponding chunks of multiple arrays could be stored consecutively within the same file such that they can be read with a single I/O operation.

This is very interesting, can you elaborate more on that and provide an example?

With OCDBT it is just the default way that it works, and I think the same is true for icechunk.

Here is a tensorstore OCDBT example:

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "tensorstore",
# ]
# ///
import tensorstore as ts

ctx = ts.Context()
ocdbt_spec = {"driver": "ocdbt", "base": "file:///tmp/example/"}

# Create 80 zarr v3 arrays inside a single OCDBT database, atomically.
with ts.Transaction(atomic=True) as txn:
    variables_futures = [
        ts.open(
            {
                "driver": "zarr3",
                "kvstore": {**ocdbt_spec, "path": f"var{i}/"},
            },
            dtype=ts.uint32,
            shape=[100, 100],
            context=ctx,
            transaction=txn,
            open=True,
            create=True,
        )
        for i in range(80)
    ]
    variables = [v.result() for v in variables_futures]

# Write all 80 arrays in one atomic transaction; corresponding chunks are
# stored together in the OCDBT database rather than as separate files.
with ts.Transaction(atomic=True) as txn:
    for i, v in enumerate(variables):
        v.with_transaction(txn)[...] = i

@tasansal
Contributor

tasansal commented Jan 30, 2025

Thanks @jbms. To continue making progress on this, I need to fully understand where this extension would go. Looking at the v3 spec, it sounds like we would want a data_type extension. I also found a very old draft struct type extension doc in PR zarr-developers/zarr-specs#135, but it seems like it never made its way to the main branch.

Is the right thing to do here to:

  1. add data type extension to zarr-specs
  2. implement in zarr-python

Do we need a ZEP, or is there other work we need to do before this can happen?
At what point do we need to discuss, iterate, and get approval?

@jhamman @d-v-b @joshmoore

@d-v-b
Contributor

d-v-b commented Jan 30, 2025

Adding to the specs is a slow road; instead, I recommend starting with an implementation (but building it so that it could be formally specified in a language-agnostic way). Then, once we have something in people's hands, we can write up the spec.

Have a look at #2750 and see how structured dtypes would fit in with the framework proposed there.

@rabernat
Contributor

The steering council recognizes that the extension process in Zarr 3 is not working as hoped. We are working to fix the extension process and make it more accessible. @joshmoore and @normanrz are pushing hard on this, with the goals of:

  • Clarifying how to actually define and share an extension spec
  • Separating the extension process from the ZEP process, so that anyone can publish an extension with minimal red tape
  • Creating infrastructure to support community-based extension specs

For now, I would suggest starting with the implementation, figuring out what the extension spec should look like, and then coming back to the extension spec question in a few weeks once we have this framework in place.

@joseph-long

Should v2 arrays created with structured dtypes work with zarr-python 3? I found that selecting a field name from a record array created with v2 (e.g. zarr_archive['path/to/array']['field_name']) raises an exception under v3, but I'm not sure if that's worth opening an issue for, or if it's just a known limitation for now.
