
[v3] Structured dtype support #2134

Open · jhamman opened this issue Aug 28, 2024 · 19 comments

@jhamman
Member

jhamman commented Aug 28, 2024

Zarr-Python 2 supported structured dtypes. I don't think this has been discussed much, if at all, in the context of version 3.

This once worked:

import numpy as np
import zarr

a = np.array(
    [(b"aaa", 1, 4.2), (b"bbb", 2, 8.4), (b"ccc", 3, 12.6)],
    dtype=[("foo", "S3"), ("bar", "i4"), ("baz", "f8")],
)
za = zarr.array(a, chunks=2, fill_value=None)
za[:]  # returns the structured array back under zarr-python 2
@jhamman jhamman added the V3 label Aug 28, 2024
@jhamman jhamman added this to the After 3.0.0 milestone Aug 28, 2024
@d-v-b
Contributor

d-v-b commented Aug 28, 2024

The v3 spec lists a finite set of dtypes and structured dtypes are not explicitly in the list. To be spec compliant and support structured dtypes, I see two options:

  • express a structured dtype via the raw bits r* dtype. This seems unsatisfactory, because two structured dtypes might have the same raw bits representation but very different semantics (e.g., a pair of float64s vs a pair of int64s will look the same as raw bits; see the sketch after this list)
  • define a spec extension for structured dtypes. I don't think this would be a technical challenge, but it would be important to get consensus across implementations about the JSON serialization for the dtype. We would not want to overfit to numpy conventions. This seems like a better option, but it requires some social logistics.
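
A minimal numpy sketch of that ambiguity: two structured dtypes with identical 16-byte layouts become indistinguishable once viewed as opaque raw bits.

import numpy as np

# Two semantically different structured dtypes with the same 16-byte layout.
floats = np.zeros(3, dtype=[("a", "<f8"), ("b", "<f8")])
ints = np.zeros(3, dtype=[("a", "<i8"), ("b", "<i8")])

# Viewed as an opaque 16-byte void type (the analogue of an r128-style
# raw-bits dtype), the arrays are byte-identical; the field semantics
# are lost.
assert floats.view("V16").tobytes() == ints.view("V16").tobytes()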

@ilan-gold
Contributor

@d-v-b @jhamman Has there been any movement on this? I see there is a comment in the code base about this:

# todo: uncomment the structured array tests when we can make them pass,

This would be something of a regression for anndata as a project if we couldn't support this (or we'd have to go with option 1 and investigate the current status with readers in other languages to see if they could be made to support this functionality).

@jhamman
Member Author

jhamman commented Dec 5, 2024

@ilan-gold - no progress, and I agree it's a regression. I think we should support it for v2 arrays, but until there is an extension dtype for this in v3, we'll likely need to leave it out.

@ilan-gold
Contributor

That would be very fine with me. We'll need to "break" things for the v3 format anyway, so I don't have strong opinions there.

@d-v-b
Contributor

d-v-b commented Dec 5, 2024

but until there is an extension dtype for this in v3

I do think we can assume that the v3 spec process will ultimately provide a clear framework for adding new dtypes, so it might be useful to think about basic questions like how these types should be named / parametrized in JSON.

@d-v-b
Contributor

d-v-b commented Dec 5, 2024

We would probably start with the v2 conventions, and then consider how to ensure that a v3 version of that isn't too numpy-specific.
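
For reference, the v2 conventions encode a structured dtype in .zarray as nested [name, dtype] pairs, mirroring numpy's dtype.descr; a quick sketch using the example from the top of this issue:

import numpy as np

# numpy's dtype.descr is the nested-list form that zarr v2 stores in
# ".zarray" for structured dtypes.
dt = np.dtype([("foo", "S3"), ("bar", "<i4"), ("baz", "<f8")])
print(dt.descr)  # [('foo', '|S3'), ('bar', '<i4'), ('baz', '<f8')]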

@tasansal
Contributor

tasansal commented Jan 25, 2025

Hi, we extensively use Zarr's structured array support on v2. I want to implement structured array support for v3 so we can move to it.

Also, tensorstore may have the structs working with v3 already. Can @jbms chime in?

It sounds like there is no clear-cut framework for adding types yet; I would like to start that conversation. My recommendation would be to use C-style definitions for the structured data types, which would be somewhat universal and also conform to numpy semantics.

We already have a data model in our next-generation version of MDIO library for defining structs.
See here: https://mdio-python.readthedocs.io/en/v1/data_models/data_types.html#structured-type

Ours is quite verbose, but we chose to do it that way for human readability. For instance, to define an array of structs (a structured/record array in numpy terms) with four fields, we do the following and also name the dimensions. There are some parallels to the ZOM proposals:

{
  "name": "headers",
  "dataType": {
    "fields": [
      { "name": "cdp", "format": "uint32" },
      { "name": "offset", "format": "int16" },
      { "name": "cdp-x", "format": "float64" },
      { "name": "cdp-y", "format": "float64" }
    ]
  },
  "dimensions": ["inline", "crossline"]
}

What would be the cleanest interface for implementing structured type support? I believe it has to be more concise when written to the Zarr JSON metadata. The numpy dtype definition is actually not that bad. I'm talking about this one:

{'names': ..., 'formats': ..., 'offsets': ..., 'titles': ..., 'itemsize': ...}

Offsets, titles, and itemsize are optional. We could omit the titles, but the others transfer to C structs nicely.

Unambiguous C numeric types:

  • int8_t / uint8_t for 8-bit signed/unsigned integers (we could omit the _t)
  • int16_t / uint16_t for 16-bit signed/unsigned integers (we could omit the _t)
  • int32_t / uint32_t for 32-bit signed/unsigned integers (we could omit the _t)
  • int64_t / uint64_t for 64-bit signed/unsigned integers (we could omit the _t)
  • float (single precision, ~6-7 decimal digits, usually 32-bit)
  • double (double precision, ~15-16 decimal digits, usually 64-bit)
  • long double (extended precision, platform-dependent, 80-bit or 128-bit)
  • bool, an alias for _Bool defined in <stdbool.h>, making it more readable and idiomatic
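
As a rough sketch (the struct name and fields are illustrative, borrowed from the MDIO example above), these C-style names map one-to-one onto Python's ctypes, which is one way an implementation could sanity-check a field layout:

import ctypes

# Illustrative only: the C-style field types expressed as a ctypes.Structure.
# Note that ctypes applies native alignment rules, so explicit offsets and
# itemsize (as in the JSON form discussed below) may still be needed for
# exact layout control.
class Header(ctypes.Structure):
    _fields_ = [
        ("cdp", ctypes.c_uint32),
        ("offset", ctypes.c_int16),
        ("cdp_x", ctypes.c_double),
        ("cdp_y", ctypes.c_double),
    ]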

For instance, a void type of 16 bytes without fully populated fields can be defined as below and can be cast to numpy or C structs:

{
  "names": ["f1", "f2", "f3"],
  "formats": ["int8", "double", "bool"],
  "offsets": [0, 4, 12],
  "itemsize": 16
}

Note the padding after the int8 (to align the double at offset 4) and after the bool (to reach the 16-byte itemsize). Endianness is handled nicely by the bytes codec, so we don't have to specify it here.
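
As a quick check, that definition maps directly onto numpy's dict dtype form (a sketch; numpy inserts the implied padding):

import numpy as np

# The definition above, fed straight into numpy's dict dtype form.
dt = np.dtype({
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16,
})
assert dt.itemsize == 16  # padding at bytes 1-3 and 13-15 is implied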

Array metadata can be something like the following (note that the fill value becomes an array):

{
  "shape": [
    10,
    10
  ],
  "data_type": {
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16
  },
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        10,
        10
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": [0, 0.0, false],
  "codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "zstd",
      "configuration": {
        "level": 0,
        "checksum": false
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}
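
One detail a sketch can make concrete (this mapping is an assumption, not specified anywhere): an implementation would need to turn the array-valued fill_value back into a structured scalar, e.g.:

import numpy as np

# Hypothetical decoding of the metadata's fill_value into a numpy structured
# scalar, assuming the dtype defined above.
dt = np.dtype({
    "names": ["f1", "f2", "f3"],
    "formats": ["int8", "double", "bool"],
    "offsets": [0, 4, 12],
    "itemsize": 16,
})
fill_value = [0, 0.0, False]  # as read from the JSON metadata
fill = np.array(tuple(fill_value), dtype=dt)[()]  # 0-d array -> np.void scalar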

@jbms

jbms commented Jan 26, 2025

Tensorstore supports structured data types for zarr v2 but not v3.

I can imagine that structured data types are convenient for some use cases but they also introduce a number of complications and limitations:

  • Are fields stored interleaved (array of structs, as in numpy) or as struct of arrays? Struct of arrays potentially offers better compression and support for reading/decoding just a subset of the fields.

  • Are variable length strings (likely to be added to zarr v3) supported as fields?

  • How do you specify the encoding of fields? Is only the bytes codec supported, or is there a new struct codec that lets you specify the encoding of each field separately (only really makes sense for struct of arrays representation)?

An alternative is the xarray-style representation where there is a separate array per field.
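
For concreteness, a sketch of that alternative using the zarr-python v2-style API from the top of this issue (field names are illustrative):

import zarr

# One zarr array per field instead of a single structured array.
root = zarr.group()
root.create_dataset("cdp", shape=(100,), chunks=(50,), dtype="uint32")
root.create_dataset("offset", shape=(100,), chunks=(50,), dtype="int16")
root.create_dataset("cdp_x", shape=(100,), chunks=(50,), dtype="float64")
root.create_dataset("cdp_y", shape=(100,), chunks=(50,), dtype="float64")

# Reading "a struct" then means reading the same index from each field array.
record = {name: root[name][0] for name in ("cdp", "offset", "cdp_x", "cdp_y")}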

@tasansal
Contributor

tasansal commented Jan 26, 2025

Good questions! Here are some more comments based on your thoughts:

  • Regarding AoS or SoA, numpy supports both, so we could still support both in Zarr.
  • Using a variable-length string as a field can get very complicated; I don't know how we could approach that. Numpy 2+ supports it for regular arrays, but struct arrays don't allow the new StringDType. How common are these? I haven't encountered this before. It sounds like awkward (jagged) arrays.
  • Regarding encoding, having a struct codec makes the most sense to me. If we make it a bytes codec the consumer will have to know the layout, no? We should make it self-describing, IMO. Numpy has a unified way to define SoA or AoS while forming their dtype. We could take hints from that.

Separate arrays for fields are great for many of our workloads, but for a major one they significantly slow down batch processing. Our HPC workloads load TB+ scale arrays of structs into memory, perform operations on them (usually 50-100 fields), and write them back in batch. It becomes problematic due to the number of I/O operations on disk.

@tasansal
Contributor

tasansal commented Jan 26, 2025

Hm, looking at it again, it seems like the bytes codec could also define the type configuration? In that case, what would the dtype metadata be, and when would the casting be applied in Zarr-Python?

@jbms

jbms commented Jan 26, 2025

Good questions! Here are some more comments based on your thoughts:

  • Regarding AoS or SoA, numpy supports both, so we could still support both in Zarr.
  • Using a variable-length string as a field can get very complicated; I don't know how we could approach that. Numpy 2+ supports it for regular arrays, but struct arrays don't allow the new StringDType. How common are these? I haven't encountered this before. It sounds like awkward (jagged) arrays.

Right, it's not supported in numpy, but having a collection of related fields, some of which are scalars and some of which are strings, is fairly common. Any array of variable-length strings can be seen as a ragged array, but I'm not sure I see the relevance of ragged arrays to this discussion of multiple fields.

  • Regarding encoding, having a struct codec makes the most sense to me. If we make it a bytes codec the consumer will have to know the layout, no? We should make it self-describing, IMO. Numpy has a unified way to define SoA or AoS while forming their dtype. We could take hints from that.

Can you explain what you mean when you say that numpy has a unified way to define SoA or AoS? I was not aware of that.

Separate arrays for fields are great for many of our workloads, but for a major one they significantly slow down batch processing. Our HPC workloads load TB+ scale arrays of structs into memory, perform operations on them (usually 50-100 fields), and write them back in batch. It becomes problematic due to the number of I/O operations on disk.

Separate logical zarr arrays do not necessarily mean separate I/O operations on disk. For example, with icechunk or OCDBT (implemented by tensorstore), corresponding chunks of multiple arrays could be stored consecutively within the same file such that they can be read with a single I/O operation.

@jbms

jbms commented Jan 26, 2025

Hm, looking at it again, it seems like the bytes codec could also define the type configuration? In that case, what would the dtype metadata be, and when would the casting be applied in Zarr-Python?

The main distinction is that with the bytes codec the only configuration option would be the endianness of all fields, presumably with an interleaved encoding, while a separate struct codec would permit specifying additional options like per-field codecs for a non-interleaved encoding.
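
To make the distinction concrete, a hypothetical struct codec entry might look something like the following (every name and key here is invented for illustration; no such codec is specified anywhere):

# Hypothetical "struct" codec configuration with per-field codecs, enabling
# a non-interleaved (struct-of-arrays) encoding.
struct_codec = {
    "name": "struct",
    "configuration": {
        "layout": "struct-of-arrays",
        "fields": [
            {"name": "f1", "codecs": [
                {"name": "bytes", "configuration": {"endian": "little"}},
            ]},
            {"name": "f2", "codecs": [
                {"name": "bytes", "configuration": {"endian": "little"}},
                {"name": "zstd", "configuration": {"level": 3, "checksum": False}},
            ]},
        ],
    },
}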

@tasansal
Contributor

tasansal commented Jan 26, 2025

Can you explain what you mean when you say that numpy has a unified way to define SoA or AoS? I was not aware of that.

e.g. AoS:

>>> import numpy as np
>>> aos_dtype = np.dtype([
...     ("field1", np.int32),
...     ("field2", np.float64),
... ])
>>> np.zeros(5, dtype=aos_dtype)
array([(0, 0.), (0, 0.), (0, 0.), (0, 0.), (0, 0.)],
      dtype=[('field1', '<i4'), ('field2', '<f8')])

This serializes as 60 bytes, as expected (array length of 5 with a 12-byte struct).

SoA:

It is slightly more hacky, but we can give each field a shape, and then create a shape=() (scalar) "struct of arrays". I think it can be read in as void in C and cast to a struct. The dtype description is exactly the same as above, plus the field shapes.

>>> import numpy as np
>>> soa_dtype = np.dtype([
...     ("field1", np.int32, 2),
...     ("field2", np.float64, 3),
... ])
>>> arr = np.asarray(([1, 2], [3.14, 42., 69.]), dtype=soa_dtype)
>>> arr
array(([1, 2], [ 3.14, 42.  , 69.  ]),
      dtype=[('field1', '<i4', (2,)), ('field2', '<f8', (3,))])

This serializes as 32 bytes, as expected (2 × int32 + 3 × float64). It is also a single element with shape=().
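
The byte count and shape can be checked directly:

>>> arr.nbytes
32
>>> arr.shape
()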

@tasansal
Contributor

Right, it's not supported in numpy, but having a collection of related fields, some of which are scalars and some of which are strings, is fairly common. Any array of variable-length strings can be seen as a ragged array, but I'm not sure I see the relevance of ragged arrays to this discussion of multiple fields.

IIRC, a mix of fixed-length strings and numbers is OK. It's just variable-length strings they do not support.

Separate logical zarr arrays do not necessarily mean separate I/O operations on disk. For example, with icechunk or OCDBT (implemented by tensorstore), corresponding chunks of multiple arrays could be stored consecutively within the same file such that they can be read with a single I/O operation.

This is very interesting, can you elaborate more on that and provide an example?

@jbms

jbms commented Jan 28, 2025

Right, it's not supported in numpy, but having a collection of related fields, some of which are scalars and some of which are strings, is fairly common. Any array of variable-length strings can be seen as a ragged array, but I'm not sure I see the relevance of ragged arrays to this discussion of multiple fields.

IIRC, a mix of fixed-length strings and numbers is OK. It's just variable-length strings they do not support.

Separate logical zarr arrays do not necessarily mean separate I/O operations on disk. For example, with icechunk or OCDBT (implemented by tensorstore), corresponding chunks of multiple arrays could be stored consecutively within the same file such that they can be read with a single I/O operation.

This is very interesting, can you elaborate more on that and provide an example?

With OCDBT it is just the default way that it works, and I think the same is true for icechunk.

Here is a tensorstore OCDBT example:

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "tensorstore",
# ]
# ///
import tensorstore as ts

ctx = ts.Context()
ocdbt_spec = {"driver": "ocdbt", "base": "file:///tmp/example/"}

# Create 80 zarr v3 arrays inside a single OCDBT database, atomically.
with ts.Transaction(atomic=True) as txn:
    variables_futures = [
        ts.open(
            {
                "driver": "zarr3",
                "kvstore": {**ocdbt_spec, "path": f"var{i}/"},
            },
            dtype=ts.uint32,
            shape=[100, 100],
            context=ctx,
            transaction=txn,
            open=True,
            create=True,
        )
        for i in range(80)
    ]
    variables = [v.result() for v in variables_futures]

# Write all 80 arrays in one atomic transaction; corresponding chunks are
# stored together in the OCDBT database rather than as separate files.
with ts.Transaction(atomic=True) as txn:
    for i, v in enumerate(variables):
        v.with_transaction(txn)[...] = i

@tasansal
Contributor

tasansal commented Jan 30, 2025

Thanks @jbms. To continue making progress on this, I need to fully understand where this extension would go. Looking at the v3 spec, it sounds like we would want a data_type extension. I also found a very old draft struct type extension doc in PR zarr-developers/zarr-specs#135, but it seems like it never made its way to the main branch.

Is the right thing to do here to:

  1. add data type extension to zarr-specs
  2. implement in zarr-python

Do we need a ZEP, or is there other work we need to do before this can happen?
At what point do we need to discuss, iterate, and get approval?

@jhamman @d-v-b @joshmoore

@d-v-b
Contributor

d-v-b commented Jan 30, 2025

Adding to the specs is a slow road; instead, I recommend starting with an implementation (but building it so that it could be formally specified in a language-agnostic way). Then, once we have something in people's hands, we can write up the spec.

Have a look at #2750 and see how structured dtypes would fit in with the framework proposed there.

@rabernat
Contributor

The steering council recognizes that the extension process in Zarr 3 is not working as hoped. We are working to fix the extension process and make it more accessible. @joshmoore and @normanrz are pushing hard on this, with the goals of:

  • Clarifying how to actually define and share an extension spec
  • Separating the extension process from the ZEP process, so that anyone can publish an extension with minimal red tape
  • Creating infrastructure to support community-based extension specs

For now, I would suggest starting with the implementation, figuring out what the extension spec should look like, and then coming back to the extension spec question in a few weeks once we have this framework in place.

@joseph-long

Should v2 arrays created with structured dtypes work with zarr-python 3? I found that selecting a field name from a record array created with v2 (e.g. zarr_archive['path/to/array']['field_name']) raises an exception under v3, but I'm not sure if that's worth opening an issue for, or if it's just a known limitation for now.
