
Add built-in support to write compound data types #1013

Merged (25 commits) on Dec 9, 2022

Conversation

@tamasgal (Contributor) commented Sep 27, 2022:

This PR contains a utility function (EDIT: after some discussion, it is now realised by adding a method to datatype()) which I carry around in my projects. I discovered that there is already a related open issue, see #819.

It's by no means fully fledged, but it works very well for structs with primitive field types. It's not reversible, meaning that read will yield the usual NamedTuple. One way to make it reversible would be to attach metadata to the dataset (e.g. the type name) and, when reading it back, check if the type is already defined in the namespace and create it dynamically if not. This is roughly what I do in https://github.com/JuliaHEP/UnROOT.jl, where generating types at runtime is mandatory.
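The name-based lookup described above could start with something like the following (a minimal sketch; resolve_stored_type and the attribute convention are hypothetical, not part of this PR):

```julia
# Hypothetical sketch: assuming the type name was stored as a string
# attribute on the dataset at write time, look it up on read and fall
# back to `nothing` (i.e. keep the NamedTuple) when it is not defined.
function resolve_stored_type(mod::Module, name::AbstractString)
    sym = Symbol(name)
    # A full implementation could `eval` a new struct definition here
    # instead of returning `nothing`.
    return isdefined(mod, sym) ? getfield(mod, sym) : nothing
end
```

For example, resolve_stored_type(Base, "Int") returns the Int type, while an unknown name yields nothing.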

I provided an example in the function docstring which demonstrates that it's fairly intuitive to use and covers the basic needs. Here it is:

julia> struct Foo
           x::Int32
           y::Float64
       end

julia> foos = [[Foo(1, 2) Foo(3, 4) Foo(5, 6)]; [Foo(7, 8) Foo(9, 10) Foo(11, 12)]]
2×3 Matrix{Foo}:
 Foo(1, 2.0)  Foo(3, 4.0)   Foo(5, 6.0)
 Foo(7, 8.0)  Foo(9, 10.0)  Foo(11, 12.0)

julia> h5open("foo.h5", "w") do h5f
           write_dataset(h5f, "the/foo", foos)
       end

julia> thefoo = h5open("foo.h5", "r") do file
           read(file, "the/foo")
       end
2×3 Matrix{NamedTuple{(:x, :y), Tuple{Int32, Float64}}}:
 (x = 1, y = 2.0)  (x = 3, y = 4.0)   (x = 5, y = 6.0)
 (x = 7, y = 8.0)  (x = 9, y = 10.0)  (x = 11, y = 12.0)

The data can be reinterpreted to the original data type using reinterpret():

julia> reinterpret(Foo, thefoo)
2×3 reinterpret(Foo, ::Matrix{NamedTuple{(:x, :y), Tuple{Int32, Float64}}}):
 Foo(1, 2.0)  Foo(3, 4.0)   Foo(5, 6.0)
 Foo(7, 8.0)  Foo(9, 10.0)  Foo(11, 12.0)

(Resolved review thread on src/datasets.jl)
@mkitti (Member) left a comment:

At minimum, I think this should be write_compound_dataset, since one could also write a compound type to an attribute.

My preference would be to extend HDF5.datatype to work on arbitrary structs and named tuples so that we can just use the normal write_dataset interface. After enabling that, then we should reevaluate if a special convenience method is still warranted.

(Resolved review threads on src/datasets.jl, test/compound.jl, and src/HDF5.jl)
@tamasgal changed the title from "Utility function to write compound data types" to "Add built-in support to write compound data types" on Oct 10, 2022
@tamasgal (Contributor, Author):

Let me comment at the end since I changed a few things.

  • write_compound_dataset() is now removed
  • datatype() now has a new method, datatype(A::AbstractArray{T}) where {T},

so that write_dataset() now automatically works with an array of (flat) structs.

@mkitti I am still not sure about the type freedom, so let me know what you think!

(Resolved review thread on src/typeconversions.jl)
@mkitti (Member) commented Oct 12, 2022:

@tamasgal, in case you don't follow what I'm talking about in the comments, I mean the following:

julia> import HDF5: API, Datatype

julia> HDF5.datatype(x::T) where T = HDF5.datatype(x, Val(isstructtype(T)))

julia> function HDF5.datatype(x::T, isstruct::Val{true}) where T
           dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
           for (idx, fn) in enumerate(fieldnames(T))
               API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
           end
           Datatype(dtype)
       end

julia> datatype(MyStruct(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> datatype((x=5, y=3.f0))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I64LE "x" : 0;
      H5T_IEEE_F32LE "y" : 8;
   }

@tamasgal (Contributor, Author):

Ah yes, I was struggling with that. I had not come up with the Val solution myself, thanks, I will try ;)

@mkitti (Member) commented Oct 12, 2022:

People may still want to override this method for their specific types. Your implementation relies on Julia's memory-aligned layout, which may take up more space than needed.

julia> struct MyStruct32
           x::Int32
           y::Int32
       end

julia> datatype(MyStruct32(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I32LE "y" : 4;
   }

julia> struct MyStructMixed
           x::Int32
           y::Int
       end

julia> datatype(MyStructMixed(3,5))
HDF5.Datatype: H5T_COMPOUND {
      H5T_STD_I32LE "x" : 0;
      H5T_STD_I64LE "y" : 8;
   }

julia> sizeof(MyStruct32)
8

julia> sizeof(MyStructMixed)
16
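The difference between the two sizes above is alignment padding, which can be computed directly in Julia (MixedDemo here is an illustrative stand-in for MyStructMixed, not code from the PR):

```julia
struct MixedDemo        # illustrative stand-in for MyStructMixed
    x::Int32
    y::Int64
end

# Raw field sizes, ignoring alignment padding:
packed = sum(sizeof(fieldtype(MixedDemo, i)) for i in 1:fieldcount(MixedDemo))  # 12
# Julia's in-memory size, with 4 bytes of padding after `x`:
aligned = sizeof(MixedDemo)                                                     # 16
# fieldoffset shows where the padding sits: `y` starts at byte 8, not 4
offsets = ntuple(i -> fieldoffset(MixedDemo, i), fieldcount(MixedDemo))
```

Writing the packed layout would save space on disk but no longer match Julia's in-memory representation, which is the trade-off discussed below.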

As for the NamedTuple issue, perhaps people should write conversion methods from a NamedTuple to their own types? It might be as simple as using Base.@kwdef:

julia> Base.@kwdef struct MyStructKW
           x::Int32
           y::Int
       end
MyStructKW

julia> MyStructKW(;(y = 1, x = 2)...)
MyStructKW(2, 1)

julia> Base.convert(::Type{MyStructKW}, nt::NamedTuple) = MyStructKW(; nt...)

julia> convert(MyStructKW, (y = 1, x = 2))
MyStructKW(2, 1)

@tamasgal (Contributor, Author):

You are right about the memory padding; I guess I will just use the actual sizes as offsets then.

I am not sure I understand regarding NamedTuple; at least in the tests they work fine. Or did I miss something? ;)

@mkitti (Member) commented Oct 12, 2022:

I'm talking about your comment:

It's not reversible, meaning that read will yield the usual NamedTuple

If they wrote a convert method they could just use a type assertion to get what they want.
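Concretely (with a hypothetical PointKW type, echoing the @kwdef pattern above): once a convert method exists, a type-annotated assignment or an element-wise convert turns the NamedTuples coming back from read into the user's type:

```julia
Base.@kwdef struct PointKW   # hypothetical user type
    x::Int32
    y::Int
end
Base.convert(::Type{PointKW}, nt::NamedTuple) = PointKW(; nt...)

# A type-annotated local converts on assignment, so the NamedTuple
# returned by `read` becomes the user's type automatically:
function as_pointkw(nt)
    p::PointKW = nt
    return p
end
```

For an array of NamedTuples, convert(Vector{PointKW}, nts) converts element by element through the same method.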

@mkitti (Member) commented Oct 12, 2022:

You are right about the memory padding, I guess I will just use the actual sizes as offset then.

We would have to test the conversion closely, I think. This may make memory mapping difficult, for example.

@tamasgal (Contributor, Author):

Ah yes, understood :). I guess that should go into the docs then.

Btw, regarding the "Type Array does not have a definite size." error: I think it would make sense to simply use the length of the passed AbstractArray at the time of writing the dataset. Using StaticArrays is of course a viable option, but it might be more user-friendly to also support the "obvious" use case. Or do you think we should force fixed-size sets?

@tamasgal (Contributor, Author):

You are right about the memory padding, I guess I will just use the actual sizes as offset then.

We would have to test the conversion closely I think. This may make memory mapping difficult for example.

I see. I don't know much about the memory mapping features of HDF5.jl but I'll check how to test it.

@tamasgal (Contributor, Author):

I am still struggling to find out where I hit the "Type Array does not have a definite size." error. When I define datatype() for AbstractArrays, it does not complain and uses the actual size of the array (read: everything works as expected). If I define it only for the struct (the very same function body, just a different dispatch type), the error bubbles up from somewhere.

@mkitti (Member) commented Oct 12, 2022:

What is the stack trace?

@tamasgal (Contributor, Author):

So with this it works (but of course not for writing single structs):

datatype(x::AbstractArray{T}) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::AbstractArray{T}, isstruct::Val{true}) where {T}
    dtype = API.h5t_create(API.H5T_COMPOUND, sizeof(T))
    for (idx, fn) in enumerate(fieldnames(T))
        API.h5t_insert(dtype, fn, fieldoffset(T, idx), datatype(fieldtype(T, idx)))
    end
    Datatype(dtype)
end

and if I change the signatures to:

datatype(x::T) where {T} = HDF5.datatype(x, Val(isstructtype(T)))
function datatype(x::T, isstruct::Val{true}) where {T}

and call with the same tests, like

    bars = [
        [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
        [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
    ]

    namedtuples = [(a=1, b=2.3), (a=4, b=5.6)]

    fn = tempname()
    h5open(fn, "w") do h5f
       write_dataset(h5f, "the/bars", bars)
       write_dataset(h5f, "the/namedtuples", namedtuples)
    end

I get "Type Array does not have a definite size." and I don't even find that line in the source code; it's most likely in HDF5_jll. I checked the dataspace() methods but I have not stepped through with the debugger.

  Got exception outside of a @test
  Type Array does not have a definite size.
  Stacktrace:
    [1] sizeof(x::Type)
      @ Base ./essentials.jl:473
    [2] datatype(x::Matrix{Bar}, isstruct::Val{true})
      @ HDF5 ~/Dev/HDF5.jl/src/typeconversions.jl:66
    [3] datatype(x::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/typeconversions.jl:64
    [4] create_dataset(parent::HDF5.File, name::String, data::Matrix{Bar}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:268
    [5] create_dataset(parent::HDF5.File, name::String, data::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:265
    [6] write_dataset(parent::HDF5.File, name::String, data::Matrix{Bar}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:282
    [7] write_dataset(parent::HDF5.File, name::String, data::Matrix{Bar})
      @ HDF5 ~/Dev/HDF5.jl/src/datasets.jl:279
    [8] (::var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}})(h5f::HDF5.File)
      @ Main ~/Dev/HDF5.jl/test/compound.jl:167
    [9] (::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, HDF5.File})()
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:96
   [10] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
      @ Base ./task.jl:292
   [11] h5open(::var"#254#257"{Vector{NamedTuple{(:a, :b), Tuple{Int64, Float64}}}, Matrix{Bar}}, ::String, ::Vararg{String}; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:91
   [12] h5open(::Function, ::String, ::String)
      @ HDF5 ~/Dev/HDF5.jl/src/file.jl:89
   [13] macro expansion
      @ ~/Dev/HDF5.jl/test/compound.jl:165 [inlined]
   [14] macro expansion
      @ ~/.julia/juliaup/julia-1.8.1+0.aarch64/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
   [15] top-level scope
      @ ~/Dev/HDF5.jl/test/compound.jl:157
   [16] include(fname::String)
      @ Base.MainInclude ./client.jl:476
   [17] macro expansion
      @ ~/Dev/HDF5.jl/test/runtests.jl:24 [inlined]
   [18] macro expansion
      @ ~/.julia/juliaup/julia-1.8.1+0.aarch64/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
   [19] top-level scope
      @ ~/Dev/HDF5.jl/test/runtests.jl:19
   [20] include(fname::String)
      @ Base.MainInclude ./client.jl:476
   [21] top-level scope
      @ none:6
   [22] eval
      @ ./boot.jl:368 [inlined]
   [23] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:276
   [24] _start()
      @ Base ./client.jl:52
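Frame [1] of the trace points at the root cause: with the fully generic signature, datatype(x::T) also captures the whole array, and sizeof on an Array type has no definite size. A minimal reproduction outside HDF5.jl (BarLike is an illustrative stand-in for Bar):

```julia
struct BarLike   # illustrative stand-in for Bar
    a::Int64
end

sizeof(BarLike)              # 8 — a concrete struct has a definite size
# sizeof on the array *type* throws the very error from the stack trace:
# "Type Array does not have a definite size."
try
    sizeof(Matrix{BarLike})
catch err
    err
end
```

Restricting the method to AbstractArray{T} and calling sizeof(T) on the element type, as in the working version above, avoids this.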

@tamasgal (Contributor, Author):

I also checked with StaticArrays, but then it hangs and uses 100% CPU (I waited for 5 minutes; it's stuck somewhere):

    bars = @SMatrix [
        [Bar(1, 1.1, true) Bar(2, 2.1, false) Bar(3, 3.1, true)]
        [Bar(4, 4.1, false) Bar(5, 5.1, true) Bar(6, 6.1, false)]
    ]

    namedtuples = @SVector [(a=1, b=2.3), (a=4, b=5.6)]

@@ -60,6 +60,25 @@ function datatype(str::VLen{C}) where {C<:CharType}
Datatype(API.h5t_vlen_create(type_id))
end

# Compound types

datatype(::T) where {T} = Datatype(hdf5_type_id(T), false)
Collaborator:

why toclose = false?

@tamasgal (Contributor, Author):

To be honest, I am not sure why this is needed, I took it over from @mkitti

Member:

I'm just following similar methods here:

datatype(x::ScalarType) = Datatype(hdf5_type_id(typeof(x)), false)
datatype(::Type{T}) where {T<:ScalarType} = Datatype(hdf5_type_id(T), false)
datatype(A::AbstractArray{T}) where {T<:ScalarType} = Datatype(hdf5_type_id(T), false)

@simonbyrne (Collaborator) commented Oct 24, 2022:

Those are simply accessing pre-defined constants:

hdf5_type_id(::Type{Bool}) = API.H5T_NATIVE_B8
hdf5_type_id(::Type{Int8}) = API.H5T_NATIVE_INT8
hdf5_type_id(::Type{UInt8}) = API.H5T_NATIVE_UINT8
hdf5_type_id(::Type{Int16}) = API.H5T_NATIVE_INT16
hdf5_type_id(::Type{UInt16}) = API.H5T_NATIVE_UINT16
hdf5_type_id(::Type{Int32}) = API.H5T_NATIVE_INT32
hdf5_type_id(::Type{UInt32}) = API.H5T_NATIVE_UINT32
hdf5_type_id(::Type{Int64}) = API.H5T_NATIVE_INT64
hdf5_type_id(::Type{UInt64}) = API.H5T_NATIVE_UINT64
hdf5_type_id(::Type{Float32}) = API.H5T_NATIVE_FLOAT
hdf5_type_id(::Type{Float64}) = API.H5T_NATIVE_DOUBLE
hdf5_type_id(::Type{Reference}) = API.H5T_STD_REF_OBJ
hdf5_type_id(::Type{<:AbstractString}) = API.H5T_C_S1

In this case you're creating a new datatype handle on each call to the function: either they need to be cleaned up via GC, or you need some sort of caching mechanism so that subsequent invocations won't keep creating new handles.

Member:

We're also defining new hdf5_type_id methods below.

Using finalizers can get messy due to multithreading issues. What we should consider, perhaps in another pull request, is a do method where we automatically close the Datatype on the current thread.

Collaborator:

We're also defining new hdf5_type_id below.

What I meant is that, e.g., hdf5_type_id(::Type{Int64}) just accesses a constant; it doesn't create a new handle. The new method in this PR (https://github.com/JuliaIO/HDF5.jl/pull/1013/files#diff-bf557f1b887a640c4fe55cee7279c9f621001ff726a876b9f7bfbd7f8a7d0b44R70) calls h5t_create, creating a new handle. These need to be cleaned up, or you'll have a memory leak.

I'm not sure I understand the threading issue: we have finalizers on lots of other objects, why would Datatypes be special in this case?

Member:

The finalizers are problematic elsewhere as well. But we also have a do interface in many of those cases which helps.

Quoting Jeff Bezanson in JuliaLang/julia#11207:

Finalizers are inefficient and unpredictable. And with the new GC, it might take much longer to get around to freeing an object, therefore tying up its resources longer. Ideally releasing external resources should not be tied to how memory management works.

The problem is particularly acute here since the underlying HDF5 C library is not thread-safe. JuliaLang/julia#45272 improves the situation by making the finalizer calls eager and deterministic, making it likely that the finalizer calls H5T_close using the same thread.

The better solution here is either to explicitly call close, which should invoke HDF5.API.h5t_close, or to provide a do syntax to assist with this.
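The do-style helper under discussion could look roughly like this (a sketch only; with_closing and DummyHandle are hypothetical names, shown with a dummy resource instead of a real HDF5 Datatype):

```julia
# Sketch of a do-block helper that guarantees eager cleanup, independent
# of when the GC would run a finalizer.
function with_closing(f, resource)
    try
        f(resource)           # the do-block body runs with the resource
    finally
        close(resource)       # always closed, even if `f` throws
    end
end

# Dummy closable resource, standing in for an HDF5 Datatype handle:
mutable struct DummyHandle
    closed::Bool
end
Base.close(h::DummyHandle) = (h.closed = true)
```

Usage mirrors h5open's do form: with_closing(h) do dt ... end, so the handle is closed on the same thread that created it.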

Collaborator:

I don't disagree, and adding some sort of do syntax pattern is a good idea, but that doesn't preclude having finalizers as well for when that approach isn't feasible.

Within the scope of this PR, these objects should be finalized, or otherwise you're going to get memory leaks.

@tamasgal (Contributor, Author):

Writing a dataset with an array of mutable structs will fail silently, and just write garbage data.

I played around a bit, but the problem is bigger than I thought. It basically boils down to the low-level I/O, where pointers are written instead of the values of the primitive types.

I am not sure what the best approach is, but one quick solution would be to throw an error if !isimmutable(T) (we would need something similar for Julia < 1.7). I don't think mutable structs are common in data dumped to HDF5, but I might be wrong. Note that nested structures will not work either, so we might get away with throwing an error until someone really needs mutable struct support (or nested structures).
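The silent failure follows from the memory layout: an array of mutable structs stores heap references rather than inline field values, so a raw memory write emits pointers. A quick check (MutableBar is illustrative):

```julia
mutable struct MutableBar   # illustrative
    a::Int64
    b::Float64
end

# An array of mutable structs holds heap references, not inline field
# data, which is why a raw write produces pointer "garbage":
bars = [MutableBar(1, 2.0), MutableBar(3, 4.0)]
isbitstype(eltype(bars))    # false — elements are references, not bits
```

A guard like isbitstype(T) || error(...) before the low-level write would therefore catch this case.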

What do you think?

@mkitti (Member) commented Oct 24, 2022:

We may need to dispatch on isbitstype.

@nhz2 (Member) commented Oct 24, 2022:

Yeah, I think it makes sense to error somewhere if the struct type isn't isbitstype. Also, if you update to Compat 3.40 you can use ismutabletype, but I'm not sure that would suffice, because an immutable struct may have fields of mutable struct type.
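The distinction can be seen with a small example (types illustrative; ismutabletype requires Julia ≥ 1.7 or Compat): a mutability check alone misses an immutable struct wrapping a mutable field, while isbitstype rejects it.

```julia
mutable struct InnerHandle   # illustrative
    ptr::Int64
end
struct OuterWrapper          # immutable, but carries a mutable field
    h::InnerHandle
end

ismutabletype(OuterWrapper)  # false — a mutability check alone passes it
isbitstype(OuterWrapper)     # false — correctly rejects the pointer field
isbitstype(InnerHandle)      # false
```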

@mkitti (Member) commented Oct 25, 2022:

I have a solution for mutable types. Basically, we need to convert the mutable type to a NamedTuple. Fortunately, we already have HDF5.get_mem_compatible_jl_type, which describes exactly that NamedTuple. I will push soon.

@tamasgal (Contributor, Author):

Thanks for jumping in @mkitti ! :)

How should we proceed with the finalisation problem?

@mkitti (Member) commented Oct 25, 2022:

Thanks for jumping in @mkitti ! :)

How should we proceed with the finalisation problem?

For now, I've changed the second argument to true in 8eacf5f, with some reservations.

We should probably eagerly close any compound datatypes that we create. Perhaps we should even close any compound datatypes passed into these methods. The larger issues of threading and the overall use of finalization should be addressed, but perhaps elsewhere.

@mkitti mentioned this pull request on Nov 18, 2022
@simonbyrne (Collaborator):

One thing we recently added in MPI.jl (which has a similar construction for custom datatypes) is an internal cache, which may be helpful when reading/writing multiple times:
https://github.com/JuliaParallel/MPI.jl/blob/eaa2599f4f79611852df21d7c86cd81da8b9f156/src/datatypes.jl#L132-L153
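A minimal sketch of such a cache, loosely modeled on the MPI.jl approach linked above (DATATYPE_CACHE, build_datatype, and cached_datatype are all hypothetical names, and build_datatype stands in for the real h5t_create-based constructor):

```julia
# Hypothetical per-type cache: build the handle once, reuse it afterwards.
const DATATYPE_CACHE = Dict{DataType,Any}()

const build_calls = Ref(0)                 # instrumentation for this sketch only
function build_datatype(::Type{T}) where {T}
    build_calls[] += 1
    return "datatype handle for $T"        # a real version would call h5t_create
end

cached_datatype(::Type{T}) where {T} =
    get!(() -> build_datatype(T), DATATYPE_CACHE, T)
```

A real cache would also need a lock for thread safety and a policy for closing cached handles when the library shuts down.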

@mkitti (Member) commented Nov 28, 2022:

Should this code be here or should it be in JLD.jl or a third package?

@tamasgal (Contributor, Author):

I am a bit confused about which actions need to be taken in order to merge this. Can I do anything else to make progress?

@mkitti (Member) commented Nov 28, 2022:

I'm leaning towards merging, but I think @musm needs to review.

I think we should consider @simonbyrne's suggestion of a cache, but that could be another pull request.


# These will use finalizers. Close them eagerly to avoid issues.
datatype(::T) where {T} = Datatype(hdf5_type_id(T), true)
datatype(::Type{T}) where {T} = Datatype(hdf5_type_id(T), true)
Member:

Isn't this definition redundant with the method right above it?

Member:

No. The first works on instances of a type. The second works on the type itself.

Member:

We don't seem to need this definition (on L67)

@@ -46,6 +46,12 @@ The default `Dataspace` used for representing a Julia object `data`:
- arrays: a simple `Dataspace`
- `nothing` or an `EmptyArray`: a null dataspace
"""
Member:

Should we update the docs here to include the fact that the new default is a scalar Dataspace for struct types?

@musm (Member) commented Dec 2, 2022:

LGTM sans one minor comment.

Agree the cache sounds interesting, but probably for a separate PR.

@mkitti (Member) commented Dec 6, 2022:

I will see if I can address @musm's comments and then merge.

@musm (Member) commented Dec 6, 2022:

Sounds good!

(Review threads on src/typeconversions.jl and src/dataspaces.jl)
@musm merged commit 205518d into JuliaIO:master on Dec 9, 2022
@tamasgal deleted the write-compound branch on December 9, 2022 09:01
@alhirzel:
Great work! I am wondering if it is possible to do a release soon so that this can be consumed by general users without ]deving the package?

@musm (Member) commented Dec 20, 2022:

done

6 participants