Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-haul and implement a more generic read #652

Merged
merged 18 commits into from
Sep 2, 2020
Merged

Conversation

kleinhenz
Copy link
Contributor

This adds a new generic (not type constrained) read method

read(obj::DatasetOrAttribute, ::Type{T}, I...) where T

which reads a hyperslab of a dataset or an attribute into an array of type T.

It also adds convenience methods

read(parent::Union{HDF5File, HDF5Group}, name_type_pair::Pair{String, DataType}; pv...)
h5read(filename, name_type_pair::Pair{String, DataType}; pv...)

which can be used like v_read = read(h5f, "data"=>Simple) or v_read = h5read(fn, "data"=>Simple) to read
data into an array with element type Simple (see test/custom.jl). This provides the functionality of #559 in a way that is more integrated with the rest of the library than adding a new separate h5read_compound method and closes #169.

Although the function does not have a type constraint I do check that the type is the correct size so this should still prevent you from shooting yourself in the foot too easily. The behavior of the generic read can be customized by defining do_reclaim, do_normalize and normalize_types on your custom type. If we decide to do this, I will write up some documentation for customization.

Internally, I was able to replace a bunch of separate read methods for different types with this one generic method. This works by separating reading from hdf5 into a memory compatible datatype from mapping that data into a native julia type. The first part can be handled generically (HDF5Scalar types are not special) and the second part can be handled by the normalize_types methods which we have been using with compound types and can be extended. I think this is a win in terms of simplification and reducing the number of lines of code. The only type that requires special handling is now HDF5Opaque.

As a side-effect of this hyperslab indexing works generically for everything besides HDF5Opaque. Closes #625.
As another side-effect, vlen strings are now handled uniformly via unsafe_string so this should hopefully close #627.

The rules for mapping from hdf5 to julia types now uniformly follow the conventions we have been using for reading compound types which shouldn't be too breaking but does cause a few changes in behavior. In particular the data

HDF5 "output.h5" {
DATASET "/salut_vlen" {
   DATATYPE  H5T_VLEN { H5T_STRING {
      STRSIZE 1;
      STRPAD H5T_STR_NULLTERM;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }}
   DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
   DATA {
   (0): ("H", "i"), ("t", "h", "e", "r", "e")
   }
}
}

is now read as salut_vlenr = [["H", "i"], ["t", "h", "e", "r", "e"]] instead of ["Hi", "There"]. However, I think the new behavior is actually a more correct interpretation of the datatype since STRSIZE=1. Also now hdf5 strings are read into julia strings even if they are length 1 whereas previously these would have been read in character types. This could be changed if people think that it is important. Other than the salut_vlenr example all other tests pass without modification.

@kleinhenz
Copy link
Contributor Author

@tknopp, @tamasgal would this work for your usecases?

@tamasgal
Copy link
Contributor

Looks good to me!

@musm
Copy link
Member

musm commented Aug 28, 2020

As a side-effect of this hyperslab indexing works generically for everything besides HDF5Opaque. Closes #625.
As another side-effect, vlen strings are now handled uniformly via unsafe_string so this should hopefully close #627.

👍

is now read as salut_vlenr = [["H", "i"], ["t", "h", "e", "r", "e"]] instead of ["Hi", "There"]. However, I think the new behavior is actually a more correct interpretation of the datatype since STRSIZE=1. Also now hdf5 strings are read into julia strings even if they are length 1 whereas previously these would have been read in character types. This could be changed if people think that it is important.

I don't have a strong opinion whether it should a string or character, but I tend to agree with you. Perhaps there is some efficiency lost by it not being a character, when STRSIZE=1. But wouldn't that mean we would need to special case STRSIZE=1, which seems cumbersome/inelegant.

src/HDF5.jl Outdated
@@ -1952,6 +1820,15 @@ function hdf5_to_julia_eltype(objtype)
return T
end

function get_jl_type(objtype)
Copy link
Member

@musm musm Aug 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the point of this method.

The hdf5_to_julia_eltype method
already has defined class_id == H5T_OPAQUE

elseif class_id == H5T_OPAQUE

So it seems that without this method things should just work already.

Edit: NVM that's for hdf5_to_julia_eltype

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this mostly duplicates hdf5_to_julia_eltype but changes the string handling part so there are no special cases for the length=1 case. I wrote a new function instead of changing the original because I wanted to only touch the read part of the library. There is probably some consolidation/cleanup possible.

src/HDF5.jl Outdated
@@ -1952,6 +1820,15 @@ function hdf5_to_julia_eltype(objtype)
return T
end

function get_jl_type(objtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment that this is temporarily special cased for H5T_OPAQUE

src/HDF5.jl Outdated Show resolved Hide resolved
@musm
Copy link
Member

musm commented Aug 28, 2020

massive improvement in simplicity and readability.

@musm
Copy link
Member

musm commented Aug 28, 2020

feel free to merge whenever

src/HDF5.jl Show resolved Hide resolved
src/HDF5.jl Show resolved Hide resolved
@musm
Copy link
Member

musm commented Aug 28, 2020

What should we do with the UTF8Char and ASCIIChar types

@kleinhenz
Copy link
Contributor Author

Cool, if you are overall happy with this approach I can clean things up a bit and write some documentation. I left some stuff that can probably be removed like the hdf5_to_julia_eltype function which mostly but not quite duplicates get_jl_type because I was trying to only touch the read part of the library and some of that is used in the write part. However there is probably at least some cleanup/consolidation possible.

I was actually wrong about the string reading behavior. Fixed length strings with length 1 are already read as strings not characters. The only place where the character types are actually used to change behavior is for reading H5T_VLEN datatypes with fixed string element types. In that case, if STRSIZE=1 then the data is read as an array of strings instead of an array of arrays of characters. This is what is going on in the salut_vlenr example. We could replicate the current behavior by adding a method for normalize_types(x::VariableArray{FixedString}). This is a fairly weird example though since the natural thing to do would be to just use a variable length string type. I need to check if the character types are important for any of the write methods but otherwise I think they can be removed.

@musm
Copy link
Member

musm commented Aug 29, 2020

Cool, if you are overall happy with this approach I can clean things up a bit and write some documentation. I left some stuff that can probably be removed like the hdf5_to_julia_eltype function which mostly but not quite duplicates get_jl_type because I was trying to only touch the read part of the library and some of that is used in the write part. However there is probably at least some cleanup/consolidation possible.

Right now I think is the appropriate time to make such changes in the upcoming breaking release. It would certainly be good to clean up as much of the old cruft and any ugly parts of the code that we can, so I'm in support.

@tknopp
Copy link
Contributor

tknopp commented Aug 30, 2020

@kleinhenz: I have switched to the regular compound reading with named tuples: MagneticResonanceImaging/MRIReco.jl@9b1a219

works fine although there was some drop in performance.

* requires Compat for julia 1.3
@musm
Copy link
Member

musm commented Sep 2, 2020

@kleinhenz LMK if this is g2g, I'm happy with the changes here.

@kleinhenz
Copy link
Contributor Author

yeah I think this is good to go. There is some more cleanup of internals possible but it is only sort of related to the changes here and I don't want to bloat this pull request too much.

@musm musm changed the title RFC: generic read Re-haul and implement a more generic read Sep 2, 2020
@musm musm merged commit 15a521d into JuliaIO:master Sep 2, 2020
jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 9, 2020
jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 9, 2020
jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 9, 2020
jmert added a commit to jmert/HDF5.jl that referenced this pull request Sep 11, 2020
musm pushed a commit that referenced this pull request Sep 14, 2020
* Add deprecated bindings for backwards compatibility with ecosystem packages

Namely, get the tests of JLD.jl and MAT.jl to pass again.

* Deprecations for #632 - keywords instead of property lists

* Deprecations for #652 - generic read

* Deprecate {d,a}_{create,write} methods with property lists

The calls should use keyword properties instead.
jmert added a commit to jmert/HDF5.jl that referenced this pull request Oct 13, 2020
HDF5 has the concept of NULL arrays which are distinct from 0-size
arrays. As shown in JuliaIO#705, writing any 0-size array was being translated
to the NULL object during writing, but having 0-length axes is also
perfectly valid. This PR preserves 0-sizes when writing arrays.

To represent the HDF5 NULL object, there was already the
`HDF5.EmptyArray` type defined, but prior to JuliaIO#652 it appears to have
been used only for internal dispatch on the generic read (which returned
a 0-length vector for NULL objects). Since the generic read overhaul, it
has been unused.

This PR reuses `HDF5.EmptyArray` and elevates it to a full subtype of
`AbstractArray`, which then becomes the returned object for any NULL
datasets/attributes which are read in. This array type carries an
element type but has no contents. This is distinct from Julia's
`Array{T,0}` 0-dimensional arrays because the latter _does_ have a
single element. (I tried just doing duck typing, but you end up playing
whack-a-mole with having to define enough new method definitions that
you might as well just implement the `AbstractArray` interface and take
advantage of its fallbacks once the few methods are defined.)

(One oddity is that Julia's 0-dimensional `Arrays` are turned into HDF5
scalars and will not be read back in as a 0-dimensional array, but (a)
that is the existing behavior, and (b) I haven't thought of any other
way to handle the case.)
jmert added a commit to jmert/HDF5.jl that referenced this pull request Oct 13, 2020
HDF5 has the concept of NULL arrays which are distinct from 0-size
arrays. As shown in JuliaIO#705, writing any 0-size array was being translated
to the NULL object during writing, but having 0-length axes is also
perfectly valid. This PR preserves 0-sizes when writing arrays.

To represent the HDF5 NULL object, there was already the
`HDF5.EmptyArray` type defined, but prior to JuliaIO#652 it appears to have
been used only for internal dispatch on the generic read (which returned
a 0-length vector for NULL objects). Since the generic read overhaul, it
has been unused.

This PR reuses `HDF5.EmptyArray` and elevates it to a full subtype of
`AbstractArray`, which then becomes the returned object for any NULL
datasets/attributes which are read in. This array type carries an
element type but has no contents. This is distinct from Julia's
`Array{T,0}` 0-dimensional arrays because the latter _does_ have a
single element. (I tried just doing duck typing, but you end up playing
whack-a-mole with having to define enough new method definitions that
you might as well just implement the `AbstractArray` interface and take
advantage of its fallbacks once the few methods are defined.)

(One oddity is that Julia's 0-dimensional `Arrays` are turned into HDF5
scalars and will not be read back in as a 0-dimensional array, but (a)
that is the existing behavior, and (b) I haven't thought of any other
way to handle the case.)
@kleinhenz kleinhenz mentioned this pull request Nov 4, 2020
@@ -1991,19 +1874,22 @@ function get_mem_compatible_jl_type(objtype)
finally
h5t_close(super_type)
end
elseif class_id == H5T_REFERENCE
# TODO update to use version 1.12 reference functions/types
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants