Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incompatibility reading boolean data written using pynwb/h5py #206

Closed
bendichter opened this issue Apr 20, 2020 · 7 comments
Closed

incompatibility reading boolean data written using pynwb/h5py #206

bendichter opened this issue Apr 20, 2020 · 7 comments

Comments

@bendichter
Copy link
Contributor

HDF5 does not natively have a boolean format. h5py (and by extension pynwb) handles this by creating an enumerated type with an underlying data type of uint8 that is constrained to 0-1 and maps onto the values "FALSE" and "TRUE" (see documentation). h5py knows to write data this way, and it also knows to convert this type of data automatically to a boolean numpy array on read. However, MATLAB's HDF5 API does not use this convention, and reads the data according to the enumerated type mapping defined in the dataset: "TRUE" and "FALSE". The data is stored efficiently in the HDF5 file, but is not read efficiently by MATLAB, and requires the user to wrangle the data back to boolean form. The pipeline is working as expected on the python side, but we have a bug/incompatibility when boolean data written by PyNWB is read using MatNWB.

@lawrence-mbf
Copy link
Collaborator

Related to #77

@bahanonu
Copy link
Contributor

@bendichter Thanks for raising this issue. Specifically, MATLAB will read a multi-dimensional/nested cell array of TRUE and FALSE (e.g. 512x512x88 cell array of strings).

Fiji appears to have issues reading, but ImageJ loads them fine.

Is the way h5py stores 0-1 uint8 any more efficient (space-wise) after compression that storing it as just a uint8 matrix of 0s and 1s?

@bendichter
Copy link
Contributor Author

bendichter commented Apr 20, 2020

@bahanonu Thanks for the additional context. I think the only advantage of storing them this way as opposed to 0s and 1s in uint8 is that h5py knows to convert these values to boolean on read. If h5py reads a uint8 array, it will keep the values as uint8. I can't imagine it would cause any space savings, because the data itself is stored in uint8 in either case. If anything, h5py's approach takes up slightly more space to define the enumerated type mapping. I don't think there is anything inherently wrong with casting boolean values to uint8, but we should also support read of h5py style booleans because that is the most popular python library for interacting with HDF5.

@bahanonu
Copy link
Contributor

@bendichter Sounds good, makes sense to keep the h5py compatibility in that case.

Are you thinking of including a small function within matnwb repo to convert to a logical array (as opposed to each user writing their own)?

Tested the issue in R with rhdf5 package and that same dataset loads as expected after coercing into a numeric array. Using ophys data in https://gui.dandiarchive.org/#/file-browser/folder/5e834c3b3c4aab7fa53666ad.

@bendichter
Copy link
Contributor Author

Are you thinking of including a small function within matnwb repo to convert to a logical array (as opposed to each user writing their own)?

Yes, that is the goal of this ticket :-)

@bendichter
Copy link
Contributor Author

Maybe the best approach is to use this example code:

fid = H5F.open('example.h5');
dset_id = H5D.open(fid,'/g3/enum');
type_id = H5D.get_type(dset_id);
num_members = H5T.get_nmembers(type_id);
for j = 1:num_members
    member_name{j} = H5T.get_member_name(type_id,j-1);
    member_value(j) = H5T.enum_valueof(type_id,member_name{j});
end

running that on an h5py-style boolean, I get

>> member_name =

  1×2 cell array

    {'FALSE'}    {'TRUE'}

>> member_value

member_value =

  1×2 int8 row vector

   0   1

We could use this as a test and then access the data via H5D.read(dset_id), which outputs uint8 data. Ideally we would then convert that to boolean. But of course we also need to support lazy reading, and if doing both of those is tough we can skip the final conversion for now. It's really just cosmetic anyway.

@lawrence-mbf
Copy link
Collaborator

Fixed via #77

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants