incompatibility reading boolean data written using pynwb/h5py #206

bendichter · 2020-04-20T20:25:58Z

HDF5 does not have a native boolean type. h5py (and, by extension, PyNWB) handles this by creating an enumerated type with an underlying 8-bit integer type that is constrained to 0 and 1, which map onto the names "FALSE" and "TRUE" (see documentation). h5py knows to write booleans this way, and it also knows to convert this type of data back to a boolean numpy array automatically on read. MATLAB's HDF5 API does not follow this convention: it reads the data according to the enumerated type mapping defined in the dataset and returns the member names "TRUE" and "FALSE". The data is stored efficiently in the HDF5 file, but MATLAB does not read it efficiently, and the user has to wrangle the data back into boolean form. The pipeline works as expected on the Python side, but there is a bug/incompatibility when boolean data written by PyNWB is read with MatNWB.
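For illustration, here is a minimal sketch of the behavior described above, assuming h5py and numpy are installed. The file name, dataset name, and the `enum_members` helper are made up for the example. It writes a small boolean array, inspects the enumerated type stored with the dataset through h5py's low-level type API, and reads the data back.

```python
import h5py
import numpy as np


def enum_members(dset):
    """Return {name: value} for an enum-typed dataset, else None (hypothetical helper)."""
    tid = dset.id.get_type()
    if not isinstance(tid, h5py.h5t.TypeEnumID):
        return None
    members = {}
    for i in range(tid.get_nmembers()):
        name = tid.get_member_name(i)
        if isinstance(name, bytes):  # the low-level API may return bytes
            name = name.decode()
        members[name] = int(tid.get_member_value(i))
    return members


# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("bool_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("mask", data=np.array([True, False, True]))
    members = enum_members(dset)  # enum mapping stored with the dataset
    read_back = dset[()]          # h5py converts back to a numpy bool array

print(members)
print(read_back.dtype, read_back.tolist())
```

A reader that does not know the convention sees only the enum member names, which is where the "TRUE"/"FALSE" strings on the MATLAB side come from.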

lawrence-mbf · 2020-04-20T20:33:24Z

Related to #77

bahanonu · 2020-04-20T21:04:41Z

@bendichter Thanks for raising this issue. Specifically, MATLAB will read the data as a multi-dimensional cell array of the strings TRUE and FALSE (e.g. a 512x512x88 cell array of strings).

Fiji appears to have issues reading, but ImageJ loads them fine.

Is the way h5py stores 0-1 uint8 any more efficient (space-wise) after compression than storing it as just a uint8 matrix of 0s and 1s?

bendichter · 2020-04-20T21:08:31Z

@bahanonu Thanks for the additional context. I think the only advantage of storing them this way as opposed to 0s and 1s in uint8 is that h5py knows to convert these values to boolean on read. If h5py reads a uint8 array, it will keep the values as uint8. I can't imagine it would cause any space savings, because the data itself is stored in uint8 in either case. If anything, h5py's approach takes up slightly more space to define the enumerated type mapping. I don't think there is anything inherently wrong with casting boolean values to uint8, but we should also support read of h5py style booleans because that is the most popular python library for interacting with HDF5.
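As a rough check of the storage point, the sketch below (assuming h5py and numpy; the file and dataset names are made up) writes the same flags once as an h5py boolean and once as plain uint8, then compares the bytes allocated for the raw data and the dtype that comes back on read. It looks at uncompressed storage only. Since both datasets hold one byte per value, a compressor would be given the same bytes in either case.

```python
import h5py
import numpy as np

flags = np.arange(10_000) % 2 == 0  # deterministic boolean test data

# In-memory file, so the comparison leaves nothing on disk.
with h5py.File("size_demo.h5", "w", driver="core", backing_store=False) as f:
    as_enum = f.create_dataset("as_enum", data=flags)                     # h5py-style boolean enum
    as_uint8 = f.create_dataset("as_uint8", data=flags.astype(np.uint8))  # plain 0/1 integers

    enum_bytes = as_enum.id.get_storage_size()    # bytes allocated for raw data
    uint8_bytes = as_uint8.id.get_storage_size()

    enum_dtype = as_enum[()].dtype     # converted to bool on read
    uint8_dtype = as_uint8[()].dtype   # stays uint8 on read

print(enum_bytes, uint8_bytes)
print(enum_dtype, uint8_dtype)
```

The storage size reported here covers only the data itself. The enum definition lives in the dataset's metadata, which is the small extra cost mentioned above.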

bahanonu · 2020-04-20T22:28:01Z

@bendichter Sounds good, makes sense to keep the h5py compatibility in that case.

Are you thinking of including a small function within matnwb repo to convert to a logical array (as opposed to each user writing their own)?

I tested the issue in R with the rhdf5 package, and the same dataset loads as expected after coercing it into a numeric array. I used the ophys data in https://gui.dandiarchive.org/#/file-browser/folder/5e834c3b3c4aab7fa53666ad.

bendichter · 2020-04-21T00:01:24Z

> Are you thinking of including a small function within matnwb repo to convert to a logical array (as opposed to each user writing their own)?

Yes, that is the goal of this ticket :-)

bendichter · 2020-04-21T00:14:13Z

Maybe the best approach is to use this example code:

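fid = H5F.open('example.h5');               % opens read-only by default
dset_id = H5D.open(fid, '/g3/enum');
type_id = H5D.get_type(dset_id);            % datatype of the dataset (an enum here)
num_members = H5T.get_nmembers(type_id);
member_name = cell(1, num_members);         % preallocate the name list
for j = 1:num_members
    % HDF5 member indices are zero-based, MATLAB indices are one-based
    member_name{j} = H5T.get_member_name(type_id, j-1);
    member_value(j) = H5T.enum_valueof(type_id, member_name{j});
end
% type_id, dset_id and fid stay open here; close them with H5T.close,
% H5D.close and H5F.close once the data has been read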

Running that on an h5py-style boolean, I get:

>> member_name

member_name =

  1×2 cell array

    {'FALSE'}    {'TRUE'}

>> member_value

member_value =

  1×2 int8 row vector

   0   1

We could use this as a test and then access the data via H5D.read(dset_id), which outputs uint8 data. Ideally we would then convert that to boolean. But of course we also need to support lazy reading, and if doing both of those is tough we can skip the final conversion for now. It's really just cosmetic anyway.
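The test-then-convert idea can be sketched in a language-neutral way. The Python below uses only the standard library and is an illustration of the logic, not the MatNWB implementation. `is_bool_enum`, `LazyBool`, and the sample lists are hypothetical names and stand-ins for the member names, member values, and raw integers obtained from the file. The wrapper converts only the part that is actually indexed, which is one way to keep the conversion compatible with lazy reading.

```python
def is_bool_enum(member_names, member_values):
    """True if the enum members are exactly FALSE -> 0 and TRUE -> 1."""
    mapping = dict(zip(member_names, (int(v) for v in member_values)))
    return mapping == {"FALSE": 0, "TRUE": 1}


class LazyBool:
    """Wrap an indexable source of raw 0/1 integers and convert only what is read."""

    def __init__(self, raw):
        self._raw = raw  # hypothetical stand-in for a lazily read dataset

    def __getitem__(self, key):
        chunk = self._raw[key]  # only the requested part is fetched
        if isinstance(chunk, (list, tuple)):
            return [bool(v) for v in chunk]
        return bool(chunk)


names, values = ["FALSE", "TRUE"], [0, 1]  # as reported by the enum inspection
raw = [0, 1, 1, 0, 1]                      # stand-in for the stored integers

# Wrap the data only when the enum matches the h5py boolean convention.
data = LazyBool(raw) if is_bool_enum(names, values) else raw
first_three = data[:3]
print(first_three)
```

Enums with any other member set fall through unchanged, so ordinary enumerated datasets are not silently turned into booleans.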

lawrence-mbf · 2023-05-16T14:05:49Z

Fixed via #77

lawrence-mbf closed this as completed May 16, 2023
