Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default object_codec_class -> JSON #173

Closed
wants to merge 5 commits into from
Closed

default object_codec_class -> JSON #173

wants to merge 5 commits into from

Conversation

magland
Copy link

@magland magland commented Mar 5, 2024

Motivation

See #171

I was working on adding nwb-zarr support to neurosift and I ran into a problem where many of the datasets within the zarr were pickle-encoded. It is difficult to read pickled data in languages other than Python, and it is practically impossible using JavaScript in the browser. So I would like to request that hdmf-zarr be updated to avoid using pickling when writing datasets.

In this PR I adjusted the default object_codec_class to numcodecs.JSON.

In addition, I also exposed this parameter in the constructor of NWBZarrIO

Fix #171

How to test the behavior?

I ran the following with my forked version, and verified that the resulting zarr opens properly in Neurosift:

import os
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO
import numcodecs

filename = 'sub-anm00239123_ses-20170627T093549_ecephys+ogen.nwb'

def main():
    zarr_filename = 'test.zarr'
    with NWBHDF5IO(path=filename, mode='r', load_namespaces=False) as read_io:  # Create HDF5 IO object for read
        with NWBZarrIO(path=zarr_filename, mode='w') as export_io:         # Create Zarr IO object for write
            export_io.export(src_io=read_io, write_args=dict(link_data=False))   # Export from HDF5 to Zarr

if __name__ == "__main__":
    main()

Checklist

Note: I wasn't sure how to update CHANGELOG.md -- I mean where to put the new tet

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

Copy link
Contributor

@oruebel oruebel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oruebel
Copy link
Contributor

oruebel commented Mar 5, 2024

In this PR I adjusted the default object_codec_class to numcodecs.JSON.

I think the main reason I didn't use JSON at the time was because of cases where objects are not JSON serializable. However, I don't remember whether that was an actual issue I encountered or whether I was just being overly cautious and used Pickle because I knew it would work. But I think if the test suite passes we should be Ok.

@magland
Copy link
Author

magland commented Mar 5, 2024

Could you also please add a line to the CHANGELOG https://github.com/hdmf-dev/hdmf-zarr/blob/dev/CHANGELOG.md

Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

@oruebel
Copy link
Contributor

oruebel commented Mar 5, 2024

Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released

@magland
Copy link
Author

magland commented Mar 5, 2024

Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released

Done.

@magland
Copy link
Author

magland commented Mar 5, 2024

Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved.

@oruebel
Copy link
Contributor

oruebel commented Mar 5, 2024

Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved.

It looks like there are two main types of errors related to the JSON encoding:

  1. TypeError: Object of type void is not JSON serializable --> This appears in 5 failing tests, which all seem to be related to writing ZarrWriteUnit, i.e., I think these are likely all the same issue.
  2. AttributeError: 'int' object has no attribute 'append' --> This appears in 18 failing tests, and as far as I can tell they seem to be all related to writing compound dataset with references, more specifically, these I believe are writing a compound data type that is a mix of strings and int, similar to TimeSeriesReferenceVectorData in NWB.

@oruebel
Copy link
Contributor

oruebel commented Mar 5, 2024

2. AttributeError: 'int' object has no attribute 'append'

The point of failure for this error appears to be in all cases in the write_dataset function at:

dset = parent.require_dataset(
name,
shape=(len(arr),),
dtype=dtype,
object_codec=self.__codec_cls(),
**options['io_settings']
)

#146 updated the writing of compound datasets to fix the determination of the dtype and may be relevant here.

@magland
Copy link
Author

magland commented Mar 5, 2024

Thanks, I'm working on a potential solution.

@magland
Copy link
Author

magland commented Mar 6, 2024

@oruebel I have pushed a solution where it resorts to a Pickle codec in certain (hopefully rare) cases, and prints a warning.

The cases are: (a) structured array and (b) compound dtype with embedded references

For the purposes of neurosift, I am okay with this, because it seems to be a rare occurrence. But I wasn't sure whether you wanted to either suppress the warnings or try to serialize this in a different way.

@oruebel
Copy link
Contributor

oruebel commented Mar 8, 2024

Thanks a lot! I think it would be great to see if we can use JSON for those cases too. The TimeSeriesReferenceVectorData is probably the main case in NWB. This is a compound of two ints and a reference. References are stored as a jSOn string, so I think those should be serializable with JSON too.

@mavaylon1 could you take a look at this PR to see if we can use JSON serialization for the compound data case that @magland mentioned as well. It would be good to be consistent in how we serialize objects. I think thus is related to the recent PR on compound data types.

@oruebel oruebel requested a review from mavaylon1 March 8, 2024 03:12
@magland
Copy link
Author

magland commented Mar 8, 2024

I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references.

000713 - Allen Institute - Visual Behavior - Neuropixels

https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/&dandisetId=000713&dandisetVersion=draft

https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/

intervals/grating_presentations/timeseries

[[3, 135, <HDF5 object reference>], [138, 348, <HDF5 object reference>], [486, 392, <HDF5 object reference>], [878, 474, <HDF5 object reference>], [1352, 373, <HDF5 object reference>], [1725, 394, <HDF5 object reference>], [2119, 428, <HDF5 object reference>], [2547, 363, <HDF5 object reference>], [2910, 396, <HDF5 object reference>], [3306, 430, <HDF5 object reference>], [3736, 405, <HDF5 object reference>], [4141, 400, <HDF5 object reference>], [4541, 495, <HDF5 object reference>], [5036, 372, <HDF5 object reference>], [5408, 528, <HDF5 object reference>], [5936, 369, <HDF5 object reference>], [6305, 358, <HDF5 object reference>], [6663, 347, <HDF5 object reference>], [7010, 401, <HDF5 object reference>], [7411, 372, <HDF5 object reference>], [7783, 441, <HDF5 object reference>], [8224, 478, <HDF5 object reference>], [8702, 428, <HDF5 object reference>], [9130, 490, <HDF5 object reference>], [9620, 426, <HDF5 object reference>], [10046, 642, <HDF5 object reference>], [10688, 454, <HDF5 object reference>], [11142, 366, <HDF5 object reference>], [11508, 570, <HDF5 object reference>], [12078, 594, <HDF5 object reference>], [12672, 387, <HDF5 object reference>], [13059, 484, <HDF5 object reference>], [13543, 586, <HDF5 object reference>], [14129, 627, <HDF5 object reference>], [14756, 374, <HDF5 object reference>], [15130, 390, <HDF5 object reference>], [15520, 362, <HDF5 object reference>], [15882, 414, <HDF5 object reference>], [16296, 624, <HDF5 object reference>], [16920, 353, <HDF5 object reference>], [17273, 474, <HDF5 object reference>], [17747, 416, <HDF5 object reference>], [18163, 413, <HDF5 object reference>], [18576, 453, <HDF5 object reference>], [19029, 355, <HDF5 object reference>], [19384, 369, <HDF5 object reference>], [19753, 478, <HDF5 object reference>], [20231, 499, <HDF5 object reference>], [20730, 391, <HDF5 object reference>], [21121, 419, <HDF5 object reference>], [21540, 476, <HDF5 object reference>], [22016, 377, <HDF5 object reference>], [22393, 393, <HDF5 object reference>], [22786, 593, <HDF5 object reference>], [23379, 414, <HDF5 object reference>], [23793, 385, <HDF5 object reference>], [24178, 477, <HDF5 object reference>], [24655, 566, <HDF5 object reference>], [25221, 363, <HDF5 object reference>], [25584, 510, <HDF5 object reference>], [26094, 633, <HDF5 object reference>], [26727, 651, <HDF5 object reference>], [27378, 598, <HDF5 object reference>], [27976, 351, <HDF5 object reference>], [28327, 465, <HDF5 object reference>], [28792, 347, <HDF5 object reference>], [29139, 458, <HDF5 object reference>], [29597, 408, <HDF5 object reference>], [30005, 534, <HDF5 object reference>], [30539, 368, <HDF5 object reference>], [30907, 411, <HDF5 object reference>], [31318, 683, <HDF5 object reference>], [32001, 516, <HDF5 object reference>], [32517, 532, <HDF5 object reference>], [33049, 379, <HDF5 object reference>], [33428, 395, <HDF5 object reference>], [33823, 413, <HDF5 object reference>], [34236, 485, <HDF5 object reference>], [34721, 457, <HDF5 object reference>], [35178, 386, <HDF5 object reference>], [35564, 554, <HDF5 object reference>], [36118, 473, <HDF5 object reference>], [36591, 505, <HDF5 object reference>], [37096, 353, <HDF5 object reference>], [37449, 388, <HDF5 object reference>], [37837, 355, <HDF5 object reference>], [38192, 361, <HDF5 object reference>], [38553, 380, <HDF5 object reference>], [38933, 512, <HDF5 object reference>], [39445, 403, <HDF5 object reference>], [39848, 544, <HDF5 object reference>], [40392, 351, <HDF5 object reference>], [40743, 440, <HDF5 object reference>], [41183, 387, <HDF5 object reference>], [41570, 359, <HDF5 object reference>], [41929, 485, <HDF5 object reference>], [42414, 429, <HDF5 object reference>], [42843, 404, <HDF5 object reference>], [43247, 417, <HDF5 object reference>], [43664, 458, <HDF5 object reference>], [44122, 443, <HDF5 object reference>], [44565, 422, <HDF5 object reference>], [44987, 410, <HDF5 object reference>], [45397, 510, <HDF5 object reference>], [45907, 526, <HDF5 object reference>], [46433, 638, <HDF5 object reference>], [47071, 447, <HDF5 object reference>], [47518, 456, <HDF5 object reference>], [47974, 352, <HDF5 object reference>], [48326, 493, <HDF5 object reference>], [48819, 392, <HDF5 object reference>], [49211, 420, <HDF5 object reference>], [49631, 391, <HDF5 object reference>], [50022, 377, <HDF5 object reference>], [50399, 366, <HDF5 object reference>], [50765, 428, <HDF5 object reference>], [51193, 351, <HDF5 object reference>], [51544, 356, <HDF5 object reference>], [51900, 347, <HDF5 object reference>], [52247, 656, <HDF5 object reference>], [52903, 427, <HDF5 object reference>], [53330, 430, <HDF5 object reference>], [53760, 210, <HDF5 object reference>]]

@oruebel
Copy link
Contributor

oruebel commented Mar 8, 2024

I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references.

Yes, that is a TimeSeriesReferenceVectorData https://nwb-schema.readthedocs.io/en/stable/format.html#timeseriesreferencevectordata which is a compound dataset which has a reference to a TimeSeries and then two ints with the start and stop index.

@oruebel
Copy link
Contributor

oruebel commented Mar 9, 2024

I found this particularly challenging example in the wild.

Note, in Zarr, the object reference is being represented as a JSON, so in Zarr the dtype for TimeSeriesReferenceVectorData ends up being a (int, int, str). hdmf-zarr then deals with translating the SON string part back to resolve the reference. This is done lazily on read. I.e., from a Zarr perspective, I think this datatype of (int, int, str) should in principle still be serializable in JSON.

@codecov-commenter
Copy link

codecov-commenter commented Mar 9, 2024

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 86.07%. Comparing base (556ed12) to head (29adcca).

Files Patch % Lines
src/hdmf_zarr/backend.py 88.88% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #173      +/-   ##
==========================================
+ Coverage   86.04%   86.07%   +0.03%     
==========================================
  Files           5        5              
  Lines        1161     1178      +17     
  Branches      296      303       +7     
==========================================
+ Hits          999     1014      +15     
  Misses        107      107              
- Partials       55       57       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +1281 to +1286
for c in np.ndindex(data_shape):
o = data
for i in c:
o = o[i]
if isinstance(o, np.void) and o.dtype.names is not None:
has_structured_array = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems pretty slow, iterating through every value in the array to check type. in structured arrays, wouldn't you just have to check the dtype of the whole array? in any case shouldn't we break the first time we get a True?

Copy link
Contributor

@sneakers-the-rat sneakers-the-rat Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also couldn't this switch happen here:

for substype in dtype.fields.items():
if np.issubdtype(substype[1][0], np.flexible) or np.issubdtype(substype[1][0], np.object_):
dtype = object
io_settings['object_codec'] = self.__codec_cls()
break
?

@sneakers-the-rat
Copy link
Contributor

Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?

currently there's:

def object_codec_class(self):

the logic of when to use which codec is split in a few places, but it seems to ultimately depend on builder.dtype:

this check here:

if not isinstance(object_codec, numcodecs.Pickle):

has its roots here:
options['dtype'] = builder.dtype

and similarly this check:

has_structured_array = False

has its roots here:
dtype = options.get('dtype')

called from here:
dset = self.__list_fill__(parent, name, data, options)

which is just the other leg of the above check, and L1276-1291 is very similar to L1037-1050.

write_builder and the other write_* methods are almost private methods since the docs only demonstrate using write(), but the builder is passed through them rather than stored as an instance attr (presumably since it gets remade on each write()? not really sure) so we can't assume we have it and use the above object_codec_class property, but it looks like every time we get __codec_cls we also have the builder or the options dict that gets constructed from it in scope, so couldn't we do something like get_codec_cls(dtype) ?

not my package, but when i was working on this before i had left this note:

# TODO: Replace with a simple one-liner once __resolve_dtype_helper__ is
# compatible with zarr's need for fixed-length string dtypes.
# dtype = self.__resolve_dtype_helper__(options['dtype'])
, which seems to be out of date bc we're using objects rather than converting to strings (presumably the root if the issue here), and so it also seems like we can consolidate the several places where we create and check a dtype into that __resolve_dtype_helper__ method which could then make get_codec_cls() well defined as well.

@oruebel
Copy link
Contributor

oruebel commented Mar 16, 2024

@sneakers-the-rat some more details below, but to make a long story short, my hope is that we can:

  1. Avoid having to use a mix of different codecs in the same file and as such avoid the need for logic to switch between codecs if possible
  2. If we really need to use multiple codecs, then:
    • Yes, we should encapsulate the logic for selecting codes in a function, as you proposed
    • We should remove the object_codec_class parameter on the ZarrIO.__init__ method and not allow users to manually specify the codec to use (this is probably a good idea in general to avoid having to deal with to many corner cases)

@magland @sneakers-the-rat sorry that I have not had the chance to dive into the code for this PR yet, but those are things I hope we can do before merging this PR, i.e., try to do 1. if possible and implement 2. if we can't make 1. work.

the logic of when to use which codec is split in a few places

TBH, my hope is that we can avoid having to use different codecs if possible. The use of numcodecs.Pickle was convenient, because it is pretty reliable in handling even complex cases, however, it makes it hard for non-Python codes (e.g., from JavaScript in the case of neurosift) to use the Zarr files directly. numcodecs.JSON makes that easier, but JSON is not as flexible as Pickle in terms of the cases it handles. The main issue seems to be structured arrays. However, in most cases those are a combination of ints and strings in our case, so my hope is that we can handle those in JSON as well, and only fall back to Pickle when absolutely necessary.

Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?

Currently the backend uses a single codec throughout, so a getter was not need, but if we really need to use multiple codes (which, again, I ideally would like to avoid if possible), then I totally agree that encapsulating this logic in a method is a good idea. We probably, then also need to get rid of object_codec_class as a parameter on __init__ since we need to control when to use what codec

{'name': 'object_codec_class', 'type': None,
'doc': 'Set the numcodec object codec class to be used to encode objects.'
'Use numcodecs.pickles.Pickle by default.',

so couldn't we do something like get_codec_cls(dtype)

Yes, I believe that should work. The object_codec_class property is mainly there since the codec is currently configurable on __init__.

@mavaylon1
Copy link
Contributor

@oruebel I'll dive in this week.

@mavaylon1
Copy link
Contributor

Recap and Updates:
We ideally only want one codec. I've been trying to get the compound case for references to work but I've ran into some issues. Using the dev branch of numcodecs, the problem is decoding the data. I made an issue ticket on the numcodec github and will try to find a fix. zarr-developers/numcodecs#518

@mavaylon1
Copy link
Contributor

I have an "almost" proof of concept for this. I will share this week if it holds up with more exploration.

@magland magland closed this by deleting the head repository Sep 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature]: Write zarr without using pickle
5 participants