
default object_codec_class -> JSON #173

Closed · wants to merge 5 commits

Conversation

@magland commented Mar 5, 2024

Motivation

See #171

I was working on adding nwb-zarr support to neurosift and I ran into a problem where many of the datasets within the zarr were pickle-encoded. It is difficult to read pickled data in languages other than Python, and it is practically impossible using JavaScript in the browser. So I would like to request that hdmf-zarr be updated to avoid using pickling when writing datasets.

In this PR I adjusted the default object_codec_class to numcodecs.JSON.

I also exposed this parameter in the constructor of NWBZarrIO.

Fix #171
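
For reference, a minimal usage sketch of the exposed parameter (assuming NWBZarrIO forwards ZarrIO's `object_codec_class` keyword):

```python
import numcodecs
from hdmf_zarr.nwb import NWBZarrIO

# JSON is now the default object codec; it can also be set explicitly:
with NWBZarrIO(path='test.zarr', mode='w', object_codec_class=numcodecs.JSON) as io:
    ...  # write/export as usual
```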

How to test the behavior?

I ran the following with my forked version, and verified that the resulting zarr opens properly in Neurosift:

```python
import os
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO
import numcodecs

filename = 'sub-anm00239123_ses-20170627T093549_ecephys+ogen.nwb'

def main():
    zarr_filename = 'test.zarr'
    with NWBHDF5IO(path=filename, mode='r', load_namespaces=False) as read_io:   # Create HDF5 IO object for read
        with NWBZarrIO(path=zarr_filename, mode='w') as export_io:               # Create Zarr IO object for write
            export_io.export(src_io=read_io, write_args=dict(link_data=False))   # Export from HDF5 to Zarr

if __name__ == "__main__":
    main()
```

Checklist

Note: I wasn't sure how to update CHANGELOG.md -- I mean where to put the new text.

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked by running ruff from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.


@oruebel (Contributor) commented Mar 5, 2024

> In this PR I adjusted the default object_codec_class to numcodecs.JSON.

I think the main reason I didn't use JSON at the time was because of cases where objects are not JSON serializable. However, I don't remember whether that was an actual issue I encountered or whether I was just being overly cautious and used Pickle because I knew it would work. But I think if the test suite passes we should be OK.

@magland (Author) commented Mar 5, 2024

> Could you also please add a line to the CHANGELOG https://github.com/hdmf-dev/hdmf-zarr/blob/dev/CHANGELOG.md

Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

@oruebel (Contributor) commented Mar 5, 2024

> Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released

@magland (Author) commented Mar 5, 2024

> Do you want me to add it under the 0.6.0 heading, or make a new heading? (0.6.1, 0.7.0)

> Please add it under a new heading "0.7.0 (upcoming)" since 0.6.0 has already been released

Done.

@magland (Author) commented Mar 5, 2024

Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved.

@oruebel (Contributor) commented Mar 5, 2024

> Now that I made the actual modification, some tests are failing on my fork. So I guess the process is going to be more involved.

It looks like there are two main types of errors related to the JSON encoding:

  1. `TypeError: Object of type void is not JSON serializable` --> This appears in 5 failing tests, which all seem to be related to writing ZarrWriteUnit, i.e., I think these are likely all the same issue (see the sketch below).
  2. `AttributeError: 'int' object has no attribute 'append'` --> This appears in 18 failing tests, and as far as I can tell they all seem to be related to writing compound datasets with references; more specifically, I believe these are writing a compound data type that is a mix of strings and ints, similar to TimeSeriesReferenceVectorData in NWB.
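
For illustration, here is a minimal way error 1 can arise, assuming the offending values are `np.void` rows stored inside an object-dtype array (the case the structured-array check added in this PR looks for):

```python
import numpy as np
import numcodecs

# A structured scalar (np.void) placed inside an object-dtype array:
structured = np.array([(1, 2)], dtype=[('a', 'i4'), ('b', 'i4')])
obj_arr = np.empty(1, dtype=object)
obj_arr[0] = structured[0]

# numcodecs.JSON serializes via the stdlib json encoder, which cannot
# handle np.void scalars:
numcodecs.JSON().encode(obj_arr)
# TypeError: Object of type void is not JSON serializable
```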

@oruebel (Contributor) commented Mar 5, 2024

> 2. `AttributeError: 'int' object has no attribute 'append'`

The point of failure for this error appears in all cases to be in the `write_dataset` function, at:

```python
dset = parent.require_dataset(
    name,
    shape=(len(arr),),
    dtype=dtype,
    object_codec=self.__codec_cls(),
    **options['io_settings']
)
```

#146 updated the writing of compound datasets to fix the determination of the dtype and may be relevant here.

@magland (Author) commented Mar 5, 2024

Thanks, I'm working on a potential solution.

@magland (Author) commented Mar 6, 2024

@oruebel I have pushed a solution where it resorts to a Pickle codec in certain (hopefully rare) cases, and prints a warning (sketched below).

The cases are: (a) structured arrays and (b) compound dtypes with embedded references.

For the purposes of neurosift, I am okay with this, because it seems to be a rare occurrence. But I wasn't sure whether you would rather suppress the warnings or try to serialize these cases in a different way.
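
Roughly, the fallback behavior described above, as a sketch only (the function name and arguments here are illustrative, not the PR's actual code):

```python
import warnings
import numcodecs

def choose_object_codec(has_structured_array: bool, has_compound_refs: bool):
    # Fall back to Pickle only for the two hard cases, and warn because
    # the result will not be readable outside Python.
    if has_structured_array or has_compound_refs:
        warnings.warn(
            "Using numcodecs.Pickle for this dataset; it may not be "
            "readable from non-Python tools."
        )
        return numcodecs.Pickle()
    return numcodecs.JSON()
```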

@oruebel (Contributor) commented Mar 8, 2024

Thanks a lot! I think it would be great to see if we can use JSON for those cases too. TimeSeriesReferenceVectorData is probably the main case in NWB: it is a compound of two ints and a reference. References are stored as a JSON string, so I think those should be serializable with JSON too.

@mavaylon1 could you take a look at this PR to see if we can use JSON serialization for the compound data case that @magland mentioned as well? It would be good to be consistent in how we serialize objects. I think this is related to the recent PR on compound data types.

@oruebel requested a review from mavaylon1 on Mar 8, 2024
@magland (Author) commented Mar 8, 2024

I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references.

000713 - Allen Institute - Visual Behavior - Neuropixels

https://neurosift.app/?p=/nwb&url=https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/&dandisetId=000713&dandisetVersion=draft

https://api.dandiarchive.org/api/assets/b2391922-c9a6-43f9-8b92-043be4015e56/download/

intervals/grating_presentations/timeseries

```
[[3, 135, <HDF5 object reference>],
 [138, 348, <HDF5 object reference>],
 [486, 392, <HDF5 object reference>],
 [878, 474, <HDF5 object reference>],
 [1352, 373, <HDF5 object reference>],
 ...
 [53330, 430, <HDF5 object reference>],
 [53760, 210, <HDF5 object reference>]]
```

(truncated: the full dataset continues in the same [int, int, reference] pattern)

@oruebel (Contributor) commented Mar 8, 2024

> I found this particularly challenging example in the wild. It seems it has a dataset that is a list of lists with embedded references.

Yes, that is a TimeSeriesReferenceVectorData (https://nwb-schema.readthedocs.io/en/stable/format.html#timeseriesreferencevectordata), a compound dataset that has a reference to a TimeSeries and two ints with the start and stop index.

@oruebel (Contributor) commented Mar 9, 2024

> I found this particularly challenging example in the wild.

Note, in Zarr the object reference is represented as a JSON string, so in Zarr the dtype for TimeSeriesReferenceVectorData ends up being (int, int, str). hdmf-zarr then deals with translating the JSON string part back to resolve the reference; this is done lazily on read. I.e., from a Zarr perspective, I think this datatype of (int, int, str) should in principle still be serializable in JSON.
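
As a quick sanity check (with a hypothetical reference string), a row of that shape is JSON-serializable on its own:

```python
import json

row = [3, 135, '{"source": ".", "path": "/acquisition/some_timeseries"}']
json.dumps(row)  # works: two ints plus a JSON-encoded reference string
```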

@codecov-commenter commented Mar 9, 2024

Codecov Report

Attention: Patch coverage is 89.47368%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 86.07%. Comparing base (556ed12) to head (29adcca).

| Files | Patch % | Lines |
| --- | --- | --- |
| src/hdmf_zarr/backend.py | 88.88% | 0 Missing and 2 partials ⚠️ |
Additional details and impacted files:

```
@@            Coverage Diff             @@
##              dev     #173      +/-   ##
==========================================
+ Coverage   86.04%   86.07%   +0.03%     
==========================================
  Files           5        5              
  Lines        1161     1178      +17     
  Branches      296      303       +7     
==========================================
+ Hits          999     1014      +15     
  Misses        107      107              
- Partials       55       57       +2     
```


Comment on lines +1281 to +1286:

```python
for c in np.ndindex(data_shape):
    o = data
    for i in c:
        o = o[i]
    if isinstance(o, np.void) and o.dtype.names is not None:
        has_structured_array = True
```
A reviewer (Contributor) commented:
this seems pretty slow, iterating through every value in the array to check type. in structured arrays, wouldn't you just have to check the dtype of the whole array? in any case shouldn't we break the first time we get a True?

@sneakers-the-rat (Contributor) commented Mar 15, 2024

also couldn't this switch happen here?

```python
for substype in dtype.fields.items():
    if np.issubdtype(substype[1][0], np.flexible) or np.issubdtype(substype[1][0], np.object_):
        dtype = object
        io_settings['object_codec'] = self.__codec_cls()
        break
```

@sneakers-the-rat (Contributor) commented:

Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?

currently there's:

```python
def object_codec_class(self):
```

the logic of when to use which codec is split in a few places, but it seems to ultimately depend on `builder.dtype`:

this check here:

```python
if not isinstance(object_codec, numcodecs.Pickle):
```

has its roots here:

```python
options['dtype'] = builder.dtype
```

and similarly this check:

```python
has_structured_array = False
```

has its roots here:

```python
dtype = options.get('dtype')
```

called from here:

```python
dset = self.__list_fill__(parent, name, data, options)
```

which is just the other leg of the above check, and L1276-1291 is very similar to L1037-1050.

`write_builder` and the other `write_*` methods are almost private methods, since the docs only demonstrate using `write()`, but the builder is passed through them rather than stored as an instance attr (presumably since it gets remade on each `write()`? not really sure), so we can't assume we have it and use the above `object_codec_class` property. But it looks like every time we get `__codec_cls` we also have the builder, or the options dict that gets constructed from it, in scope, so couldn't we do something like `get_codec_cls(dtype)`?

not my package, but when i was working on this before i had left this note:

```python
# TODO: Replace with a simple one-liner once __resolve_dtype_helper__ is
#       compatible with zarr's need for fixed-length string dtypes.
# dtype = self.__resolve_dtype_helper__(options['dtype'])
```

which seems to be out of date because we're using objects rather than converting to strings (presumably the root of the issue here), and so it also seems like we can consolidate the several places where we create and check a dtype into that `__resolve_dtype_helper__` method, which could then make `get_codec_cls()` well defined as well.
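
A sketch of the proposed consolidation (hypothetical: `get_codec_cls` and its dispatch rule are illustrative, not existing hdmf-zarr code):

```python
import numcodecs
import numpy as np

def get_codec_cls(dtype):
    """Pick the object codec class from the dtype being written."""
    # JSON wherever possible; Pickle only for structured dtypes, the
    # case the thread identifies as not JSON-serializable.
    if isinstance(dtype, np.dtype) and dtype.names is not None:
        return numcodecs.Pickle
    return numcodecs.JSON
```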

@oruebel (Contributor) commented Mar 16, 2024

@sneakers-the-rat some more details below, but to make a long story short, my hope is that we can:

  1. Avoid having to use a mix of different codecs in the same file and as such avoid the need for logic to switch between codecs if possible
  2. If we really need to use multiple codecs, then:
    • Yes, we should encapsulate the logic for selecting codecs in a function, as you proposed
    • We should remove the object_codec_class parameter on the ZarrIO.__init__ method and not allow users to manually specify the codec to use (this is probably a good idea in general, to avoid having to deal with too many corner cases)

@magland @sneakers-the-rat sorry that I have not had the chance to dive into the code for this PR yet, but those are things I hope we can do before merging this PR, i.e., try to do 1. if possible and implement 2. if we can't make 1. work.

> the logic of when to use which codec is split in a few places

TBH, my hope is that we can avoid having to use different codecs if possible. The use of numcodecs.Pickle was convenient because it is pretty reliable in handling even complex cases; however, it makes it hard for non-Python code (e.g., JavaScript in the case of neurosift) to use the Zarr files directly. numcodecs.JSON makes that easier, but JSON is not as flexible as Pickle in terms of the cases it handles. The main issue seems to be structured arrays. However, in our case those are mostly a combination of ints and strings, so my hope is that we can handle those in JSON as well, and only fall back to Pickle when absolutely necessary.
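
One possible route for the int/string structured case, as a sketch (not the package's code): flatten rows to plain Python lists before JSON encoding.

```python
import json
import numpy as np

arr = np.array([(0, 10, 'a'), (10, 20, 'b')],
               dtype=[('start', 'i4'), ('stop', 'i4'), ('ref', 'O')])
# tolist() yields tuples of native Python scalars, which json handles:
json.dumps(arr.tolist())  # '[[0, 10, "a"], [10, 20, "b"]]'
```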

> Can we turn the codec class into a getter method so we can encapsulate all switching logic in one place?

Currently the backend uses a single codec throughout, so a getter was not needed, but if we really need to use multiple codecs (which, again, I would ideally like to avoid if possible), then I totally agree that encapsulating this logic in a method is a good idea. We probably then also need to get rid of object_codec_class as a parameter on __init__, since we need to control when to use which codec:

```python
{'name': 'object_codec_class', 'type': None,
 'doc': 'Set the numcodec object codec class to be used to encode objects. '
        'Use numcodecs.pickles.Pickle by default.',
```

> so couldn't we do something like get_codec_cls(dtype)

Yes, I believe that should work. The object_codec_class property is mainly there because the codec is currently configurable on __init__.

@mavaylon1 (Contributor) commented:

@oruebel I'll dive in this week.

@mavaylon1 (Contributor) commented:

Recap and Updates:
We ideally want only one codec. I've been trying to get the compound case for references to work, but I've run into some issues. Using the dev branch of numcodecs, the problem is decoding the data. I made an issue ticket on the numcodecs GitHub and will try to find a fix: zarr-developers/numcodecs#518
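
A minimal round-trip that appears to reproduce the decode problem, assuming the compound rows are stored as an object array of [int, int, str] lists:

```python
import numcodecs
import numpy as np

rows = np.empty(2, dtype=object)
rows[0] = [0, 10, '{"path": "/acquisition/ts"}']
rows[1] = [10, 20, '{"path": "/acquisition/ts"}']

codec = numcodecs.JSON()
encoded = codec.encode(rows)  # encoding succeeds
codec.decode(encoded)         # fails: the nested lists cannot be
                              # broadcast back into a 1-D object array
```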

@mavaylon1 (Contributor) commented:

I have an "almost" proof of concept for this. I will share it this week if it holds up under more exploration.

@magland closed this by deleting the head repository on Sep 15, 2024.
Successfully merging this pull request may close these issues:

[Feature]: Write zarr without using pickle (#171)