[Feature]: Add Compression support for compound objects #1041

Open
pauladkisson opened this issue Aug 29, 2024 · 10 comments

Comments

@pauladkisson
Member

pauladkisson commented Aug 29, 2024

What would you like to see added to NeuroConv?

DatasetIOConfiguration.from_neurodata_object currently raises a NotImplementedError for compound objects with dtype 'object'.

This makes it very difficult to use the KiloSortSortingInterface, which includes a compound object for unit quality ('good' or 'noise').

Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.
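
For reference, a minimal, untested sketch of the kind of data that triggers this; the object-dtype column stands in for what the KiloSortSortingInterface produces, and the final call is NeuroConv's backend-configuration entry point:

import numpy as np
from hdmf.common import DynamicTable
from pynwb.testing.mock.file import mock_NWBFile
from neuroconv.tools.nwb_helpers import get_default_backend_configuration

nwbfile = mock_NWBFile()

# A units-like quality column of Python strings ends up with dtype "object".
table = DynamicTable(name="table", description="desc")
table.add_row()
table.add_row()
table.add_column(name="quality", description="unit quality", data=np.asarray(["good", "noise"], dtype=object))
nwbfile.add_acquisition(table)

# DatasetIOConfiguration.from_neurodata_object currently raises NotImplementedError here.
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")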

Do you have any interest in helping implement the feature?

Yes.


@h-mayorquin
Collaborator

> Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.

Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear in dynamic tables, where I think the effect is not that large.

> This makes it very difficult to use the KiloSortSortingInterface, which includes a compound object for unit quality ('good' or 'noise').

Question about this for my own learning: shouldn't those be strings and not objects?

https://hdmf-schema-language.readthedocs.io/en/latest/description.html#dtype

@pauladkisson
Member Author

> Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear in dynamic tables where I think the effect is not that large.

The cost of leaving them uncompressed isn't the problem. The problem is that get_default_backend_configuration runs by default in run_conversion, which throws an error. Plus, it is actually not so easy to specify no compression, since you have to iterate through the neurodata objects yourself to avoid the error.

To be honest, it might be worthwhile to be able to specify no compression at a higher level, in addition to addressing this specific error.
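
For context, a rough, untested sketch (based on the backend-configuration workflow described in the NeuroConv docs; treat the attribute names as illustrative) of what opting out of compression currently involves. Note that for a file containing an object-dtype dataset even the first call below raises, which is the error in question:

import numpy as np
from pynwb import TimeSeries
from pynwb.testing.mock.file import mock_NWBFile
from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

nwbfile = mock_NWBFile()
nwbfile.add_acquisition(TimeSeries(name="ts", data=np.arange(1000.0), unit="n.a.", rate=1.0))

# Raises NotImplementedError if any dataset has a compound/object dtype.
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")

# There is no single "no compression" switch; each dataset configuration is edited by hand.
for dataset_configuration in backend_configuration.dataset_configurations.values():
    dataset_configuration.compression_method = None

configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)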

@pauladkisson
Member Author

> Question about this for my own learning: shouldn't those be strings and not objects?

pandas defaults to object for strings for backwards compatibility (https://pandas.pydata.org/docs/user_guide/text.html). And spikeinterface's BasePhyKilosortSortingExtractor uses pd.read_csv to get the unit info: https://github.com/SpikeInterface/spikeinterface/blob/161efc4ac4fd710fcf17cc836da82ac5c3e63c5a/src/spikeinterface/extractors/phykilosortextractors.py#L75
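
A quick illustration (the TSV content below is made up, but mirrors a Phy cluster_group.tsv): pd.read_csv hands string columns back as object dtype, which is what then reaches the dataset configuration:

import io

import pandas as pd

# Illustrative stand-in for a Phy/Kilosort cluster_group.tsv
tsv = io.StringIO("cluster_id\tgroup\n0\tgood\n1\tnoise\n")
df = pd.read_csv(tsv, sep="\t")

print(df["group"].dtype)             # object
print(df["group"].to_numpy().dtype)  # object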

@h-mayorquin
Collaborator

Regarding the Phy/Kilosort case: this is an easy one that we can modify on the fly here or in spikeinterface.

But in general, maybe we should allow compression settings for strings. The problem would be differentiating which arrays of dtype object are really strings and which are not. Do we have examples of object-dtype arrays that are not strings? I think things like groups, but we would need to verify.
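
One possible check (a hypothetical helper, just to sketch the idea) for telling apart object-dtype arrays that are really strings:

import numpy as np


def is_all_strings(array: np.ndarray) -> bool:
    """Return True for an object-dtype array whose elements are all plain strings."""
    return array.dtype == np.dtype("object") and all(isinstance(element, str) for element in array.ravel())


print(is_all_strings(np.asarray(["good", "noise"], dtype=object)))       # True
print(is_all_strings(np.asarray([dict(x=1), dict(y=2)], dtype=object)))  # False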

@h-mayorquin
Collaborator

OK, the Phy/Kilosort problem is fixed and merged, and should be included in the upcoming release of SpikeInterface:
SpikeInterface/spikeinterface#3365

This issue can be separated into three main cases:

  1. Arrays of strings that have object as a dtype. This is the case that started this issue. Here, I think we can just add a check that all the elements are strings and then reuse the same logic as here (a sketch of that unified version follows this list):

    elif dtype != np.dtype("object"):
        chunk_shape = SliceableDataChunkIterator.estimate_default_chunk_shape(
            chunk_mb=10.0, maxshape=full_shape, dtype=np.dtype(dtype)
        )
        buffer_shape = SliceableDataChunkIterator.estimate_default_buffer_shape(
            buffer_gb=0.5, chunk_shape=chunk_shape, maxshape=full_shape, dtype=np.dtype(dtype)
        )
        compression_method = "gzip"

(we are already doing that for arrays of numpy type "U", so it should be unified)

  2. Arrays that contain a collection of heterogeneous basic types: np.array(["string", 3.0, 1]). I think the solution of @pauladkisson in Added chunking/compression for string-only objects #1042 works for this: no compression and a chunk as big as the data (for lack of a better option), plus a warning.

  3. Arrays that contain Python objects: np.array([dict(x=1), dict(y=2), dict(z=3)]). Those we can't even write, so maybe we just pass them through untouched and let pynwb handle the failure?
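
A sketch of what that unified logic for case 1 could look like (assuming SliceableDataChunkIterator is importable from neuroconv.tools.hdmf; the cast to a fixed-width unicode dtype is just one way to get a nonzero itemsize for the estimators, and the helper name is made up):

import numpy as np
from neuroconv.tools.hdmf import SliceableDataChunkIterator  # assumed import path


def estimate_for_string_object_array(candidate_dataset: np.ndarray, full_shape: tuple):
    # Cast the object-dtype strings to a fixed-width unicode dtype (e.g. <U5) so the
    # existing chunk/buffer estimators can be reused with a nonzero itemsize.
    fixed_width_dtype = np.asarray(candidate_dataset.tolist()).dtype
    chunk_shape = SliceableDataChunkIterator.estimate_default_chunk_shape(
        chunk_mb=10.0, maxshape=full_shape, dtype=fixed_width_dtype
    )
    buffer_shape = SliceableDataChunkIterator.estimate_default_buffer_shape(
        buffer_gb=0.5, chunk_shape=chunk_shape, maxshape=full_shape, dtype=fixed_width_dtype
    )
    return chunk_shape, buffer_shape, "gzip"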

Examples:
@CodyCBakerPhD
Arrays with different basic types.

import numpy as np
from hdmf.common import DynamicTable
from pynwb import NWBHDF5IO, TimeSeries
from pynwb.testing.mock.file import mock_NWBFile

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()

# Heterogeneous basic types in a single array
data = np.asarray(["this", 35.0, 1])
dynamic_table.add_column(name="stuff", description="description", data=data)

time_series = TimeSeries(name="ts", data=data.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)
nwbfile.add_acquisition(time_series)

with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)

io = NWBHDF5IO("test.nwb", mode="r")
nwbfile_read = io.read()

Arrays with python objects (this does not work):

import numpy as np
from hdmf.common import DynamicTable
from pynwb import NWBHDF5IO, TimeSeries
from pynwb.testing.mock.file import mock_NWBFile


class MyObject:
    def __init__(self, name, value):
        self.name = name
        self.value = value


obj1 = MyObject("Object1", 10)
obj2 = MyObject("Object2", 20)
obj3 = MyObject("Object3", 30)

# NumPy arrays containing arbitrary Python objects (neither variant works)
object_array = np.array([obj1, obj2, obj3], dtype=object)
object_array = np.array([dict(x=1), dict(y=2), dict(z=3)])

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()

dynamic_table.add_column(name="stuff", description="description", data=object_array.copy())

# time_series = TimeSeries(name="data", data=object_array.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)
# nwbfile.add_acquisition(time_series)  # also fails if the TimeSeries above is enabled

# Write the NWBFile to a file (this is where the failure occurs)
with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)

@pauladkisson
Member Author

Sounds good!

I can take 1&2 in #1042

3 should probably be a separate PR?

@pauladkisson
Member Author

pauladkisson commented Sep 6, 2024

Actually, numpy strings have itemsize 0, so the estimate methods fail.

> we are already doing that for arrays of numpy type "U" so it should be unified

How did you deal with the itemsize problem in those cases?

@h-mayorquin
Collaborator

Can you show me the code that is failing? Maybe I misunderstood, but I just ran:

import numpy as np

x = np.asarray(["a", "b", "c"])
x.dtype.itemsize, x.dtype, np.__version__
# (4, dtype('<U1'), '1.26.4')

Yeah, for 3 I would just ask the pynwb team; probably we don't need to do anything about it...

@pauladkisson
Member Author

I see now.

I was specifying dtype="U", which doesn't have an itemsize, but if numpy infers the dtype from the strings themselves it will use a fixed size (<U1, <U2, <U3, etc.).
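
In other words (a small illustration, nothing NeuroConv-specific):

import numpy as np

# A bare "U" dtype has no fixed width, so its itemsize is 0 ...
print(np.dtype("U").itemsize)  # 0

# ... whereas a width inferred from the data gives a fixed-size dtype
# (4 bytes per character, so <U5 -> itemsize 20).
x = np.asarray(["good", "noise", "mua"])
print(x.dtype, x.dtype.itemsize)  # <U5 20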

@h-mayorquin
Copy link
Collaborator

Ah, that's a catch.
