[Feature]: Add Compression support for compound objects #1041

Open
pauladkisson opened this issue Aug 29, 2024 · 10 comments

Comments

@pauladkisson
Member

pauladkisson commented Aug 29, 2024

What would you like to see added to NeuroConv?

DatasetIOConfiguration.from_neurodata_object currently raises a NotImplementedError for compound objects with dtype 'object'.

This makes it very difficult to use the KiloSortSortingInterface, which includes a compound object for unit quality ('good' or 'noise').

Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.
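
For reference, a minimal, untested sketch of the kind of data that triggers this; the object-dtype column stands in for what the KiloSortSortingInterface produces, and the final call is NeuroConv's backend-configuration entry point:

import numpy as np
from hdmf.common import DynamicTable
from pynwb.testing.mock.file import mock_NWBFile
from neuroconv.tools.nwb_helpers import get_default_backend_configuration

nwbfile = mock_NWBFile()

# A units-like quality column of Python strings ends up with dtype "object".
table = DynamicTable(name="table", description="desc")
table.add_row()
table.add_row()
table.add_column(name="quality", description="unit quality", data=np.asarray(["good", "noise"], dtype=object))
nwbfile.add_acquisition(table)

# DatasetIOConfiguration.from_neurodata_object currently raises NotImplementedError here.
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")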

Do you have any interest in helping implement the feature?

Yes.


@h-mayorquin
Collaborator

> Some chunking/compression should be selected by default, maybe with a warning that it might be sub-optimal for compound objects.

Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear in dynamic tables, where I think the effect is not that large.

> This makes it very difficult to use the KiloSortSortingInterface, which includes a compound object for unit quality ('good' or 'noise').

Question about this for my own learning: shouldn't those be strings and not objects?

https://hdmf-schema-language.readthedocs.io/en/latest/description.html#dtype

@pauladkisson
Member Author

> Why? How are you thinking about this? Is the cost of leaving them uncompressed so high? They usually appear in dynamic tables where I think the effect is not that large.

The cost of leaving them uncompressed isn't the problem. The problem is that get_default_backend_configuration runs by default in run_conversion, which throws an error. Plus, it is actually not so easy to specify no compression, since you have to iterate through the neurodata objects yourself to avoid the error.

To be honest, it might be worthwhile to be able to specify no compression at a higher level, in addition to addressing this specific error.
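
For context, a rough, untested sketch (based on the backend-configuration workflow described in the NeuroConv docs; treat the attribute names as illustrative) of what opting out of compression currently involves. Note that for a file containing an object-dtype dataset even the first call below raises, which is the error in question:

import numpy as np
from pynwb import TimeSeries
from pynwb.testing.mock.file import mock_NWBFile
from neuroconv.tools.nwb_helpers import configure_backend, get_default_backend_configuration

nwbfile = mock_NWBFile()
nwbfile.add_acquisition(TimeSeries(name="ts", data=np.arange(1000.0), unit="n.a.", rate=1.0))

# Raises NotImplementedError if any dataset has a compound/object dtype.
backend_configuration = get_default_backend_configuration(nwbfile=nwbfile, backend="hdf5")

# There is no single "no compression" switch; each dataset configuration is edited by hand.
for dataset_configuration in backend_configuration.dataset_configurations.values():
    dataset_configuration.compression_method = None

configure_backend(nwbfile=nwbfile, backend_configuration=backend_configuration)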

@pauladkisson
Member Author

> Question about this for my own learning: shouldn't those be strings and not objects?

pandas defaults to object for strings for backwards compatibility (https://pandas.pydata.org/docs/user_guide/text.html). And spikeinterface's BasePhyKilosortSortingExtractor uses pd.read_csv to get the unit info: https://github.com/SpikeInterface/spikeinterface/blob/161efc4ac4fd710fcf17cc836da82ac5c3e63c5a/src/spikeinterface/extractors/phykilosortextractors.py#L75
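
A quick illustration (the TSV content below is made up, but mirrors a Phy cluster_group.tsv): pd.read_csv hands string columns back as object dtype, which is what then reaches the dataset configuration:

import io

import pandas as pd

# Illustrative stand-in for a Phy/Kilosort cluster_group.tsv
tsv = io.StringIO("cluster_id\tgroup\n0\tgood\n1\tnoise\n")
df = pd.read_csv(tsv, sep="\t")

print(df["group"].dtype)             # object
print(df["group"].to_numpy().dtype)  # object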

@h-mayorquin
Collaborator

Regarding the Phy/Kilosort case: this is an easy one that we can modify on the fly here or in spikeinterface.

But in general, maybe we should allow compression settings for strings. The problem would be differentiating which arrays of dtype object are really strings and which are not. Do we have examples of object-dtype arrays that are not strings? I think things like groups, but we would need to verify.
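
One possible check (a hypothetical helper, just to sketch the idea) for telling apart object-dtype arrays that are really strings:

import numpy as np


def is_all_strings(array: np.ndarray) -> bool:
    """Return True for an object-dtype array whose elements are all plain strings."""
    return array.dtype == np.dtype("object") and all(isinstance(element, str) for element in array.ravel())


print(is_all_strings(np.asarray(["good", "noise"], dtype=object)))       # True
print(is_all_strings(np.asarray([dict(x=1), dict(y=2)], dtype=object)))  # False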

@h-mayorquin
Collaborator

OK, the Phy/Kilosort problem is fixed and merged, and should be included in the upcoming release of SpikeInterface:
SpikeInterface/spikeinterface#3365

This issue can be separated into three main cases:

  1. Arrays of strings that have object as a dtype. This is the case that started this issue. Here, I think we can just add a check that all the elements are strings and then reuse the same logic as here (a sketch of that unified version follows this list):

    elif dtype != np.dtype("object"):
        chunk_shape = SliceableDataChunkIterator.estimate_default_chunk_shape(
            chunk_mb=10.0, maxshape=full_shape, dtype=np.dtype(dtype)
        )
        buffer_shape = SliceableDataChunkIterator.estimate_default_buffer_shape(
            buffer_gb=0.5, chunk_shape=chunk_shape, maxshape=full_shape, dtype=np.dtype(dtype)
        )
        compression_method = "gzip"

(we are already doing that for arrays of numpy type "U", so it should be unified)

  2. Arrays that contain a collection of heterogeneous basic types: np.array(["string", 3.0, 1]). I think the solution of @pauladkisson in Added chunking/compression for string-only objects #1042 works for this: no compression and a chunk as big as the data (for lack of a better option), plus a warning.

  3. Arrays that contain Python objects: np.array([dict(x=1), dict(y=2), dict(z=3)]). Those we can't even write, so maybe we just pass them through untouched and let pynwb handle the failure?
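
A sketch of what that unified logic for case 1 could look like (assuming SliceableDataChunkIterator is importable from neuroconv.tools.hdmf; the cast to a fixed-width unicode dtype is just one way to get a nonzero itemsize for the estimators, and the helper name is made up):

import numpy as np
from neuroconv.tools.hdmf import SliceableDataChunkIterator  # assumed import path


def estimate_for_string_object_array(candidate_dataset: np.ndarray, full_shape: tuple):
    # Cast the object-dtype strings to a fixed-width unicode dtype (e.g. <U5) so the
    # existing chunk/buffer estimators can be reused with a nonzero itemsize.
    fixed_width_dtype = np.asarray(candidate_dataset.tolist()).dtype
    chunk_shape = SliceableDataChunkIterator.estimate_default_chunk_shape(
        chunk_mb=10.0, maxshape=full_shape, dtype=fixed_width_dtype
    )
    buffer_shape = SliceableDataChunkIterator.estimate_default_buffer_shape(
        buffer_gb=0.5, chunk_shape=chunk_shape, maxshape=full_shape, dtype=fixed_width_dtype
    )
    return chunk_shape, buffer_shape, "gzip"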

Examples:
@CodyCBakerPhD
Arrays with different basic types.

import numpy as np
from hdmf.common import DynamicTable
from pynwb import NWBHDF5IO, TimeSeries
from pynwb.testing.mock.file import mock_NWBFile

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()

# Heterogeneous basic types in a single array
data = np.asarray(["this", 35.0, 1])
dynamic_table.add_column(name="stuff", description="description", data=data)

time_series = TimeSeries(name="ts", data=data.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)
nwbfile.add_acquisition(time_series)

with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)

io = NWBHDF5IO("test.nwb", mode="r")
nwbfile_read = io.read()

Arrays with python objects (this does not work):

import numpy as np
from hdmf.common import DynamicTable
from pynwb import NWBHDF5IO, TimeSeries
from pynwb.testing.mock.file import mock_NWBFile


class MyObject:
    def __init__(self, name, value):
        self.name = name
        self.value = value


obj1 = MyObject("Object1", 10)
obj2 = MyObject("Object2", 20)
obj3 = MyObject("Object3", 30)

# NumPy arrays containing arbitrary Python objects (neither variant works)
object_array = np.array([obj1, obj2, obj3], dtype=object)
object_array = np.array([dict(x=1), dict(y=2), dict(z=3)])

nwbfile = mock_NWBFile()

dynamic_table = DynamicTable(name="table", description="desc")
dynamic_table.add_row()
dynamic_table.add_row()
dynamic_table.add_row()

dynamic_table.add_column(name="stuff", description="description", data=object_array.copy())

# time_series = TimeSeries(name="data", data=object_array.copy(), unit="n.a.", rate=1.0)

nwbfile.add_acquisition(dynamic_table)
# nwbfile.add_acquisition(time_series)  # also fails if the TimeSeries above is enabled

# Write the NWBFile to a file (this is where the failure occurs)
with NWBHDF5IO("test.nwb", mode="w") as io:
    io.write(nwbfile)

@pauladkisson
Member Author

Sounds good!

I can take 1&2 in #1042

3 should probably be a separate PR?

@pauladkisson
Member Author

pauladkisson commented Sep 6, 2024

Actually, numpy strings have itemsize 0, so the estimate methods fail.

> we are already doing that for arrays of numpy type "U" so it should be unified

How did you deal with the itemsize problem in those cases?

@h-mayorquin
Collaborator

Can you show me the code that is failing? Maybe I misunderstood, but I just ran:

import numpy as np

x = np.asarray(["a", "b", "c"])
x.dtype.itemsize, x.dtype, np.__version__
# (4, dtype('<U1'), '1.26.4')

Yeah, for 3 I would just ask the pynwb team; probably we don't need to do anything about it...

@pauladkisson
Member Author

I see now.

I was specifying dtype="U", which doesn't have an itemsize, but if numpy infers the dtype from the strings themselves it will use a fixed size (<U1, <U2, <U3, etc.).
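
In other words (a small illustration, nothing NeuroConv-specific):

import numpy as np

# A bare "U" dtype has no fixed width, so its itemsize is 0 ...
print(np.dtype("U").itemsize)  # 0

# ... whereas a width inferred from the data gives a fixed-size dtype
# (4 bytes per character, so <U5 -> itemsize 20).
x = np.asarray(["good", "noise", "mua"])
print(x.dtype, x.dtype.itemsize)  # <U5 20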

@h-mayorquin
Copy link
Collaborator

Ah, that's a catch.
