Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically migrate TimeIntervals.timeseries to use TimeSeriesReferenceVectorData #1390

Merged
merged 24 commits into from
May 17, 2022

Conversation

oruebel
Copy link
Contributor

@oruebel oruebel commented Aug 11, 2021

Motivation

Fix NeurodataWithoutBorders/nwb-schema#484 . This PR automatically migrates TimeIntervals.timeseries column to use the new TimeSeriesReferenceVectorData container class. It does so by patching the __class__ at runtime in the TimeIntervalsMap ObjectMapper class when the TimeIntervals object is being built on read. Since TimeIntervals.timeseries and TimeSeriesReferenceVectorData have the same schema and memory layout, swapping the class (i.e., object type) at runtime, after-the-fact should be fine.

Advantages

  • does not require changes to the Container classes as all changes are transparently handled in the ObjectMapper before the user Containers are set up.
  • leads to consistent API behavior for old and new files when accessing data

Disadvantages

  • Relies on monkey patching the objects after-the-fact

Major changes

  • Update the ObjectMapper for TimeIntervals to update the timeseries column on read
  • Update schema
  • Add tests for original schema (that use VectorData for TimeIntervals.timeseries)
  • Add tutorial for using TimeIntervals and TimeSeriesReferenceVectorData and TimeSeriesReference

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Have you checked our Contributing document?
  • Have you ensured the PR clearly describes the problem and the solution?
  • Is your contribution compliant with our coding style? This can be checked running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

@oruebel oruebel requested a review from rly August 11, 2021 13:06
@codecov
Copy link

codecov bot commented Aug 11, 2021

Codecov Report

Merging #1390 (faaac10) into dev (3b886ca) will increase coverage by 0.70%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev    #1390      +/-   ##
==========================================
+ Coverage   77.47%   78.18%   +0.70%     
==========================================
  Files          37       37              
  Lines        2735     2759      +24     
  Branches      455      461       +6     
==========================================
+ Hits         2119     2157      +38     
+ Misses        535      522      -13     
+ Partials       81       80       -1     
Impacted Files Coverage Δ
src/pynwb/base.py 94.97% <100.00%> (+0.16%) ⬆️
src/pynwb/epoch.py 88.46% <100.00%> (+5.76%) ⬆️
src/pynwb/io/epoch.py 100.00% <100.00%> (ø)
src/pynwb/file.py 87.62% <0.00%> (+1.26%) ⬆️
src/pynwb/io/base.py 58.62% <0.00%> (+2.29%) ⬆️
src/pynwb/__init__.py 76.64% <0.00%> (+2.91%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b886ca...faaac10. Read the comment docs.

@rly
Copy link
Contributor

rly commented Aug 11, 2021

Interesting approach. I'm surprised that changing __class__ of an object is allowed, but I guess Python allows you to do anything.

A couple comments:

  • The migration would only happen when timeintervals.timeseries or timeintervals['timeseries'] is called but not when timeintervals.columns or timeintervals[0, 'timeseries'] are called and not when any future changes to how columns can be accessed are called.
  • Can the migration be done in __init__?
  • If an old file is read with tuples, then would timeintervals.timeseries return a TimeSeriesReferenceVectorData object containing tuples?

@oruebel
Copy link
Contributor Author

oruebel commented Aug 11, 2021

Interesting approach. I'm surprised that changing __class__ of an object is allowed, but I guess Python allows you to do anything.

Yeah, Python allows you to do a lot of strange things that usually shouldn't be done ;-) It works in this case mainly because VectorData and TimeSeriesReferenceVectorData are essentially the same aside from changing behavior of a few instance methods. https://stackoverflow.com/a/42137440

  • The migration would only happen when timeintervals.timeseries or timeintervals['timeseries'] is called but not when

Good point.

  • Can the migration be done in __init__?

Yes. That is how I originally did it in this commit fe87879 Maybe the better place would be to do this in the ObjectMapper when getting the constructor arguments. In this way, the whole build-process should already be done and the swap just happens right before the TimeIntervals object is being instantiated.

  • If an old file is read with tuples, then would timeintervals.timeseries return a TimeSeriesReferenceVectorData object containing tuples?

I don't think this is an issue. The datatypes in HDF5 are identical for TimeIntervals.timeseries and TimeSeriesReferenceVectorData and the conversion to TimeSeriesReference objects happen when you slice into the column, so the behavior for "old" vs. "new" files should be identical.

@rly
Copy link
Contributor

rly commented Aug 11, 2021

Yeah, I agree - the better place to do this would be in the ObjectMapper when getting the constructor arguments. Do you want to tackle that or do you want me to?

@oruebel
Copy link
Contributor Author

oruebel commented Aug 11, 2021

Do you want to tackle that or do you want me to?

I can give it a try. We can always scrap this PR if we decide on another option. I just wanted to see if this "creative" way of changing types could work.

@oruebel
Copy link
Contributor Author

oruebel commented Aug 11, 2021

I think it may be simpler to just do it in __init__. In fact, if we update the schema, then I think the only time we should have to migrate the type is on read of old files anyways and that will always go through setting the columns constructor argument, so we shouldn't have to worry about functions like, add_column, add_interval, __getitem__ etc. as those are not relevant when reading a file and once the column type has been migrated they should function the same for old and new files.

For what it's worth, here my attempt at doing the migration in the ObjectMapper. I didn't see a way to get the values for the constructor arguments that are not overwritten. It looks like this is all handled in the ObjectMapper.construct function directly. The following should work to migrate the column in the objectmapper but its a bit ugly because it uses private functions from the ObjectMapper and requires iterating overall subspecs (which isn't too terrible but doesn't seem very clean)

@register_map(TimeIntervals)
class TimeIntervalsMap(DynamicTableMap):
    pass

    @ObjectMapper.constructor_arg('columns')
    def columns(self, builder, manager):
        # This goes through all the subspecs, but we really just need the VectorData.
        # Is there a better way to get all the VectorData containers?
        subspecs = self._ObjectMapper__get_subspec_values(builder, self.spec, manager)
        vectordata_cols = []
        for subspec, value in subspecs.items():
            const_arg = self.get_const_arg(subspec)
            if const_arg is not None and const_arg == 'columns':
               vectordata_cols  += value
        # Find the timeseries column and update it if necessary
        for v in vectordata_cols:
            if v.name == 'timeseries':
                if not isinstance(v, TimeSeriesReferenceVectorData):
                    v.__class__ = TimeSeriesReferenceVectorData
        return vectordata_cols

@rly
Copy link
Contributor

rly commented Aug 12, 2021

That approach isn't terrible, especially if we make some changes to the ObjectMapper to make certain functions not private. I should have pointed you to my previous two approaches here: bef1be8 (#1274)

I think we can combine yours and mine. In my approach, I also updated the builders just in case that was needed later for write/append but I'm not sure that is necessary. So something like:

@DynamicTableMap.constructor_arg('columns')
def columns_carg(self, builder, manager):
    # handle case when a TimeIntervals is read with a non-TimeSeriesReferenceVectorData "timeseries" column
    timeseries_builder = builder.get('timeseries')
    if timeseries_builder.attributes['neurodata_type'] != 'TimeSeriesReferenceVectorData':
        # override builder attributes
        timeseries_builder.attributes['neurodata_type'] = 'TimeSeriesReferenceVectorData'
        timeseries_builder.attributes['namespace'] = 'core'
        # construct new columns list
        columns = list()
        for dset_builder in builder.datasets.values():
            dset_obj = manager.construct(dset_builder)  # these have already been constructed
            # go through only the column datasets and replace the 'timeseries' column class in-place
            if isinstance(dset_obj, VectorData):
                if dset_obj.name == 'timeseries':
                    dset_obj.__class__ = TimeSeriesReferenceVectorData
                columns.append(dset_obj)

        return columns

    return None  # do not override

@oruebel
Copy link
Contributor Author

oruebel commented Aug 12, 2021

30a4a75 updates the PR to do the conversion in TimeIntervals.__init__ only. I believe this works with old and new schema.

I think we can combine yours and mine.

Sure, I think that should work. Generally speaking, doing this in the ObjectMapper seems slightly trickier to do but I think its the more appropriate place for this sort of thing, rather than cluttering user-facing classes with backwards compatibility code. One thing I noticed is that it seems that DynamicTable.from_dataframe currently may not use theclass defined for the column. As such, calling TimeIntervals.from_dataframe currently triggers migration of the timeseries column in __init__ even when updating the schema to use TimeSeriesReferenceColumn for TimeIntervals.timeseries. I.e., if we fix this in the ObjectMapper then we'll likely also need to update DynamicTable.from_dataframe(but that seems like a bug that should be fixed anyways).

@oruebel
Copy link
Contributor Author

oruebel commented Aug 12, 2021

I think we can combine yours and mine.

@rly I updated the code as you suggested to do the migration of the builders and container in the ObjectMapper only. After a few minor fixes (the code failed when TimeIntervals.timeseries was missing), this seems to work fine. Thanks for your patience in iterating on possible solutions, but I think what we have here now seems like an elegant solution.

it seems that DynamicTable.from_dataframe currently may not use theclass

I stand corrected this seems to work correctly and I added an assert in the unit tests to make sure.

@rly You mentioned at some point that you had a test to also test agains previous files that use VectorData. Can you add that test to this PR. I've tested agains both the updated and current schema and it seems fine, but I've done this only by changing the schema by hand. It would be good to have an actual test for this.

@oruebel oruebel marked this pull request as ready for review August 12, 2021 23:54
@oruebel
Copy link
Contributor Author

oruebel commented Aug 13, 2021

This is good to go.

rly
rly previously approved these changes May 17, 2022
@rly rly merged commit c9ea3f2 into dev May 17, 2022
@rly rly deleted the enh/migrate_timeintervals_col branch May 17, 2022 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update TimeIntervals to use the new TimeSeriesReferenceVectorData
2 participants