Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New "associations" metadata #431

Closed
oesteban opened this issue Apr 8, 2019 · 10 comments · Fixed by #432
Closed

New "associations" metadata #431

oesteban opened this issue Apr 8, 2019 · 10 comments · Fixed by #432

Comments

@oesteban
Copy link
Collaborator

oesteban commented Apr 8, 2019

Add a new attribute to BIDSFile with a list of files associated to this particular BIDSFile. Files listed here are, for instance:

  • Events files and physio files corresponding to a particular functional run
  • bvecs/bvals tables corresponding to a dwi image
  • Any valid path with IntendedFor metadata pointing to this hit (e.g. fieldmaps).

The idea would be to minimize subsequent queries to the layout, and having a quick glimpse at the involvement of this hit in the context of the full dataset.

If this feature can be time consuming, then it could be protected by a return_associations argument to get, which would be False by default.

What do you guys think? I could work on this if you see it useful.

Proposed associations

  • BOLD runs:

    • events, stim, physio.
    • fieldmaps/EPIs with IntendedFor pointers
    • SBRefs with matching names (i.e. sub-01_sbref.nii.gz would match all tasks, although this is unclear in the specs at the moment) or IntendedFor field (this IntendedFor is not in the spec at the moment, but I would submit a PR to have it).
    • other echoes of ME-EPI
  • DWI runs:

    • bvecs and bvals
    • Matching B0s
    • Matching SBRefs
    • fieldmaps/EPIs with IntendedFor pointers
  • Multiecho anats:

    • Matching echoes.
@tyarkoni
Copy link
Collaborator

tyarkoni commented Apr 8, 2019

I've been meaning to add something like this for a while, so let's definitely keep this open. Before going down this road though, I want to (a) refactor the variables module to use an OO architecture (where each type of file has its own Reader class) and (b) add a BIDSVariableFile layer in between a BIDSFile and a BIDSVariable that provides simple access to the contents via pandas DFs (i.e., before all the extra processing it takes to extract BIDSVariable objects).

@oesteban
Copy link
Collaborator Author

oesteban commented Apr 8, 2019

That sounds pretty sensible to me. A couple of questions:

Is (a) already in the works somewhere? I'd love to take a look and maybe try to help with that.

For (b), is it really necessary to have BIDSVariableFiles before working on this extension to the interface?

@tyarkoni
Copy link
Collaborator

tyarkoni commented Apr 8, 2019

No, haven't started on (a) at all. You're welcome to take a crack at that if you want—my top priority right now is refactoring the layout module to use a database for storage and querying instead of native Python dicts. Once that's done, I'll probably move on to cleaning on the I/O.

For (b), no, I don't think that would hold up implementing an .associations field. It's just that I suspect the refactoring of the I/O module will end up implementing that anyway. For that matter, if you want to implement a temporary solution without waiting on the I/O changes, that's fine also.

Just so I can keep this in mind when working on the I/O stuff, what would you want .associations to return? A list of filenames? Pandas DataFrames? BIDSFiles or BIDSVariableFiles? If we go with the first option, I think it would be handy to also have a way of returning the others, so maybe we can implement a .get_associations() method that allows specification of return_type, which would nicely mirror the BIDSLayout.get() API. Thoughts?

@effigies
Copy link
Collaborator

effigies commented Apr 9, 2019

Could we add a parameter to BIDSLayout.get to restrict the query to files associated with the argument? e.g.,

>>> layout.get(suffix=['epi', 'fmap', 'phasediff'])
['sub-01/fmap/sub-01_dir-AP_epi.nii.gz',
 'sub-02/fmap/sub-02_dir-AP_epi.nii.gz',
 'sub-03/fmap/sub-03_dir-AP_epi.nii.gz']
>>> layout.get(suffix=['epi', 'fmap', 'phasediff'],
...            associated='sub-01/func/sub-01_task-rest_bold.nii.gz')
['sub-01/fmap/sub-01_dir-AP_epi.nii.gz']

That would reuse the existing return_type infrastructure.

@tyarkoni
Copy link
Collaborator

tyarkoni commented Apr 9, 2019

That's certainly possible—and will be easier to implement efficiently once we're using a database. It's probably worth getting clear on exactly how we're defining "associated". E.g., if there's an events.tsv in the project root, should that be associated with each individual run that inherits from it? Do we include the JSON sidecars in the list? Etc...

@tyarkoni
Copy link
Collaborator

tyarkoni commented Apr 10, 2019

@oesteban okay now that I have a mostly working SQLAlchemy-backed branch of pybids (the variable ingestion stuff is currently broken, but all the core layout tests pass), I've thought about this a bit more, and I think it will actually be quite easy to implement using the DB, without needing to worry (yet) about all the I/O stuff I mentioned above. It's straightforward to treat this purely at a database level. I'll probably just create a FileAssociation join table that links any two files. Then there will be a single mapping function that executes immediately after the current indexing process is done, adding all associations between files. These will be queryable via a bidirectional .associations property on the model.

It would be helpful if you could flesh out as many of the association-defining rules you can think of. The main source of potential ambiguity/problems I can see is inheritance—it feels like it could get messy if we associate a root-level file with 600 files that inherit from it. On the other hand, it would still be good to register this association at DB level, as that would make walking up the tree unnecessary past the initial indexing process. So maybe we can have a closed set of relationship types or indicators, with one binary column being, say, inherited, where True would imply that the file is only associated via inheritance, and False means there's a direct relationship.

@oesteban
Copy link
Collaborator Author

I've edited my original post to include associations I've thought so far. Please feel free to edit it and update/fix it in place

tyarkoni added a commit that referenced this issue Apr 11, 2019
@tyarkoni
Copy link
Collaborator

@oesteban thanks!

The interface I've implemented for this lets you pass a kind argument to BIDSFile.get_associations() that optionally limits the kind(s) of associations you want to retrieve. This graph is directed, so we can distinguish between, say, 'MetadataIn' and 'MetadataFor'. Feel free to suggest a vocabulary above, otherwise I'll exercise my own judgment and we can revise later. (Tentatively, I'm thinking of something like ["ChildOf", "ParentOf", "IntendedFor", "ReceivesFrom", "MetadataFor", and "MetadataFrom"]. We can also define hierarchies, so that "IntendedFor" includes "MetadataFor".)

@tyarkoni tyarkoni mentioned this issue Apr 20, 2019
@tyarkoni
Copy link
Collaborator

tyarkoni commented Apr 20, 2019

@oesteban when you get a chance, please try out the sqlalchemy branch and see if the get_associations() implementation meets your needs. I suspect I'm not tracking all of the associations you listed just yet, but most of them should be there. Also, any thoughts on the above question (i.e., how to categorize associations) would be helpful.

Note: I also added an include_parents flag (False by default) that allows you to include files that are associated purely due to inheritance. By default, only direct associations will be returned—e.g., a file's JSON sidecar, but not any JSON files that file inherits from.

@oesteban
Copy link
Collaborator Author

oesteban commented Aug 5, 2019

Hi @tyarkoni, sorry for my slow response testing this. I've found a use case that needs revision:

>>> bold = layout.get(suffix='bold', extension=['.nii', '.nii.gz'])
>>> # Get associations of the ``phasediff`` informer (index [-1])
>>> bold[0].get_associations(kind='InformedBy')[-1].get_associations()
[<BIDSJSONFile filename='/home/oesteban/Data/sdcflows-tests/testdata/sub-HCP101006/fmap/sub-HCP101006_phasediff.json'>,
 <BIDSImageFile filename='/home/oesteban/Data/sdcflows-tests/testdata/sub-HCP101006/func/sub-HCP101006_task-rest_dir-LR_bold.nii.gz'>,
 <BIDSImageFile filename='/home/oesteban/Data/sdcflows-tests/testdata/sub-HCP101006/func/sub-HCP101006_task-rest_dir-LR_sbref.nii.gz'>,
 <BIDSImageFile filename='/home/oesteban/Data/sdcflows-tests/testdata/sub-HCP101006/func/sub-HCP101006_task-rest_dir-RL_bold.nii.gz'>,
 <BIDSImageFile filename='/home/oesteban/Data/sdcflows-tests/testdata/sub-HCP101006/func/sub-HCP101006_task-rest_dir-RL_sbref.nii.gz'>]

What I'm missing here is one <BIDSJSONFile filename='/home/oesteban/Data/sdcflows-tests/testdata/sub-HCP101006/fmap/sub-HCP101006_magnitude.nii.gz'>, probably of kind "MetadataFrom" or "InformedBy" ?

That would allow reconstructing the whole metadata graph to set up distortion correction:

bold -> phasediff (via IntendedFor) -> magnitude{1,2,} (via hierarchy)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants