Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search space ambiguity in cases where keywords can occur in either entities or metadata #398

Closed
tyarkoni opened this issue Feb 27, 2019 · 5 comments · Fixed by #432
Closed

Comments

@tyarkoni
Copy link
Collaborator

tyarkoni commented Feb 27, 2019

The derivatives spec encourages some keywords to be used in either filenames or JSON sidecars, without necessarily mandating one or the other. For instance, I believe space is mandatory for many files, but the spec allows that the information can be contained in JSON rather than filenames. So, if someone does this:

files = layout.get(subject='01', space='MNI152', extensions='.nii.gz')

...then currently space will be treated as a filename entity and metadata will never be inspected for space. This means there's no way to specify a disjunction over filenames and metadata.

Unfortunately, allowing disjunctions would reduce the efficiency of the search/indexing process considerably, because we would need to pre-index metadata for all files up front, instead of using the current approach (only checking metadata for files that match the entity-level constraints, and caching any inspected file for future use). But that would at least make the implementation of the fix relatively simple.

See also #397 for a related discussion.

@tyarkoni
Copy link
Collaborator Author

On further thought, this does raise the broader question of whether it's time to start using a relational database internally. Something like dataset would work nicely, though that has SQLAlchemy as a dependency (but if we stick to in-memory sqlite, we at least don't need to bother with any other adapters). Moving to a DB approach would save memory, make it easier to read/write indices, and would greatly speed up most queries. The main cost would come from the querywise overhead of the ORM layer, as well as the initial indexing process.

@satra
Copy link
Contributor

satra commented Feb 27, 2019

@tyarkoni - if they are in both places, shouldn't the validator complain if they are different. as such, if there are duplicates, that should not impact get.

@tyarkoni
Copy link
Collaborator Author

tyarkoni commented Feb 27, 2019

@satra I'm not sure if the validator currently even checks metadata (@chrisfilo?). I agree that it should complain if the same key is found in both places but with different values.

That said, that isn't really the problem here. The problem here is that it's legal to use either of those locations to store the same piece of information, whereas the current get() implementation is unable to check both—if a key is valid as entities at the BIDSLayout level, then it will never be used to search metadata (i.e., it's not used as a fall-back constraint at the individual file level).

@satra
Copy link
Contributor

satra commented Feb 27, 2019

@tyarkoni - so in the current implementation the file level key-value pairs are not treated as metadata of objects.

in my mind, there should be no distinction between key-values in metadata vs key-values in filenames from a search perspective or an information perspective. from an efficiency perspective i can see the layering being potentially useful, but from a search perspective not so much. all those key-values are properties of the file entity.

one way to think of the json file is that it holds keys and values that would have made the filename too long! but they are still first class citizens for searching attributes.

@tyarkoni
Copy link
Collaborator Author

@satra yes, I agree. It's just that making metadata first-class citizens will probably require a switch to using a DB internally. I'm not against that (actually, I'm probably for it), but it'll take a bit of work. See comments above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants