Search space ambiguity in cases where keywords can occur in either entities or metadata #398

tyarkoni · 2019-02-27T14:38:30Z

The derivatives spec encourages some keywords to be used in either filenames or JSON sidecars, without necessarily mandating one or the other. For instance, I believe space is mandatory for many files, but the spec allows that the information can be contained in JSON rather than filenames. So, if someone does this:

files = layout.get(subject='01', space='MNI152', extensions='.nii.gz')

...then currently space will be treated as a filename entity and metadata will never be inspected for space. This means there's no way to specify a disjunction over filenames and metadata.

Unfortunately, allowing disjunctions would reduce the efficiency of the search/indexing process considerably, because we would need to pre-index metadata for all files up front, instead of using the current approach (only checking metadata for files that match the entity-level constraints, and caching any inspected file for future use). But that would at least make the implementation of the fix relatively simple.

See also #397 for a related discussion.

The text was updated successfully, but these errors were encountered:

tyarkoni · 2019-02-27T14:55:36Z

On further thought, this does raise the broader question of whether it's time to start using a relational database internally. Something like dataset would work nicely, though that has SQLAlchemy as a dependency (but if we stick to in-memory sqlite, we at least don't need to bother with any other adapters). Moving to a DB approach would save memory, make it easier to read/write indices, and would greatly speed up most queries. The main cost would come from the querywise overhead of the ORM layer, as well as the initial indexing process.

satra · 2019-02-27T14:55:44Z

@tyarkoni - if they are in both places, shouldn't the validator complain if they are different. as such, if there are duplicates, that should not impact get.

tyarkoni · 2019-02-27T14:57:58Z

@satra I'm not sure if the validator currently even checks metadata (@chrisfilo?). I agree that it should complain if the same key is found in both places but with different values.

That said, that isn't really the problem here. The problem here is that it's legal to use either of those locations to store the same piece of information, whereas the current get() implementation is unable to check both—if a key is valid as entities at the BIDSLayout level, then it will never be used to search metadata (i.e., it's not used as a fall-back constraint at the individual file level).

satra · 2019-02-27T15:04:21Z

@tyarkoni - so in the current implementation the file level key-value pairs are not treated as metadata of objects.

in my mind, there should be no distinction between key-values in metadata vs key-values in filenames from a search perspective or an information perspective. from an efficiency perspective i can see the layering being potentially useful, but from a search perspective not so much. all those key-values are properties of the file entity.

one way to think of the json file is that it holds keys and values that would have made the filename too long! but they are still first class citizens for searching attributes.

tyarkoni · 2019-02-27T15:07:05Z

@satra yes, I agree. It's just that making metadata first-class citizens will probably require a switch to using a DB internally. I'm not against that (actually, I'm probably for it), but it'll take a bit of work. See comments above.

yarikoptic mentioned this issue Feb 27, 2019

Formalize the "extension" entity to replace "extensions" argument to get #404

Closed

tyarkoni mentioned this issue Mar 21, 2019

Take the DB plunge #422

Closed

tyarkoni mentioned this issue Apr 20, 2019

WIP: 0.9 #432

Merged

tyarkoni closed this as completed in #432 May 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Search space ambiguity in cases where keywords can occur in either entities or metadata #398

Search space ambiguity in cases where keywords can occur in either entities or metadata #398

tyarkoni commented Feb 27, 2019 •

edited

Loading

tyarkoni commented Feb 27, 2019

satra commented Feb 27, 2019

tyarkoni commented Feb 27, 2019 •

edited

Loading

satra commented Feb 27, 2019

tyarkoni commented Feb 27, 2019

Search space ambiguity in cases where keywords can occur in either entities or metadata #398

Search space ambiguity in cases where keywords can occur in either entities or metadata #398

Comments

tyarkoni commented Feb 27, 2019 • edited Loading

tyarkoni commented Feb 27, 2019

satra commented Feb 27, 2019

tyarkoni commented Feb 27, 2019 • edited Loading

satra commented Feb 27, 2019

tyarkoni commented Feb 27, 2019

tyarkoni commented Feb 27, 2019 •

edited

Loading

tyarkoni commented Feb 27, 2019 •

edited

Loading