-
Notifications
You must be signed in to change notification settings - Fork 124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Search space ambiguity in cases where keywords can occur in either entities or metadata #398
Comments
On further thought, this does raise the broader question of whether it's time to start using a relational database internally. Something like dataset would work nicely, though that has SQLAlchemy as a dependency (but if we stick to in-memory sqlite, we at least don't need to bother with any other adapters). Moving to a DB approach would save memory, make it easier to read/write indices, and would greatly speed up most queries. The main cost would come from the querywise overhead of the ORM layer, as well as the initial indexing process. |
@tyarkoni - if they are in both places, shouldn't the validator complain if they are different. as such, if there are duplicates, that should not impact get. |
@satra I'm not sure if the validator currently even checks metadata (@chrisfilo?). I agree that it should complain if the same key is found in both places but with different values. That said, that isn't really the problem here. The problem here is that it's legal to use either of those locations to store the same piece of information, whereas the current |
@tyarkoni - so in the current implementation the file level key-value pairs are not treated as metadata of objects. in my mind, there should be no distinction between key-values in metadata vs key-values in filenames from a search perspective or an information perspective. from an efficiency perspective i can see the layering being potentially useful, but from a search perspective not so much. all those key-values are properties of the file entity. one way to think of the json file is that it holds keys and values that would have made the filename too long! but they are still first class citizens for searching attributes. |
@satra yes, I agree. It's just that making metadata first-class citizens will probably require a switch to using a DB internally. I'm not against that (actually, I'm probably for it), but it'll take a bit of work. See comments above. |
The derivatives spec encourages some keywords to be used in either filenames or JSON sidecars, without necessarily mandating one or the other. For instance, I believe
space
is mandatory for many files, but the spec allows that the information can be contained in JSON rather than filenames. So, if someone does this:...then currently
space
will be treated as a filename entity and metadata will never be inspected forspace
. This means there's no way to specify a disjunction over filenames and metadata.Unfortunately, allowing disjunctions would reduce the efficiency of the search/indexing process considerably, because we would need to pre-index metadata for all files up front, instead of using the current approach (only checking metadata for files that match the entity-level constraints, and caching any inspected file for future use). But that would at least make the implementation of the fix relatively simple.
See also #397 for a related discussion.
The text was updated successfully, but these errors were encountered: