Questioning mixed use of file paths and digests #275
Comments
Thanks for this question, it goes quite deep into the approach of OCFL in v0.1. The strategy of v0.1 as illustrated in example 5.2 is that there is a mapping from existing file path(s) <-> digests <-> logical file path(s) in each state. This provides for:
This arrangement has the following features:
I'm not keen on suggestion 2), using one digest as the key to another in
which seems less obvious and harder to use. It gets especially weird if one allows for the real possibility of collisions in legacy digests where we'd end up with entries that might map one md5 to two sha512 digests like:
I'm not sure I correctly understand the suggestion 1), but it is certainly the case that we could avoid content addressing and have a simple map from an existing file path to logical file path(s) for each version. If we did this and then merged all digests into
As before, this would provide:
This arrangement has the following features:
I'm still not convinced that an "arbitrary" string will not break things, and a filename from a potentially unknown source is effectively an arbitrary string. I could imagine, for example: "v1/content/thesis/أطروحة الدكتوراه.pdf" or "v1/content/thesis/🤣🤣🤩🥳💋.➕✖➕"

In theory these could be fine, but I don't know enough about how POSIX filesystems handle things like combining characters, RTL languages, etc. to be comfortable with using them as the primary means of addressability.

On my laptop, it looks like this:

The question marks are probably just terminal-speak for "I don't know how to render this character".

Then importing this into Python (3.7) you get:

Which, again, is more-or-less OK in theory but still not really the same as the nice, neat file paths in your example.

If we use hashes, though, we at least have a way of identifying what the thing is named on the current filesystem, rather than depending on what it was named on the filesystem where the inventory was written. We could use the fixity sections for that, of course, but it seems to me that a single manifest block is a clearer way of identifying all the files in the object. I also like the idea of having a fixed-length key, but that's primarily aesthetic.
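To make the combining-character concern concrete, here is a minimal Python sketch. It is not taken from the thread: the file name, the placeholder content, and the helper names `same_path_bytes` and `digest_key` are all assumptions made for illustration. It shows only that two Unicode normalization forms of one visible name encode to different bytes, while a content digest is a fixed-length ASCII string. It makes no claim about how any particular filesystem treats such names.

```python
import hashlib
import unicodedata

# Hypothetical file name; "\u00e8" is a precomposed "e with grave accent".
name = "th\u00e8se.pdf"

# Two Unicode normalization forms of the same visible name.
nfc = unicodedata.normalize("NFC", name)  # precomposed character
nfd = unicodedata.normalize("NFD", name)  # base letter + combining accent


def same_path_bytes(a: str, b: str) -> bool:
    """True if two names encode to identical UTF-8 byte sequences."""
    return a.encode("utf-8") == b.encode("utf-8")


def digest_key(content: bytes) -> str:
    """Hex SHA-512 of file content: ASCII only, always 128 characters."""
    return hashlib.sha512(content).hexdigest()


content = b"example file content"  # placeholder bytes, not real data
print(same_path_bytes(nfc, nfd))  # False
print(len(digest_key(content)))   # 128
```

A path used as a key inherits whatever form the writing system happened to produce, whereas the digest depends only on the file's bytes.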
@ahankinson -- I think the issues you bring up about mapping datastreams-in-inventory to datastreams-on-disk are the same whether it is done in the
I would like to go back and challenge the original premise of the concern. In the fixity block of example 5.2 file digests are used to reference file paths. See the language below:
If there is a serious and real concern that I am missing, then I would suggest removing the fixity block from the inventory file and including it another way (perhaps a separate file). Remember the fixity block is relatively new and was provided for those who wish to port over legacy algorithms (note this could be better explained in the implementation notes).
Recent conversations in the Fedora OCFL paper make me wonder if our definitions for logical file path and existing file path aren't descriptive enough. I wonder if we should add a section to the implementation notes that talks about our decision to have these two concepts AND/OR expand the definition a bit (this might mean going beyond the one or two sentence rule).
The topic of logical and existing file path came up today at the last Fedora committers' meeting. The spec defines no particular relationship between logical paths and existing paths, but does stipulate that for any given logical path in a particular version, its content can be found via logical -> digest -> existing and reading the content of existing. This is clear and easily reasoned about. An example of "no particular relationship between the logical and existing paths" can be found in some test data I wrote by hand a while back. Ignore the content of the hashes :) In particular:
and
In other words, an implementation stored a file named Exploring this decoupling a bit further, one could represent:
and
Notably, in the domain of physical files, we have two discrete content files In any case the spec as written affords flexibility for either construct. The former is straightforward, but the latter raises interesting implementation questions. So we thought it would be a good idea to bring these examples to the editors.
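The logical -> digest -> existing lookup described in this comment can be sketched as follows. This is an illustration only. It assumes an inventory with a `manifest` mapping each digest to its existing path(s) and a per-version `state` mapping each digest to its logical path(s). The sample inventory, the shortened digest `abc123`, both paths, and the function name `resolve_existing_paths` are hypothetical and do not come from the thread or the spec examples.

```python
# Hypothetical inventory fragment; digest and paths are made up.
INVENTORY = {
    "manifest": {
        # digest -> existing path(s) on disk
        "abc123": ["v1/content/0001.bin"],
    },
    "versions": {
        "v1": {
            "state": {
                # digest -> logical path(s) in this version
                "abc123": ["thesis/final.pdf"],
            }
        }
    },
}


def resolve_existing_paths(inventory, version, logical_path):
    """Follow logical path -> digest -> existing path(s) for one version."""
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # The digest is the only link between the two kinds of path.
            return inventory["manifest"][digest]
    raise KeyError(f"{logical_path!r} is not in the state of {version}")


print(resolve_existing_paths(INVENTORY, "v1", "thesis/final.pdf"))
# ['v1/content/0001.bin']
```

Nothing in the lookup requires the logical path and the existing path to resemble each other, which is the decoupling the examples in this comment point at.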
I think for the sake of @OCFL/editors, what @birkland is raising is a fact of how the specification is currently written. Although most of the examples in the spec assume the "logical path" is the path relative to the "content/" directory, that does not have to be the case, per the spec. Some of the earlier OCFL conversations expressed the value in having meaningful filenames persisted to disk, but the specification allows for a degree of flexibility which may be helpful in certain circumstances.
I think the key thing is that the spec allows you to use a "nice" path but does not require it. I think this is a rather different issue from the original one, though; I'm not sure whether there is a new question here.
Have created #310 and #311 to deal with side issues that came up recently in this thread. In the editors' meeting we decided that we will keep using digests as the keys in the inventory file in order to have clean/simple strings for reference, as opposed to the issues described in #275 (comment). Closing.
Comment about mixed use of file paths and digests from @justinlittman #272: