Questioning mixed use of file paths and digests #275

Closed
zimeon opened this issue Nov 9, 2018 · 9 comments
Assignees
Labels
Needs Discussion OCFL Object Question Further information is requested
Milestone

Comments

@zimeon
Contributor

zimeon commented Nov 9, 2018

Comment about mixed use of file paths and digests from @justinlittman #272:

Based on scrutinizing example 5.2, I find the use of file digests and filepaths to reference content inconsistent and confusing. In fixity, filepaths are used to reference content. In manifests, file digests are mapped to filepaths. In version state, file digests are again mapped to filepaths.

It seems that either (1) manifests and file digests should be removed or (2) file digests should be used to reference content in the fixity section.

@zimeon
Contributor Author

zimeon commented Nov 9, 2018

Thanks for this question, it goes quite deep into the approach of OCFL in v0.1. The strategy of v0.1 as illustrated in example 5.2 is that there is a mapping from existing file path(s) <-> digests <-> logical file path(s) in each state. This provides for:

  • deduplication by digest (though multiple file copies for a given digest are supported)
  • arbitrary renaming of stored vs logical file path

This arrangement has the following features:

  • ensures/requires digest for each file in the manifest
  • reference keys in the inventory are consistent/limited size digests with a limited character set
  • cannot support (with sha512 the VERY unlikely case of) different content for the same digest

I'm not keen on suggestion 2), using one digest as the key to another in fixity. For the example that would give something along the lines of:

  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": ["7dcc35...c31"],
      "2673a7b11a70bc7ff960ad8127b4adeb": ["4d27c8...b53"],
      "c289c8ccd4bab6e385f5afdd89b5bda2": ["ffccf6...62e"],
      "d41d8cd98f00b204e9800998ecf8427e": ["cf83e1...a3e"]
    },
    "sha1": {
      "66709b068a2faead97113559db78ccd44712cbf2": ["7dcc35...c31"],
      "a6357c99ecc5752931e133227581e914968f3b9c": ["4d27c8...b53"],
      "b9c7ccc6154974288132b63c15db8d2750716b49": ["ffccf6...62e"],
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": ["cf83e1...a3e"]
    },

which seems less obvious and harder to use. It gets especially weird if one allows for the real possibility of collisions in legacy digests where we'd end up with entries that might map one md5 to two sha512 digests like:

  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": ["7dcc35...c31", "a4ecf6...621"],

I'm not sure I correctly understand the suggestion 1), but it is certainly the case that we could avoid content addressing and have a simple map from an existing file path to logical file path(s) for each version. If we did this and then merged all digests into fixity (deleting manifest) we might have something like:

{
  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": [ "v1/content/foo/bar.xml" ],
      "2673a7b11a70bc7ff960ad8127b4adeb": [ "v2/content/foo/bar.xml" ],
      "c289c8ccd4bab6e385f5afdd89b5bda2": [ "v1/content/image.tiff" ],
      "d41d8cd98f00b204e9800998ecf8427e": [ "v1/content/empty.txt" ]
    },
    "sha1": {
      "66709b068a2faead97113559db78ccd44712cbf2": [ "v1/content/foo/bar.xml" ],
      "a6357c99ecc5752931e133227581e914968f3b9c": [ "v2/content/foo/bar.xml" ],
      "b9c7ccc6154974288132b63c15db8d2750716b49": [ "v1/content/image.tiff" ],
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": [ "v1/content/empty.txt" ]
    },
    "sha512": {
      "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
      "7dcc35...c31": [ "v1/content/foo/bar.xml" ],
      "cf83e1...a3e": [ "v1/content/empty.txt" ],
      "ffccf6...62e": [ "v1/content/image.tiff" ]
    }
  },
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "type": "Object",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import",
      "state": {
        "v1/content/foo/bar.xml": [ "foo/bar.xml" ],
        "v1/content/empty.txt": [ "empty.txt" ],
        "v1/content/image.tiff": [ "image.tiff" ]
      },
      "type": "Version",
      "user": {
        "address": "[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml, remove image.tiff, add empty2.txt",
      "state": {
        "v2/content/foo/bar.xml": [ "foo/bar.xml" ],
        "v1/content/empty.txt": [ "empty.txt", "empty2.txt" ]
      },
      "type": "Version",
      "user": {
        "address": "[email protected]",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff, delete empty.txt",
      "state": {
        "v2/content/foo/bar.xml": [ "foo/bar.xml" ],
        "v1/content/empty.txt": [ "empty2.txt" ],
        "v1/content/image.tiff": [ "image.tiff" ]
      },
      "type": "Version",
      "user": {
        "address": "[email protected]",
        "name": "Cecilia"
      }
    }
  }
}

As before, this would provide:

  • deduplication with multiple file copies optionally supported
  • arbitrary renaming of stored vs logical file path

This arrangement has the following features:

  • rules for digests separate from needs of object representation (might still want a rule that says there must be one of some digest for every file, could also restructure fixity/manifest section in various ways)
  • reference keys in the inventory are existing file paths and NOT consistent/limited size digests with a limited character set
  • can support (with sha512 the VERY unlikely case of) different content for the same digest

@zimeon zimeon added this to the Beta milestone Dec 5, 2018
@ahankinson
Contributor

I'm still not convinced that an "arbitrary" string will not break things, and a filename from a potentially unknown source is effectively an arbitrary string. I could imagine, for example:

"v1/content/thesis/أطروحة الدكتوراه.pdf" or "v1/content/thesis/🤣🤣🤩🥳💋.➕✖➕"

Which, in theory, could be fine, but I don't really know enough about how POSIX filesystems handle things like combining characters, RTL languages, etc. to be comfortable with using them as the primary means of addressability.

On my laptop, it looks like this:

image

The question marks are probably just terminal-speak for "I don't know how to render this character"

Then importing this into Python (3.7) you get:

image

Which, again, is more-or-less OK in theory but still not really the same as the nice, neat file paths in your example. If we use hashes, though, we at least have a way of identifying the name the thing has on the current filesystem, rather than depending on the name it had on the filesystem where the inventory was written.

We could use the fixity sections for that, of course, but it seems to me that a single manifest block is a clearer way of identifying all the files in the object.

I also like the idea of having a fixed-length key, but that's primarily aesthetic.
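The combining-character concern can be illustrated with a minimal Python sketch. The file name used here is a hypothetical example, not one from the thread, and the sketch only shows that two visually identical names can be different strings; how any particular filesystem stores or returns such names is not something this thread establishes.

```python
import unicodedata

# Two spellings of the same visible file name (hypothetical example).
composed = "\u00e9.pdf"      # "é" as the single code point U+00E9
decomposed = "e\u0301.pdf"   # "e" followed by combining acute accent U+0301


def same_name(a, b):
    """Compare two file names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# As raw strings (and therefore as JSON keys) the two names differ,
# even though they render identically.
raw_equal = composed == decomposed

# After normalization they compare equal.
nfc_equal = same_name(composed, decomposed)
```

If file paths were used as inventory keys, a lookup could miss in this way whenever the name read back from disk is in a different normalization form from the one recorded in the inventory. A digest key does not depend on how the name is encoded.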

@zimeon
Contributor Author

zimeon commented Dec 6, 2018

@ahankinson -- I think the issues you bring up about mapping datastreams-in-inventory to datastreams-on-disk are the same whether the mapping is done in the manifest (as the spec is currently written) or via keys in the state blocks. I think the key difference between the two approaches is whether the arbitrary filename string is also used, possibly multiple times, as a key in the state blocks.

@rosy1280
Contributor

rosy1280 commented Jan 6, 2019

I would like to go back and challenge the original premise of the concern. In the fixity block of example 5.2 file digests are used to reference file paths. See the language below:

The structure of the fixity section must contain a key corresponding to an approved digest algorithm. The value of this key must follow the structure of the manifest section; that is, a key corresponding to the digest value, and an array of existing file paths that match that digest.
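The structure quoted above (algorithm name, then digest value, then an array of existing file paths) can be sketched roughly as follows. This is an illustration only: `build_fixity` is a hypothetical helper, the file contents are held in memory rather than read from disk, and `v2/content/empty-copy.txt` is an invented path used to show two paths sharing one digest.

```python
import hashlib


def build_fixity(files, algorithms):
    """Build {algorithm: {digest: [existing file paths]}} from path -> bytes."""
    fixity = {}
    for algorithm in algorithms:
        block = {}
        for path, data in files.items():
            digest = hashlib.new(algorithm, data).hexdigest()
            # Files with identical content share a digest, so paths accumulate.
            block.setdefault(digest, []).append(path)
        fixity[algorithm] = block
    return fixity


# Hypothetical content: two empty files stored at different existing paths.
files = {
    "v1/content/empty.txt": b"",
    "v2/content/empty-copy.txt": b"",
}
fixity = build_fixity(files, ["md5", "sha1"])
```

Both paths end up under the same md5 and sha1 keys, which are the empty-file digests that appear in the examples earlier in this thread.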

If there is a serious and real concern that I am missing, then I would suggest removing the fixity block from the inventory file and including it another way (perhaps a separate file). Remember the fixity block is relatively new and was provided for those who wish to port over legacy algorithms (note this could be better explained in the implementation notes).

@rosy1280
Contributor

Recent conversations in the Fedora OCFL paper make me wonder if our definitions for logical file path and existing file path aren't descriptive enough. I wonder if we should add a section to the implementation notes that talks about our decision to have these two concepts AND/OR expand the definition a bit (this might mean going beyond the one or two sentence rule.)

@birkland
Contributor

The topic of logical and existing file path came up today at the last Fedora committers' meeting. The spec defines no particular relationship between logical paths and existing paths, but does stipulate that for any given logical path in a particular version, its content can be found via logical -> digest -> existing and reading the content of existing. This is clear and easily reasoned about.

An example of "no particular relationship between the logical and existing paths" can be found in some test data I wrote by hand a while back. Ignore the content of the hashes :)

In particular:

    "manifest": {
      "ad10": [
        "v1/content/1"
      ],
      "ad11": [
        "v3/content/2"
      ]

and

            "state": {
              "ad10": [
                "obj3.txt"
              ],
              "ad11": [
                "obj3-new.txt"
              ]

In other words, an implementation stored a file named 1, whose content appears in the object's logical state as obj3.txt. This is logical, useful, and flexible.
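The logical -> digest -> existing lookup can be sketched in a few lines of Python. The `resolve` function is a hypothetical helper, and placing the state block under a version named `v3` is an assumption made for the sketch, since the fragment above does not show which version it belongs to.

```python
# Inventory fragment built from the hand-written test data above.
# The version key "v3" is an assumption, not taken from the fragment.
inventory = {
    "manifest": {
        "ad10": ["v1/content/1"],
        "ad11": ["v3/content/2"],
    },
    "versions": {
        "v3": {
            "state": {
                "ad10": ["obj3.txt"],
                "ad11": ["obj3-new.txt"],
            }
        }
    },
}


def resolve(inventory, version, logical_path):
    """Return an existing file path holding the content of a logical path."""
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # Any path listed under the digest holds the same content;
            # take the first one.
            return inventory["manifest"][digest][0]
    raise KeyError(logical_path)


existing = resolve(inventory, "v3", "obj3.txt")
```

Here `existing` is `v1/content/1`, the stored file whose name has no relationship to the logical name `obj3.txt`.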

Exploring this decoupling a bit further, one could represent:

    "manifest": {
      "ad10": [
        "v1/content/1"
      ],
      "ad11": [
        "v3/content/2"
      ]

and

            "state": {
              "ad10": [
                "foo"
              ],
              "ad11": [
                "foo/bar.xml"
              ]

Notably, in the domain of physical files, we have two discrete content files 1 and 2. In the logical domain, we have assigned content to things called foo, and foo/bar.xml. That doesn't make a lot of sense on most file systems, but then again there is no defined relationship between the logical state of an object and a filesystem. In the context of web resources and the Fedora API, this sort of construct actually could make a lot of sense. Directories (LDP containers) can and do have a representation, and the logical state of this object could easily be exposed as a container foo with a member foo/bar.xml.
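An implementation that wants to notice this construct could scan a state block for logical paths that are also a directory prefix of another logical path. The sketch below is one way to do that; `find_prefix_conflicts` is a hypothetical helper and nothing in the spec as discussed here requires such a check.

```python
def find_prefix_conflicts(state):
    """Return (parent, child) pairs where a logical path is also used
    as a directory prefix of another logical path in the same state."""
    logical = {path for paths in state.values() for path in paths}
    conflicts = []
    for path in sorted(logical):
        parts = path.split("/")
        # Check every proper ancestor of this path, e.g. "foo" for "foo/bar.xml".
        for i in range(1, len(parts)):
            parent = "/".join(parts[:i])
            if parent in logical:
                conflicts.append((parent, path))
    return conflicts


# State block from the second example above.
state = {"ad10": ["foo"], "ad11": ["foo/bar.xml"]}
conflicts = find_prefix_conflicts(state)
```

For the state above, `conflicts` contains the single pair `("foo", "foo/bar.xml")`; for the earlier `obj3.txt` / `obj3-new.txt` state it would be empty.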

In any case, the spec as written affords flexibility for either construct. The former is straightforward, but the latter raises interesting implementation questions. So we thought it would be a good idea to bring these examples to the editors.

@awoods
Member

awoods commented Feb 14, 2019

I think for the sake of @OCFL/editors , what @birkland is raising is a fact of how the specification is currently written. Although most of the examples in the spec assume the "logical path" is the path relative to the "content/" directory, that does not have to be the case, per the spec.

Some of the earlier OCFL conversations expressed the value in having meaningful filenames persisted to disk... but the specification allows for a degree of flexibility which may be helpful in certain circumstances.

@zimeon
Contributor Author

zimeon commented Feb 14, 2019

I think the key thing is that the spec allows you to use a "nice" path but does not require it. I think this is a rather different issue from the original one though, and I'm not sure whether there is a new question here.

@zimeon
Contributor Author

zimeon commented Feb 20, 2019

Have created #310 and #311 to deal with side issues that came up recently in this thread.

In the editors' meeting we decided that we will keep using digests as the keys in the inventory file in order to have clean/simple strings for reference, as opposed to the issues described in #275 (comment). Closing.

5 participants