Questioning mixed use of file paths and digests #275

zimeon · 2018-11-09T17:18:05Z

Comment about mixed use of file paths and digests from @justinlittman #272:

Based on scrutinizing example 5.2, I find the use of file digests and filepaths to reference content inconsistent and confusing. In fixity, filepaths are used to reference content. In manifests, file digests are mapped to filepaths. In version state, file digests are again mapped to filepaths.

It seems that either (1) manifests and file digests should be removed or (2) file digests should be used to reference content in the fixity section.

zimeon · 2018-11-09T18:30:58Z

Thanks for this question; it goes quite deep into the approach of OCFL in v0.1. The strategy of v0.1, as illustrated in example 5.2, is that there is a mapping from existing file path(s) <-> digests <-> logical file path(s) in each state. This provides for:

deduplication by digest (though multiple file copies for a given digest are supported)
arbitrary renaming of stored vs logical file path

This arrangement has the following features:

ensures/requires digest for each file in the manifest
reference keys in the inventory are consistent/limited size digests with a limited character set
cannot support different content for the same digest (a case that is VERY unlikely with sha512)
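
A minimal sketch of the digest-keyed arrangement described above, assuming in-memory file contents. The helper `build_version` and the stored path layout `<version>/content/<logical path>` are hypothetical and only for illustration; they are not taken from the spec. It shows identical content collapsing to one manifest entry while keeping several logical paths:

```python
import hashlib


def build_version(files, version="v1"):
    """Build digest-keyed manifest and state maps for one version.

    `files` maps logical path -> bytes. The stored path layout
    (`<version>/content/<logical path>`) is an assumption for illustration.
    """
    manifest = {}  # digest -> [existing (stored) file paths]
    state = {}     # digest -> [logical file paths]
    for logical_path, content in sorted(files.items()):
        digest = hashlib.sha512(content).hexdigest()
        # Record a stored path only the first time a digest is seen (deduplication).
        manifest.setdefault(digest, [f"{version}/content/{logical_path}"])
        state.setdefault(digest, []).append(logical_path)
    return manifest, state


files = {"empty.txt": b"", "empty2.txt": b"", "foo/bar.xml": b"<bar/>"}
manifest, state = build_version(files)
for digest, logical_paths in state.items():
    print(digest[:6], manifest[digest], logical_paths)
```

Every key in both maps is a 128-character hexadecimal string, whatever the file names look like.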

I'm not keen on suggestion 2), using one digest as the key to another in fixity. For the example that would give something along the lines of:

  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": ["7dcc35...c31"],
      "2673a7b11a70bc7ff960ad8127b4adeb":  ["4d27c8...b53"] ,
      "c289c8ccd4bab6e385f5afdd89b5bda2": ["ffccf6...62e"],
      "d41d8cd98f00b204e9800998ecf8427e": ["cf83e1...a3e"]
    },
    "sha1": {
      "66709b068a2faead97113559db78ccd44712cbf2": ["7dcc35...c31"],
      "a6357c99ecc5752931e133227581e914968f3b9c": ["4d27c8...b53"],
      "b9c7ccc6154974288132b63c15db8d2750716b49": ["ffccf6...62e"],
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": ["cf83e1...a3e"]
    },

which seems less obvious and harder to use. It gets especially weird if one allows for the real possibility of collisions in legacy digests where we'd end up with entries that might map one md5 to two sha512 digests like:

  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": ["7dcc35...c31", "a4ecf6...621"],

I'm not sure I correctly understand the suggestion 1), but it is certainly the case that we could avoid content addressing and have a simple map from an existing file path to logical file path(s) for each version. If we did this and then merged all digests into fixity (deleting manifest) we might have something like:

{
  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": [ "v1/content/foo/bar.xml" ],
      "2673a7b11a70bc7ff960ad8127b4adeb": [ "v2/content/foo/bar.xml" ],
      "c289c8ccd4bab6e385f5afdd89b5bda2": [ "v1/content/image.tiff" ],
      "d41d8cd98f00b204e9800998ecf8427e": [ "v1/content/empty.txt" ]
    },
    "sha1": {
      "66709b068a2faead97113559db78ccd44712cbf2": [ "v1/content/foo/bar.xml" ],
      "a6357c99ecc5752931e133227581e914968f3b9c": [ "v2/content/foo/bar.xml" ],
      "b9c7ccc6154974288132b63c15db8d2750716b49": [ "v1/content/image.tiff" ],
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": [ "v1/content/empty.txt" ]
    },
    "sha512": {
      "4d27c8...b53": [ "v2/content/foo/bar.xml" ],
      "7dcc35...c31": [ "v1/content/foo/bar.xml" ],
      "cf83e1...a3e": [ "v1/content/empty.txt" ],
      "ffccf6...62e": [ "v1/content/image.tiff" ]
    }
  },
  "head": "v3",
  "id": "ark:/12345/bcd987",
  "type": "Object",
  "versions": {
    "v1": {
      "created": "2018-01-01T01:01:01Z",
      "message": "Initial import",
      "state": {
        "v1/content/foo/bar.xml": [ "foo/bar.xml" ],
        "v1/content/empty.txt": [ "empty.txt" ],
        "v1/content/image.tiff": [ "image.tiff" ]
      },
      "type": "Version",
      "user": {
        "address": "[email protected]",
        "name": "Alice"
      }
    },
    "v2": {
      "created": "2018-02-02T02:02:02Z",
      "message": "Fix bar.xml, remove image.tiff, add empty2.txt",
      "state": {
        "v2/content/foo/bar.xml": [ "foo/bar.xml" ],
        "v1/content/empty.txt": [ "empty.txt", "empty2.txt" ]
      },
      "type": "Version",
      "user": {
        "address": "[email protected]",
        "name": "Bob"
      }
    },
    "v3": {
      "created": "2018-03-03T03:03:03Z",
      "message": "Reinstate image.tiff, delete empty.txt",
      "state": {
        "v2/content/foo/bar.xml": [ "foo/bar.xml" ],
        "v1/content/empty.txt": [ "empty2.txt" ],
        "v1/content/image.tiff": [ "image.tiff" ]
      },
      "type": "Version",
      "user": {
        "address": "[email protected]",
        "name": "Cecilia"
      }
    }
  }
}

As before, this would provide:

deduplication with multiple file copies optionally supported
arbitrary renaming of stored vs logical file path

This arrangement has the following features:

rules for digests separate from needs of object representation (might still want a rule that says there must be one of some digest for every file, could also restructure fixity/manifest section in various ways)
reference keys in the inventory are existing file paths and NOT consistent/limited size digests with a limited character set
can support different content for the same digest (a case that is VERY unlikely with sha512)
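
A rough sketch of how lookups might work against the path-keyed alternative above. The function names `stored_path_for` and `digests_for` are hypothetical, and the inventory is a trimmed copy of the example that keeps only the empty file. Note that finding a file's digests means scanning the merged fixity block:

```python
def stored_path_for(inventory, version, logical_path):
    """Return the existing file path whose state entry lists `logical_path`."""
    state = inventory["versions"][version]["state"]
    for existing_path, logical_paths in state.items():
        if logical_path in logical_paths:
            return existing_path
    raise KeyError(f"{logical_path!r} not in {version}")


def digests_for(inventory, existing_path):
    """Collect every recorded digest for a stored file by scanning fixity."""
    return {
        algorithm: digest
        for algorithm, by_digest in inventory["fixity"].items()
        for digest, paths in by_digest.items()
        if existing_path in paths
    }


# Trimmed copy of the path-keyed example; only the empty file is included.
inventory = {
    "fixity": {
        "md5": {"d41d8cd98f00b204e9800998ecf8427e": ["v1/content/empty.txt"]},
        "sha1": {"da39a3ee5e6b4b0d3255bfef95601890afd80709": ["v1/content/empty.txt"]},
    },
    "versions": {
        "v3": {"state": {"v1/content/empty.txt": ["empty2.txt"]}},
    },
}

path = stored_path_for(inventory, "v3", "empty2.txt")
print(path, digests_for(inventory, path))
```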

ahankinson · 2018-12-05T23:28:51Z

I'm still not convinced that an "arbitrary" string will not break things, and a filename from a potentially unknown source is effectively an arbitrary string. I could imagine, for example:

"v1/content/thesis/أطروحة الدكتوراه.pdf" or "v1/content/thesis/🤣🤣🤩🥳💋.➕✖➕"

Which, in theory, could be fine, but I don't really know enough about how POSIX filesystems handle things like combining characters, RTL languages, etc. to be comfortable with using them as the primary means of addressability.

On my laptop, it looks like this:

The question marks are probably just terminal-speak for "I don't know how to render this character"

Then importing this into Python (3.7) you get:

Which, again, is more-or-less OK in theory but still not really the same as the nice, neat file paths in your example. If we use hashes, though, we at least have a way of identifying the name the thing has on the current filesystem, rather than depending on the name it had on the filesystem where the inventory was written.

We could use the fixity sections for that, of course, but it seems to me that a single manifest block is a clearer way of identifying all the files in the object.

I also like the idea of having a fixed-length key, but that's primarily aesthetic.
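
To illustrate the concern with combining characters, here is a small sketch using Python's standard `unicodedata` module. The two path strings are made up for the example. They render identically but differ as strings and as bytes, so they would be different keys in an inventory, whereas a sha512 hex digest is always 128 ASCII characters:

```python
import hashlib
import unicodedata

# Two spellings of the same visible name: precomposed vs combining accent.
composed = "v1/content/th\u00e8se.pdf"     # "è" as a single code point
decomposed = "v1/content/the\u0300se.pdf"  # "e" followed by a combining grave accent

# The strings compare unequal even though they look the same when rendered.
print(composed == decomposed)
# Normalizing to NFC makes them comparable again.
print(unicodedata.normalize("NFC", decomposed) == composed)
# Their UTF-8 encodings have different lengths.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# A digest key has a fixed length and a limited character set.
digest = hashlib.sha512(b"file content").hexdigest()
print(len(digest), digest.isascii())
```

If a filesystem or tool normalizes names differently from the system that wrote the inventory, a path-keyed lookup could miss; a digest-keyed lookup does not depend on the name at all.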

zimeon · 2018-12-06T20:02:45Z

@ahankinson -- I think the issues you bring up about mapping datastreams-in-inventory to datastreams-on-disk are the same whether the mapping is done in the manifest (as the spec is currently written) or done via keys in the state blocks. I think the key difference between the two approaches is whether the arbitrary filename string is also used, possibly multiple times, as keys in the state blocks.

rosy1280 · 2019-01-06T21:10:12Z

I would like to go back and challenge the original premise of the concern. In the fixity block of example 5.2 file digests are used to reference file paths. See the language below:

The structure of the fixity section must contain a key corresponding to an approved digest algorithm. The value of this key must follow the structure of the manifest section; that is, a key corresponding to the digest value, and an array of existing file paths that match that digest.
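
A minimal sketch of how a fixity block with the structure quoted above could be checked, assuming a caller-supplied `read_bytes` function (hypothetical) that returns the content stored at an existing file path. The function name `check_fixity` is also an assumption, not something defined by the spec:

```python
import hashlib


def check_fixity(fixity, read_bytes):
    """Return (algorithm, path, expected, actual) tuples for mismatches.

    `fixity` is shaped as algorithm -> digest -> [existing file paths].
    `read_bytes` is a caller-supplied function (hypothetical) that returns
    the content stored at an existing file path.
    """
    failures = []
    for algorithm, by_digest in fixity.items():
        for expected, paths in by_digest.items():
            for path in paths:
                actual = hashlib.new(algorithm, read_bytes(path)).hexdigest()
                if actual != expected:
                    failures.append((algorithm, path, expected, actual))
    return failures


# In-memory stand-in for stored files; only the empty file is included.
stored = {"v1/content/empty.txt": b""}
fixity = {
    "md5": {"d41d8cd98f00b204e9800998ecf8427e": ["v1/content/empty.txt"]},
    "sha1": {"da39a3ee5e6b4b0d3255bfef95601890afd80709": ["v1/content/empty.txt"]},
}
print(check_fixity(fixity, stored.__getitem__))
```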

If there is a serious and real concern that I am missing, then I would suggest removing the fixity block from the inventory file and including it another way (perhaps a separate file). Remember the fixity block is relatively new and was provided for those who wish to port over legacy algorithms (note this could be better explained in the implementation notes).

rosy1280 · 2019-02-14T13:37:45Z

Recent conversations in the Fedora OCFL paper make me wonder if our definitions for logical file path and existing file path aren't descriptive enough. I wonder if we should add a section to the implementation notes that talks about our decision to have these two concepts AND/OR expand the definition a bit (this might mean going beyond the one or two sentence rule).

birkland · 2019-02-14T20:09:30Z

The topic of logical and existing file path came up today at the last Fedora committers' meeting. The spec defines no particular relationship between logical paths and existing paths, but does stipulate that for any given logical path in a particular version, its content can be found via logical -> digest -> existing and reading the content of existing. This is clear and easily reasoned about.

An example of "no particular relationship between the logical and existing paths" can be found in some test data I wrote by hand a while back. Ignore the content of the hashes :)

In particular:

    "manifest": {
      "ad10": [
        "v1/content/1"
      ],
      "ad11": [
        "v3/content/2"
      ]

and

            "state": {
              "ad10": [
                "obj3.txt"
              ],
              "ad11": [
                "obj3-new.txt"
              ]

In other words, an implementation stored a file named 1, whose content appears in the object's logical state as obj3.txt. This is logical, useful, and flexible.
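
The logical -> digest -> existing lookup can be sketched as follows, using the hand-written test data above. The function name `resolve` and the version name "v3" are assumptions, since the excerpt does not show which version the state block belongs to:

```python
def resolve(inventory, version, logical_path):
    """Follow logical path -> digest -> existing path for one version."""
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # Any existing path listed for the digest holds the same content.
            return inventory["manifest"][digest][0]
    raise KeyError(f"{logical_path!r} is not in the state of {version}")


inventory = {
    "manifest": {"ad10": ["v1/content/1"], "ad11": ["v3/content/2"]},
    # The version name "v3" is assumed; the excerpt does not show it.
    "versions": {"v3": {"state": {"ad10": ["obj3.txt"], "ad11": ["obj3-new.txt"]}}},
}
print(resolve(inventory, "v3", "obj3.txt"))
```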

Exploring this decoupling a bit further, one could represent:

    "manifest": {
      "ad10": [
        "v1/content/1"
      ],
      "ad11": [
        "v3/content/2"
      ]

and

            "state": {
              "ad10": [
                "foo"
              ],
              "ad11": [
                "foo/bar.xml"
              ]

Notably, in the domain of physical files, we have two discrete content files 1 and 2. In the logical domain, we have assigned content to things called foo, and foo/bar.xml. That doesn't make a lot of sense on most file systems, but then again there is no defined relationship between the logical state of an object and a filesystem. In the context of web resources and the Fedora API, this sort of construct actually could make a lot of sense. Directories (LDP containers) can and do have a representation, and the logical state of this object could easily be exposed as a container foo with a member foo/bar.xml.
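
An implementation that wants to write an object's logical state out to an ordinary file system might want to detect this construct first. A rough sketch, with the hypothetical helper `file_directory_conflicts`, that reports logical paths used both as a file and as a directory prefix of another path:

```python
def file_directory_conflicts(logical_paths):
    """Return (file, descendant) pairs where a logical file path is also
    used as a directory prefix of another logical path."""
    paths = set(logical_paths)
    conflicts = []
    for path in sorted(paths):
        parts = path.split("/")
        # Check every proper ancestor of `path` against the set of files.
        for i in range(1, len(parts)):
            ancestor = "/".join(parts[:i])
            if ancestor in paths:
                conflicts.append((ancestor, path))
    return conflicts


print(file_directory_conflicts(["foo", "foo/bar.xml"]))
print(file_directory_conflicts(["obj3.txt", "obj3-new.txt"]))
```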

In any case, the spec as written affords flexibility for either construct. The former is straightforward, but the latter raises interesting implementation questions. So we thought it would be a good idea to bring these examples to the editors.

awoods · 2019-02-14T20:53:49Z

I think for the sake of @OCFL/editors , what @birkland is raising is a fact of how the specification is currently written. Although most of the examples in the spec assume the "logical path" is the path relative to the "content/" directory, that does not have to be the case, per the spec.

Some of the earlier OCFL conversations expressed the value in having meaningful filenames persisted to disk... but the specification allows for a degree of flexibility which may be helpful in certain circumstances.

zimeon · 2019-02-14T22:01:31Z

I think the key thing is that the spec allows you to use a "nice" path but does not require it. I think this is a rather different issue from the original one, though; I'm not sure whether there is a new question here.

zimeon · 2019-02-20T16:36:16Z

Have created #310 and #311 to deal with side issues that came up recently in this thread.

In the editors' meeting we decided that we will keep using digests as the keys in the inventory file in order to have clean/simple strings for reference, as opposed to the issues described in #275 (comment). Closing.

zimeon added the Question and OCFL Object labels Nov 9, 2018

justinlittman mentioned this issue Nov 9, 2018

Support for empty directories (v0.1 feedback) #272

Closed

zimeon added this to the Beta milestone Dec 5, 2018

awoods added the Needs Discussion label Dec 5, 2018

ahankinson mentioned this issue Dec 6, 2018

digest/file state maps -- wrong direction? #277

Closed

This was referenced Feb 20, 2019

Add inventory example to implementation notes that has different existing and logical paths #310

Closed

Add note to specification about the importance of the implementation notes as explaining #311

Closed

awoods assigned zimeon Feb 20, 2019

zimeon closed this as completed Feb 20, 2019

zimeon mentioned this issue Nov 5, 2019

Fixity field structure #406

Closed

zimeon mentioned this issue Jul 31, 2023

using unique IDs instead of digest to reference files #632

Closed
