nocopy adds files with different hashes than without nocopy #4318

Closed
PCSmith opened this issue Oct 18, 2017 · 15 comments · Fixed by #6891

@PCSmith

PCSmith commented Oct 18, 2017

I noticed this when my content on DTube wasn't loading: the hashes DTube is looking for can't be found in my files list on the IPFS -> Files page. Investigation showed:

Example:
If I do:
ipfs add --nocopy thumb.jpg
I get hash:
zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS

If I do:
ipfs add thumb.jpg
I get hash:
QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3

The second one seems to be the correct one...

@magik6k
Member

magik6k commented Oct 18, 2017

Due to some inner workings, a --nocopy add implicitly enables the --raw-leaves option. If you add your file with that option, the hash will be the same as with --nocopy.
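
For example, using the hashes from the original report (a hedged illustration; real output may also include a progress bar), the first two commands produce the same CID and only the third differs:

  $ ipfs add --nocopy thumb.jpg
  added zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS thumb.jpg
  $ ipfs add --raw-leaves thumb.jpg
  added zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS thumb.jpg
  $ ipfs add thumb.jpg
  added QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3 thumb.jpg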

@PCSmith
Author

PCSmith commented Oct 18, 2017

I updated the web site to match this new information...

[screenshot]

:D

Jokes aside -- I hope this is being interpreted as a bug.... Clearly identical files should have the same hash.

@whyrusleeping
Member

@PCSmith as @magik6k said, there is a difference in the default importer parameters for adding with the --nocopy flag. If you add your file with ipfs add --raw-leaves thumb.jpg you will get the same hash as with the --nocopy flag.

While the bytes of the file being referenced are the same, the structure of the graph is different. The first hash you show describes a graph where the data-containing leaf nodes have some protobuf wrapping. The nocopy/raw-leaves hash describes a graph where the data-containing nodes are simply the data, with nothing else.

Clearly identical files should have the same hash.

Identical graphs should have the same hash; the problem you're running into here is really that when passing --nocopy, you're implicitly passing --raw-leaves. We could have made --nocopy fail if the user didn't also pass --raw-leaves, but that was deemed more confusing for users.
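
One way to see the structural difference (a hedged sketch; the sizes shown are made-up placeholders) is to compare the two root blocks: the raw-leaves root is exactly the file's bytes, while the other root is the same bytes plus a small protobuf envelope, so its block is slightly larger:

  $ ipfs block stat zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS
  Key: zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS
  Size: 24000    <- hypothetical value: exactly the size of thumb.jpg
  $ ipfs block stat QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3
  Key: QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3
  Size: 24011    <- hypothetical value: the same bytes wrapped in a UnixFS protobuf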

@PCSmith
Author

PCSmith commented Oct 19, 2017

Thanks for the added information. That tells me that nocopy is pointless, because to match up with most other users I can't use it -- DTube-type applications being a case in point. :(

Even outside DTube... If I host a file, a user gets it, likes it, and then adds/pins it, the link for my file is still served to the rest of the world from a single point, because the pinner is hosting a different hash... That defeats the purpose of IPFS as well, right?

@kevina
Contributor

kevina commented Oct 19, 2017

@PCSmith the reason that the current filestore only supports raw-leaves is that eventually the use of raw-leaves will be the default, and this will become less of an issue over time.

If there is enough interest (and it is something that is wanted by the core IPFS team) it may be possible to provide support for the legacy format in the filestore. My original implementation (see #2634) has it and it may be possible to backport the change into the much simpler implementation currently in IPFS.

@Stebalien
Member

If I host a file, a user gets it, likes it, and then adds/pins it, the link for my file is still served to the rest of the world from a single point, because the pinner is hosting a different hash...

If they get it through IPFS, no; they'll have the same hash. They'll only have a different hash if they extract it from IPFS (e.g., by fetching it from a gateway) and re-add it (at this point, you're just using IPFS as a local datastore and as a webserver).
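
Put differently, someone who wants to help host content they already have the hash for should pin that hash directly rather than re-adding extracted bytes. A minimal sketch, using the CID from this issue:

  $ ipfs pin add QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3
  pinned QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3 recursively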

Unique fingerprint

The fingerprints are unique; different files will never have the same fingerprint. There are multiple ways to fingerprint each file but each fingerprint maps to at most one file.

Defeats the purpose of IPFS as well right?

We can't guarantee that adding the same file twice will end up with the same hash. That would make it impossible for us to improve or change anything. For example, we'd like to introduce file-type specific chunkers that chunk up large files on add based on how they are likely to be used (so applications can quickly seek to the part of the file they're interested in without downloading the entire file). If we guaranteed every file would map to exactly one hash, we wouldn't be able to make improvements like this.
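
The chunker is already configurable today, which is one concrete way the same bytes can legitimately map to different hashes. A rough illustration (the file name and CIDs are placeholders, not real output):

  $ ipfs add --chunker=size-262144 big-video.mp4
  added QmFixedSizeChunksPlaceholder big-video.mp4
  $ ipfs add --chunker=rabin big-video.mp4
  added QmRabinChunksPlaceholder big-video.mp4    <- different root because the chunk boundaries differ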

removes duplications

IPFS removes duplicate blocks (chunks), not files. This often leads to file-level deduplication, but that's not a guarantee. As a matter of fact, removing duplicate blocks is arguably better, as multiple files often share similar blocks. For example, consider two similar but non-identical VM images. By deduplicating at the level of blocks and allowing for different chunking algorithms, we can define special-purpose chunkers for, e.g., VM images that deduplicate identical files within the VM images.
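
If you want to see block-level sharing for yourself, one hedged approach is to list every block each root references and intersect the two lists (the CIDs here are placeholders):

  $ ipfs refs -r QmVmImageOnePlaceholder | sort > one.refs
  $ ipfs refs -r QmVmImageTwoPlaceholder | sort > two.refs
  $ comm -12 one.refs two.refs    # blocks that appear in both graphs are stored (and transferred) only once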

@kevina
Contributor

kevina commented Oct 20, 2017

Relevant issue regarding hashing the actual content and not how the file is stored in IPFS: ipfs/notes#126

@PCSmith
Author

PCSmith commented Oct 20, 2017

"If there is enough interest"

Well I wish I had known about this limitation before I spent so much time symlinking and adding my entire video library. Complete waste apparently. I am certainly interested, as nocopy is the only viable way for me to use IPFS.

"They'll only have a different hash if they extract it from IPFS (e.g., by fetching it from a gateway) and re-add it"

Let's say I'm a layman. Someone gives me a hash for a file, let's say a video. I "ipfs get" it. Then I magically know to change the file extension to mp4 -- because for some reason file names aren't supported -- so that it will open in my video player. I double-click the file and watch it. I like it, so I want to help them host it. Even if at that time I remember the original hash, I have to know not to "ipfs add" the file I downloaded but to "ipfs pin add" the original hash, or else I won't be helping at all, and I probably won't even notice that I'm not helping. It's just an added layer of confusion and detracts from the clarity of use, is all I'm saying.

Relevant quote from kevina's link:
"if someone somewhere discovers a lost file in some offline archive and decides to upload that file (or the whole archive) to the Permanent Web, the file is likely to yield a different IPFS hash and thus an old hyperlink (which references the original IPFS hash) is still doomed to remain broken forever. Such behaviour is not okay for the Permanent Web."

I'm here out of love and excitement for this project. Please do not interpret my abrupt writing style as abrasive. <3

@Stebalien
Member

because for some reason file names aren't supported

They are, the same way they're supported on your local file system: file names are properties of directories, not files. You can, e.g., call ipfs add --wrap-with-directory myFile to add a file with a filename and a wrapping directory. If we were to attach filenames directly to files, identical files with different names would always have different hashes (the exact problem you're trying to avoid).
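
A quick sketch of that (the wrapping directory's CID is a placeholder; the inner file CID is the one from this issue, since wrapping does not change how the file itself is imported):

  $ ipfs add --wrap-with-directory thumb.jpg
  added QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3 thumb.jpg
  added QmWrappingDirPlaceholder
  $ ipfs cat /ipfs/QmWrappingDirPlaceholder/thumb.jpg    # the name survives as a directory entry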

However, we agree that the current "just give a file a hash" system is confusing. The plan (caveat, this will take months) is to make commands like ipfs add act more like the ipfs files command. That is, by default, you won't just "add" files to IPFS, you'll add files into your own virtual filesystem stored within IPFS. This way, all files have a name, locally at least. This won't help users who download your file by raw hash (i.e., /ipfs/Qm...) but it should make things a bit less confusing.
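
The existing ipfs files (MFS) commands already work roughly this way; a minimal sketch (the paths are made up):

  $ ipfs files mkdir /videos
  $ ipfs files cp /ipfs/QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3 /videos/thumb.jpg
  $ ipfs files ls /videos
  thumb.jpg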

Even if at that time I remember the original hash, I have to know not to "ipfs add" the file I downloaded but to "ipfs pin add" the original hash, or else I won't be helping at all, and I probably won't even notice that I'm not helping.

This is a specific problem we can, and should, address.

  1. In the future, we'd like to improve the fuse interface to the point where you don't have to ipfs get files to your home directory. You should be able to just "open" IPFS files. This way, you don't even have to, e.g., wait for your video to finish downloading; you can just open it and stream it as you play it.
  2. A planned improvement for IPFS is to record the chunking algorithm used into a file's metadata. A second step we could take would be to embed this into the file's extended attributes when the user calls ipfs get to download the file into their local filesystem. This way, if the user adds the file back to IPFS, we can use the same chunking algorithm when possible. Note: this is a brittle solution, not all filesystems support extended attributes (and, e.g., HTTP, dropbox, etc. certainly don't).

I'm here out of love and excitement for this project. Please do not interpret my abrupt writing style as abrasive. <3

Welcome! We love new contributors. If you're open to some suggestions, there are a few ways to keep technical conversations on track and avoid unintentionally alienating participants: avoid making general and/or non-actionable complaints and avoid personalizing issues.

By general statements, I mean things like "this is useless", "this is pointless", etc. Statements like that can be very frustrating for us because they don't lead to specific, actionable issues; i.e., things we can actually fix. They also tend to be wildly incorrect: "this is useless" usually means "I can't use this to solve my specific issue" and "this is pointless" usually means "I don't see the point".

By personalizing, I mean things like:

Well I wish I had known about this limitation before I spent so much time symlinking and adding my entire video library. Complete waste apparently.

This is probably just an expression of your frustration, but this makes us feel like we've personally hurt you by wasting your time. It doesn't solve anything and just makes us feel bad.

TL;DR: Try to keep technical discussions precise and technical (note: this tangent is distinctly not technical by nature).

@yuri-sevatz

Just jumping in here (albeit a little uninformed and new 👍). I read over what @PCSmith, @Stebalien, and @kevina described about how raw-leaves and non-raw-leaves currently generate different hashes. I wanted to restate the general concern a little more plainly, because it only seems obvious to me after having re-read this thread: "the way I added a file may not generate the same hash as the way another person added the same file, even though we both used the latest version of IPFS. We are afraid this will fragment the swarm."

My thinking is that it's okay for different versions of IPFS to generate potentially more optimized chunks/hashes than earlier versions when new files are added, because devs need the ability to change the storage algorithm over time. However, doing that will fragment the network by making it harder for peers with different hashes to help each other. Hopefully the question of whether a repo's existing content hashes can be upgraded has been raised as a natural consequence of this fragmentation, along with the tradeoffs to network health of upgrading them or not.

This comment stuck out to me though:

@PCSmith the reason that the current filestore only supports raw-leaves is that eventually the use of raw-leaves will be the default, and this will become less of an issue over time.

If the intent is to progress towards eventual consistency in the hashing defaults, where someday everything generated with ipfs add will use raw-leaves (once the new storage algorithm stabilizes), then old hashes that are not using raw-leaves will be deprecated. But it looks like any old hashes still floating around in different peers' repos don't have a way to eventually converge towards consistency with the new filestore algorithm, let alone the payloads cached in the swarm's repos behind those hashes.

Does (or should) IPFS have a plan for non-raw-leaf hashes + files in the repo to be converted to raw-leaf hashes when raw-leaves eventually becomes the new default for adding? Or are the old hashes something that IPFS plans on just deprecating by having them fizzle out in the swarm to eventually die?

My intuition tells me that if peers have enough of an archive pinned, in theory they might be able to all independently arrive at the same upgraded hash when a new version of IPFS is pushed, without having to generate any network traffic. Obviously it's impossible to do this if there's insufficient data on a peer who might have pieces of a payload from an old hash, but if a peer has enough data in their repo to rebuild a whole file, shouldn't we be talking about upgrading that file?

Am I on the right track, or is this even a good idea?

@djdv
Contributor

djdv commented Feb 11, 2018

@yuri-sevatz

then old hashes that are not using raw-leaves will be deprecated

When the defaults for add change, support for the old hashes is not dropped; anyone should always be able to get and pin old hashes, so they won't suddenly become unavailable. At worst, deprecation would mean not being able to generate older-style hashes, but even that seems unlikely. It's more likely you'll be able to generate old hashes by just specifying the old defaults,
i.e. --cid-version=0 in the event we move up to v1 for the defaults, and/or --raw-leaves=false when the default becomes true, etc.
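
For instance, a hedged sketch assuming a future where CIDv1 and raw leaves have become the defaults (both flags already exist today; the file name is a placeholder):

  $ ipfs add --cid-version=0 --raw-leaves=false somefile.bin    # reproduce a legacy-style Qm... hash
  $ ipfs add somefile.bin                                       # whatever the then-current defaults produce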

they might be able to all independently arrive at the same upgraded hash when a new version of IPFS is pushed

Yes: if you hash a file with specific parameters, the hash has to match that of everyone else who uses the same parameters on the same content. This can be accomplished in several ways right now, but there's no direct rehash command. The simplest method is to run get and then add on the content (see the sketch below); you can also mount the content and add it back with your desired parameters, among other options.
Deprecation of a hash itself is left up to the network: if you publish something with an old hash, it will exist there; if you or someone else decides to use newer hashes for whatever reason, it's up to you to coordinate the move.
Even though some of the newer hashing algorithms may be more efficient or secure, and some of the chunking may be more optimal for certain datasets, I don't think the deltas will ever be large enough to matter in practice. That being said, nothing prevents people from making such an effort. If, for instance, I publish a video file and someone else (or I) notices that it would be much more efficient with hashing algorithm X and chunking scheme Y, anyone can just publish the new hash, and people can choose to use it instead or maintain the old one if they wish.
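
The get-then-re-add approach mentioned above looks roughly like this (the old CID, output path, and target parameters are illustrative):

  $ ipfs get QmOldStyleHashPlaceholder -o /tmp/video.mp4
  $ ipfs add --cid-version=1 --raw-leaves /tmp/video.mp4    # re-add with the parameters you want; this yields a new, different CID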

I think the specific issue @PCSmith is having isn't really about hashing the content so much as mirroring an existing hash; it seems like it will be resolved by #3981.
That would eliminate the need to get and then add --nocopy the file yourself; instead you would just run a single command like ipfs get --filestore -o /place/I/store/my/files to create a local copy outside of the datastore that is still available to the network via the filestore.

This is all talk of the future though. I ask @whyrusleeping and @kevina to look over my comment since they have more bearing on the direction of both of these things.

Am I on the right track, or is this even a good idea?

I can see the value in having something like an ipfs rehash /ipfs/Qm... that takes in add arguments and spits out a recalculated hash. Or maybe allow add to take IPFS paths as input.

@NiKiZe

NiKiZe commented Jul 18, 2018

If I'm understanding this correctly:
ipfs add --raw-leaves thumb.jpg (call it QmRawLeaves1) and ipfs add thumb.jpg (QmFullData2) generate different hashes.
But is some of the data shared between the two?

If QmRawLeaves1 was added to the network, it can't be accessed via QmFullData2 (!)
If QmFullData2 was added to the network, it can be accessed via QmRawLeaves1?

So for the future we/I will want/need some way to have a hash/pointer that resolves to the data however it was added. (Could we have an extra basic sha256 lookup method? :D)

@Stebalien
Member

If I'm understanding this correctly:
ipfs add --raw-leaves thumb.jpg (call it QmRawLeaves1) and ipfs add thumb.jpg (QmFullData2) generate different hashes.
But is some of the data shared between the two?

In this case, no part is shared unfortunately. This is actually one of the reasons we introduced the --raw-leaves option. Without the --raw-leaves option, the "leaves" (the data nodes) have a wrapping "envelope" (the data is embedded in a protobuf). The --raw-leaves option tells IPFS to not wrap these nodes (using features introduced later in the development of IPFS).
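
A hedged way to convince yourself of this is to dump the first bytes of each root block: the raw-leaves root is just the JPEG itself, while the other root is protobuf framing with the same JPEG bytes nested inside (the expected output is described rather than captured):

  $ ipfs block get zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS | xxd | head -n 2
  (expect the dump to begin with ffd8 ff.., i.e. the JPEG bytes and nothing else)
  $ ipfs block get QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3 | xxd | head -n 2
  (expect the dump to begin with 0a.., a protobuf field tag, with the JPEG bytes further in)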

When we hit 1.0, we'll flip a switch and turn options like raw-leaves on by default.

If QmRawLeaves1 was added to the network, it can't be accessed via QmFullData2 (!)

Yes.

If QmFullData2 was added to the network, it can be accessed via QmRawLeaves1?

No.

So for the future we/I will want/need some way to have a hash/pointer that resolves to the data however it was added. (Could we have an extra basic sha256 lookup method? :D)

If you chunk/format your data differently, it'll have a different hash. If you add different metadata, it'll have a different hash. If you use some fancy new option, it'll have a different hash. Unfortunately, there's nothing we can do about this.

Now, we could be a bit smarter. That is, we could take data in one format, convert it to the other, and rehash. We wouldn't have to record the new data, just the hashes and how we got them. However, that's not a huge issue at the moment and adds a ton of complexity.

@bailer

bailer commented Feb 11, 2020

This was a great discussion which shines some light on how hashing and blocks work, but I think the issue could be closed; the documentation (at least now) states that --nocopy implies --raw-leaves.

@Stebalien
Member

Good point. I've filed #6891 to close this.
