nocopy adds files with different hashes than without nocopy #4318
I noticed this when my content on DTube wasn't loading: the hashes DTube is looking for can't be found in the files list of the IPFS web UI's Files page. Investigation showed:

Example:

If I do:
ipfs add --nocopy thumb.jpg
I get hash:
zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS

If I do:
ipfs add thumb.jpg
I get hash:
QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3

The second one seems to be the correct one...
Comments
Due to some inner workings, nocopy add implicitly enables --raw-leaves.
@PCSmith as @magik6k said, there is a difference in the default importer parameters when adding with the --nocopy flag. While the bytes of the file being referenced are the same, the structure of the graph is different. The first hash you show describes a graph where the data-containing leaf nodes have some protobuf wrapping. The nocopy/raw-leaves hash describes a graph where the data-containing nodes are simply the data, with nothing else.
Identical graphs should have the same hash; the problem you're running into here is really that when passing --nocopy, --raw-leaves is implicitly enabled, so the resulting graph is not identical to the one a plain add produces.
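To make this concrete, here is a hedged sketch reusing the hashes reported in this issue (assumes the filestore experiment is enabled, which --nocopy requires; `-q` just prints the hash): `--nocopy` and an explicit `--raw-leaves` agree with each other, but not with a plain add.

```sh
# Plain add: leaf data wrapped in unixfs protobuf nodes (CIDv0, Qm...).
ipfs add -q thumb.jpg
# QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3

# --nocopy implies --raw-leaves: leaves are the bare file bytes (zb2..., a raw CID).
ipfs add -q --nocopy thumb.jpg
# zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS

# --raw-leaves alone builds the same graph, so it reproduces the nocopy hash.
ipfs add -q --raw-leaves thumb.jpg
# zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS
```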
Thanks for the added information. That tells me that nocopy is pointless, because to match up with most other users I can't use it. DTube-type applications being a case in point. :( Even outside DTube... If I host a file, a user gets it, likes it, and then adds/pins it, the link from my file to the rest of the world is still a single point, because the pinner is hosting a different hash... Defeats the purpose of IPFS as well, right?
@PCSmith the reason that the current filestore only supports raw-leaves is that the use of raw-leaves will eventually become the default, so this will become less of an issue over time. If there is enough interest (and it is something that is wanted by the core IPFS team), it may be possible to provide support for the legacy format in the filestore. My original implementation (see #2634) has it, and the change could perhaps be backported into the much simpler implementation currently in IPFS.
If they get it through IPFS, no; they'll have the same hash. They'll only have a different hash if they extract it from IPFS (e.g., by fetching it from a gateway) and re-add it (at this point, you're just using IPFS as a local datastore and as a webserver).
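To illustrate that distinction (a hedged sketch; QmOriginalHash is a placeholder for a real hash): pinning keeps the publisher's exact graph, while extracting the bytes and re-adding them builds a new graph under whatever parameters your own add uses.

```sh
# Helping host: pin the original graph; your node serves the exact same hash.
ipfs pin add QmOriginalHash

# Extract-and-re-add: same bytes, but a rebuilt graph whose hash may
# differ if your add parameters differ from the publisher's.
ipfs get QmOriginalHash -o video.mp4
ipfs add video.mp4            # may print a different hash

# Both hashes nevertheless resolve to byte-identical content:
ipfs cat QmOriginalHash | sha256sum
sha256sum video.mp4
```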
The fingerprints are unique; different files will never have the same fingerprint. There are multiple ways to fingerprint each file but each fingerprint maps to at most one file.
We can't guarantee that adding the same file twice will end up with the same hash. That would make it impossible for us to improve or change anything. For example, we'd like to introduce file-type specific chunkers that chunk up large files on add based on how they are likely to be used (so applications can quickly seek to the part of the file they're interested in without downloading the entire file). If we guaranteed every file would map to exactly one hash, we wouldn't be able to make improvements like this.
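For example (a sketch: the `--chunker` option and these strategies exist in go-ipfs; `bigfile.iso` is a stand-in), changing the chunker changes block boundaries and therefore the root hash, while the file bytes stay the same:

```sh
# Default fixed-size chunking: 256 KiB blocks.
ipfs add -q --chunker=size-262144 bigfile.iso

# Content-defined (Rabin fingerprint) chunking: different boundaries,
# different leaves, different root -- same underlying bytes.
ipfs add -q --chunker=rabin bigfile.iso
```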
IPFS removes duplicate blocks (chunks), not files. This often leads to file-level deduplication, but that's not a guarantee. As a matter of fact, deduplicating blocks is usually better, as multiple files often share similar blocks. For example, consider two similar but non-identical VM images. By deduplicating at the block level and allowing for different chunking algorithms, we can define special-purpose chunkers for, e.g., VM images that deduplicate identical files within the VM images.
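A hedged way to see block-level deduplication at work (file names invented; assumes the default fixed-size chunker so the shared prefix falls on identical block boundaries):

```sh
# b.bin shares a 512 KiB prefix (two full 256 KiB chunks) with a.bin.
head -c 524288 /dev/urandom > a.bin
head -c 4096   /dev/urandom > tail.bin
cat a.bin tail.bin > b.bin

ipfs add -q a.bin    # prints QmA (placeholder)
ipfs add -q b.bin    # prints QmB (placeholder)

# The first two leaf blocks appear in both listings: those chunks are
# stored once in the repo and shared by both files.
ipfs refs QmA
ipfs refs QmB
```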
Relevant issue regarding hashing the actual content and not how the file is stored in IPFS: ipfs/notes#126
Well, I wish I had known about this limitation before I spent so much time symlinking and adding my entire video library. A complete waste, apparently. I am certainly interested, as nocopy is the only viable use case for IPFS for me.
Let's say I'm a layman. Someone gives me a hash for a file, let's say a video. I "ipfs get" it. Then I magically know to change the file extension to .mp4 -- because for some reason file names aren't supported -- so that it will open in my video player. I double-click the file and watch it. I like it, so I want to help them host it. Even if at that time I remember the original hash, I have to know not to "ipfs add" the file I downloaded but to "ipfs pin add" the original hash; otherwise I won't be helping at all, and I probably won't even notice that I'm not helping. It's just an added layer of confusion and detracts from the clarity of use, is all I'm saying. Relevant quote from kevina's link: I'm here out of love and excitement for this project. Please do not interpret my abrupt writing style as abrasive. <3
They are, the same way they're supported on your local file system: file names are properties of directories, not files. You can, e.g., call ipfs add -w (--wrap-with-directory) to wrap the file in a directory that records its name, as sketched below. However, we agree that the current "just give a file a hash" system is confusing. The plan (caveat: this will take months) is to make commands like ipfs add and ipfs get handle file names by default.
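A sketch of that directory-wrapping workaround as it exists today (the `-w`/`--wrap-with-directory` flag is real; the hashes are placeholders):

```sh
# Wrap the file in a directory so its name travels with the hash.
ipfs add -w thumb.jpg
# added QmFileHash thumb.jpg
# added QmDirHash

# Anyone can now fetch it by name, extension intact:
ipfs get /ipfs/QmDirHash/thumb.jpg
```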
This is a specific problem we can, and should, address.
Welcome! We love new contributors. If you're open to some suggestions, there are a few ways to keep technical conversations on track and avoid unintentionally alienating participants: avoid making general and/or non-actionable complaints and avoid personalizing issues. By general statements, I mean things like "this is useless", "this is pointless", etc. Statements like that can be very frustrating for us because they don't lead to specific, actionable issues; i.e., things we can actually fix. They also tend to be wildly incorrect: "this is useless" usually means "I can't use this to solve my specific issue" and "this is pointless" usually means "I don't see the point". By personalizing, I mean things like:
This is probably just an expression of your frustration, but this makes us feel like we've personally hurt you by wasting your time. It doesn't solve anything and just makes us feel bad. TL;DR: Try to keep technical discussions precise and technical (note: this tangent is distinctly not technical by nature).
Just jumping in here (albeit a little uninformed and new 👍). I read over what @PCSmith, @Stebalien, and @kevina described about how raw-leaves and non-raw-leaves currently generate different hashes. I wanted to restate this a little more plainly, because it seems obvious to me after having re-read this thread.

The general concern here is: "the way I added a file may not generate the same hash as the way another person added the same file, even though we both used the latest version of IPFS. We are afraid this will fragment the swarm."

My thinking is that it's okay for newer versions of IPFS to generate better-optimized chunks/hashes than earlier versions when new files are added, because devs need the ability to change the storage algorithm over time. However, doing so will fragment the network by slowing the ability of peers to help each other across different hashes. Hopefully the question of whether a repo's content hashes can be upgraded has been raised as a natural consequence of this fragmentation, along with the trade-offs that upgrading them (or not) poses to network health.

This comment stuck out to me though:
If the intent is to progress towards eventual consistency in the hashing defaults, where someday everything generated with ipfs add will use raw-leaves (once the new storage algorithm stabilizes), then old hashes that do not use raw-leaves will be deprecated. But it looks like any old hashes still floating around in different peers' repos have no way to converge towards consistency with the new filestore algorithm, let alone the payloads cached in the swarm's repos behind those hashes.

Does (or should) IPFS have a plan for non-raw-leaf hashes and files in the repo to be converted to raw-leaf hashes when raw-leaves eventually becomes the default for adding? Or are the old hashes something that IPFS plans to deprecate by letting them fizzle out in the swarm and eventually die?

My intuition tells me that if peers have enough of an archive pinned, in theory they might all independently arrive at the same upgraded hash when a new version of IPFS is pushed, without generating any network traffic (see the sketch below). Obviously this is impossible if a peer only has pieces of a payload from an old hash, but if a peer has enough data in its repo to rebuild a whole file, shouldn't we be talking about upgrading that file? Am I on the right track, or is this even a good idea?
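Purely as a hypothetical sketch of such a local upgrade pass (this loop is invented; it handles only pinned single-file objects whose blocks are fully present, since ipfs cat fails on directories, and `--only-hash` means nothing is actually re-stored):

```sh
# For each recursively pinned root, rebuild the file bytes from the local
# repo and compute the hash it would have under the new raw-leaves default.
for h in $(ipfs pin ls --type=recursive -q); do
  ipfs cat "$h" | ipfs add --only-hash --raw-leaves -q
done
```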
When the defaults for ipfs add change, existing hashes will keep working; old content doesn't become unreachable, it's just addressed differently than newly added copies of the same bytes.
Yes, if you were to hash a file with specific parameters, the hash has to match everyone else who uses the same parameters on the same content. This can be accomplished in several ways right now, but there's no single direct command for it. I think the specific issue @PCSmith is having isn't really about hashing the content so much as mirroring an existing hash; it seems like it will be resolved by #3981. This is all talk of the future, though. I ask @whyrusleeping and @kevina to look over my comment since they have more bearing on the direction of both of these things.
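One way to get matching hashes today is to state every graph-shaping parameter explicitly rather than relying on defaults (hedged sketch; these flags all exist in go-ipfs, and `--only-hash` computes the hash without storing anything):

```sh
# Two users running this exact command on the same bytes must get the
# same hash, regardless of what future releases change the defaults to.
ipfs add --only-hash \
         --cid-version=0 \
         --raw-leaves=false \
         --chunker=size-262144 \
         thumb.jpg
```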
I can see the value in having something like an ipfs add option that mirrors the parameters used to create an existing hash.
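No such option exists today; this is a purely hypothetical sketch of the interface being suggested (`--params-from` is invented for illustration):

```sh
# HYPOTHETICAL: read the chunker/raw-leaves/CID-version used by an existing
# object and add the local file with those same parameters.
ipfs add --params-from=QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3 thumb.jpg
```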
If I'm understanding this correctly: if QmRawLeaves1 was added to the network, it can't be accessed via QmFullData2 (!). So for the future, we will want/need some way to have a hash/pointer to data that points to it however it was added. (Could we have an extra basic sha256 lookup method? :D)
In this case, no part is shared, unfortunately. This is actually one of the reasons we introduced the CID format: the identifier itself records how the data is encoded. When we hit 1.0, we'll flip a switch and turn options like raw-leaves on by default.
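For instance (hedged; the `ipfs cid` helper landed in later go-ipfs releases), re-encoding a CID never changes which graph it names, which is exactly why the raw-leaves hash can't be reached from the protobuf one by manipulating the identifier alone:

```sh
# Same DAG, different spelling: CIDv0 -> CIDv1 base32.
ipfs cid base32 QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3
# bafy... (still names the protobuf-leaf graph)

# The zb2... raw-leaves hash from this issue names a *different* DAG built
# from the same bytes; no re-encoding of the Qm... CID will produce it.
```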
Yes.
No.
If you chunk/format your data differently, it'll have a different hash. If you add different metadata, it'll have a different hash. If you use some fancy new option, it'll have a different hash. Unfortunately, there's nothing we can do about this. Now, we could be a bit smarter. That is, we could take data in one format, convert it to the other, and rehash. We wouldn't have to record the new data, just the hashes and how we got them. However, this isn't a pressing need at the moment, and it would add a ton of complexity.
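You can do that conversion by hand today for a single file (a hedged sketch; `ipfs add` reads stdin, and `--only-hash` computes the alternative hash without storing a second copy):

```sh
# Stream the bytes stored under the protobuf-leaf hash and compute what
# their hash would be in the raw-leaves format.
ipfs cat QmazMBf1fQCZhXZHjk5nKb1Sui3D4Gw2YR9W8DP6umEYk3 \
  | ipfs add --only-hash --raw-leaves -q
# expected: zb2rhbtqSALGLSW19ZRF4QJKrohwwNNueg4qp7hAu9CJYgUyS
```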
This was a great discussion which shines some light on how hashing and blocks work, but I think the issue can be closed: the documentation (at least now) states that --nocopy implies --raw-leaves.
Good point. I've filed #6891 to close this.