Support hashings other than MD5 #3069

Open
adrinjalali opened this issue Jan 6, 2020 · 53 comments
Labels
feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint question I have a question?

Comments

@adrinjalali

adrinjalali commented Jan 6, 2020

The docs seem to imply that md5 is the only hash users can use. The vulnerabilities of md5 have made it not usable in many organizations which require a certain level of security.

Is there a way to use SHA1 or SHA256 or other hashes instead?

I saw some PRs (#2022 for instance), but they're closed w/o being merged. What's the progress on that front?

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jan 6, 2020
@adrinjalali adrinjalali changed the title Support other sha hash Support hashings other than MD5 Jan 6, 2020
@efiop
Contributor

efiop commented Jan 6, 2020

Hi @adrinjalali !

In our case md5 is not used for cryptography; it is simply used for file hashing, where it is still suitable (at least I'm not aware of any issues with it in such scenarios), so it is safe to use for that purpose. The contents of the cache files are not encrypted; you can see that for yourself by opening any file under .dvc/cache. If you need to encrypt your data, you could do that pretty easily on the remote itself (e.g. the sse option for an s3 remote) or encrypt in your pipeline along the way and only track those encrypted files. We also have #1317 for in-house encryption, please feel free to leave a comment there, so it is easier for us to estimate the priority. Would the cloud-side encryption be suitable for your scenario?

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jan 6, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 6, 2020
@efiop efiop added question I have a question? triage Needs to be triaged labels Jan 6, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 6, 2020
@adrinjalali
Author

I understand these hashes shouldn't be the only defense layer against a malicious player, but it is still a very useful thing. Imagine a malicious player who has write access to the storage (s3 bucket in this case). They can take a model file, modify it maliciously in a way that keeps the md5 unchanged, and upload the new one. The rest of the pipeline then won't detect a change in the file, and will serve the new, wrong model.

Again, I understand this should not be seen as the only way to secure the models, but in places where software goes through security audit before being used, md5 is one of the first things they're gonna use to reject it.

Another point is that having a SHA is not expensive or complicated at all, so why not include it?
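To illustrate the point that computing a SHA digest takes little code, here is a minimal sketch using only Python's standard `hashlib`. The helper name `file_digest` and the sample file name are made up for this example and are not DVC functions; the file is read in chunks so that large model files do not need to fit in memory.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as fobj:
        # iter() with a sentinel stops as soon as read() returns b"".
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Hash one small sample file with both algorithms.
with tempfile.TemporaryDirectory() as tmpdir:
    sample = Path(tmpdir) / "model.bin"  # hypothetical file name
    sample.write_bytes(b"abc")
    md5_hex = file_digest(sample, "md5")
    sha256_hex = file_digest(sample, "sha256")

print("md5:   ", md5_hex)
print("sha256:", sha256_hex)
```

Only the algorithm name changes between the two calls, which is why the cost of offering SHA-256 lies mostly in migration and compatibility rather than in the hashing code itself.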

@JavierLuna

I agree with @adrinjalali, using MD5 to check for integrity is far from ideal.

As SHA-1 is starting to show collisions as well, DVC could join git and switch the hashing function to SHA-256.

@efiop
Contributor

efiop commented Jan 7, 2020

@adrinjalali @JavierLuna Thanks for the info, guys! We will support other hash functions eventually, it just wasn't the focus for us, but we will reconsider it. We've had an attempt from an outside contributor, #1920, but he wasn't able to finalise it 🙁 If anyone is interested in contributing this feature (similar to #1920, where sha is an opt-in config option for now), we will be happy to help with everything we can.

I'll leave this issue open in addition to #1676, as this current one is about simply replacing md5 with something else (possibly as an option), but #1676 is originally about some custom hash functions.

@JavierLuna

I'd like to help with both issues!
Maybe I'll try working on #1676 and set the default hashing function to sha-256?

@efiop
Contributor

efiop commented Jan 7, 2020

@JavierLuna That would be great! 🙏

Maybe I'll try working on #1676

Let's start with the current one, similar to how #1920 does it. Custom hashes could be implemented later, especially since we don't really know what type of hash the creator of #1676 wants.

set the default hashing function to sha-256

Please don't set it as the default; it should be configurable. Md5 should stay the default for now. If the feature is implemented correctly, we will be able to switch to sha-256 later with a 1-line PR 🙂

Please let us know if you have any questions :) FYI: we also have a #dev-general channel in our discord, feel free to join 🙂
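The "configurable, md5 by default, switchable with a 1-line PR" idea above could look roughly like the following sketch. All names here (`SUPPORTED_HASHES`, `DEFAULT_HASH`, `get_hasher`, the `"hash"` config key) are hypothetical and chosen for illustration; they are not taken from DVC or from #1920.

```python
import hashlib

# Hypothetical registry: none of these identifiers come from DVC itself.
SUPPORTED_HASHES = {
    "md5": hashlib.md5,
    "sha256": hashlib.sha256,
}

# The single line a later PR would change to flip the default.
DEFAULT_HASH = "md5"


def get_hasher(config):
    """Return a new hash object for the algorithm named in `config`."""
    name = config.get("hash", DEFAULT_HASH)
    try:
        factory = SUPPORTED_HASHES[name]
    except KeyError:
        raise ValueError(f"unsupported hash: {name!r}") from None
    return factory()


print(get_hasher({}).name)                  # falls back to the default
print(get_hasher({"hash": "sha256"}).name)  # explicit opt-in
```

Keeping the default in one constant means every call site goes through the same lookup, so an opt-in user and a future default change exercise the same code path.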

@pared
Contributor

pared commented Jan 8, 2020

Also, we need to remember that changing the default checksum will result in a whole-cache recalculation for users upgrading their DVC version, and this could bring a lot of issues. Unless keeping md5 is a severe problem, the change of the default checksum should probably be left for a major version release.

@efiop efiop added p3-nice-to-have It should be done this or next sprint feature request Requesting a new feature and removed awaiting response we are waiting for your reply, please respond! :) labels Jun 23, 2020
efiop added a commit to efiop/dvc that referenced this issue Aug 15, 2020
Currently we kinda assume that whatever is returned by `get_file_hash`
is of type self.PARAM_CHECKSUM, which is not actually true. E.g. for
http it might return `etag` or `md5`, but we don't distinguish between
those and call both `etag`. This is becoming more relevant for dir
hashes that are computed a few different ways (e.g. in-memory md5 or
upload to remote and get etag for the dir file).

Prerequisite for iterative#4144 and iterative#3069
efiop added a commit that referenced this issue Aug 15, 2020
Currently we kinda assume that whatever is returned by `get_file_hash`
is of type self.PARAM_CHECKSUM, which is not actually true. E.g. for
http it might return `etag` or `md5`, but we don't distinguish between
those and call both `etag`. This is becoming more relevant for dir
hashes that are computed a few different ways (e.g. in-memory md5 or
upload to remote and get etag for the dir file).

Prerequisite for #4144 and #3069
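The problem described in this commit message (a bare string that might be an `md5` or an `etag`) can be illustrated with a small value type that carries the scheme name alongside the value. This is only a sketch under assumptions: the fields and the `same_content` helper are invented for illustration and are not claimed to match DVC's actual `HashInfo`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HashInfo:
    """A hash value tagged with the name of the scheme that produced it."""

    name: Optional[str]   # e.g. "md5" or "etag" (assumed field)
    value: Optional[str]  # the hex digest or ETag string (assumed field)

    def __bool__(self):
        # An empty or missing value means "no hash known".
        return bool(self.value)


def same_content(a: HashInfo, b: HashInfo) -> bool:
    """Compare values only when both came from the same scheme."""
    return bool(a) and bool(b) and a.name == b.name and a.value == b.value


local = HashInfo("md5", "900150983cd24fb0d6963f7d28e17f72")
remote = HashInfo("etag", "900150983cd24fb0d6963f7d28e17f72")
print(same_content(local, local))   # same scheme, same value
print(same_content(local, remote))  # same string, different scheme
```

Tagging the value makes it impossible to silently treat an ETag as an md5, which is also what any later move to a second hash algorithm would depend on.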
efiop added a commit to efiop/dvc that referenced this issue Aug 25, 2020
efiop added a commit that referenced this issue Aug 25, 2020
efiop added a commit to efiop/dvc that referenced this issue Aug 30, 2020
@efiop efiop mentioned this issue Aug 30, 2020
2 tasks
efiop added a commit that referenced this issue Aug 30, 2020
* dvc: use HashInfo

Related to #4144 , #3069 , #1676

* Update dvc/tree/s3.py

Co-authored-by: Saugat Pachhai <[email protected]>

@iesahin

iesahin commented Oct 3, 2021

We need to stop having a completely critical security vulnerability as a matter of urgency.
The goal of supporting multiple hashes is nice, however that is a far secondary concern from moving away from MD5.

I agree completely.

There are no practical export restrictions on crypto in this world anymore. This fear is about 20 years out of date.

Laws change. That was an example though. More algorithms mean more points to consider, more options require more updates and more maintenance in the future.

@efiop
Contributor

efiop commented Oct 3, 2021

Guys, we've been using md5 for historical reasons. Initially, we only had pipeline management where md5 was used as a checksum and not a hash, so it was perfectly suitable for the task. Data management grew from it and kept using md5. We are well aware that md5 is not a great hash, and will be switching to something more suitable (likely some SHA) in 3.0, so stay tuned.

@da2ce7

da2ce7 commented Oct 3, 2021

@efiop

Guys, we've been using md5 for historical reasons. Initially, we only had pipeline management where md5 was used as a checksum and not a hash, so it was perfectly suitable for the task. Data management grew from it and kept using md5. We are well aware that md5 is not a great hash, and will be switching to something more suitable (likely some SHA) in 3.0, so stay tuned.

Why don't you release a security update that updates to https://github.com/BLAKE3-team/BLAKE3 (my recommendation) as a matter of priority?

Having this sort of security vulnerability addressed 'in the next major update' isn't very professional IMHO...

@Erotemic
Contributor

Erotemic commented Oct 3, 2021

@da2ce7 While I agree DVC needs to move away from MD5 ASAP, I disagree that the choice to delay to the next major update is unprofessional. Changing the hashing algorithm is a major change that will require existing users to rename all of their files in the dvc-caches. This is not trivial.
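To make the migration cost concrete: since cache files are named after their hash, switching algorithms means re-reading every file to compute its new name. The sketch below plans such renames for a flat directory. This is a simplifying assumption; a real DVC cache uses nested directories and directory manifests, and the .dvc files that reference the old hashes would need updating too. The helper names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file's full contents with SHA-256, chunk by chunk."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def plan_renames(cache_dir):
    """Map each existing cache file name to its name under SHA-256."""
    plan = {}
    for path in sorted(Path(cache_dir).iterdir()):
        if path.is_file():
            # Every file must be read in full: the old name tells us
            # nothing about the new digest.
            plan[path.name] = sha256_of(path)
    return plan


# Demo on a throwaway "cache" holding one object named by its md5.
with tempfile.TemporaryDirectory() as tmpdir:
    content = b"data"
    old_name = hashlib.md5(content).hexdigest()
    (Path(tmpdir) / old_name).write_bytes(content)
    plan = plan_renames(tmpdir)

for old, new in plan.items():
    print(old, "->", new)
```

The work is proportional to the total size of the cache, which is why the change is treated here as a breaking one rather than a routine update.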

In fact IMHO it is highly professional to be mindful of semantic versioning when making large breaking changes like this, even in the case of security.

As I mentioned beforehand, the primary use-case of DVC does not require a cryptographically secure hashing algorithm. The attack vectors in the most common cases are niche at best (note: I recognize there are serious concerns in some cases, but my argument is that these are not common). Much of the security community will pounce on anything that would be insecure in some context without really ever considering the real-world threat model in which that thing is applied. I think that is a mistake. It certainly doesn't warrant getting personal.

Regardless, I would second an urgent push to release DVC 3.0 with BLAKE3 as the default (and possibly only) hash algorithm.

@efiop
Contributor

efiop commented Oct 3, 2021

I agree with @Erotemic on his points.

One thing about blake though, is that it is not FIPS/NIST certified, which is a significant problem for people that need to comply with those, so I doubt that we will choose blake by default, and hence why I think that some version of SHA is more realistic.

One more thing is that people sometimes confuse the checksums we use for pipeline checks with the hashes (in the hashing-function sense, not the cryptographic-hash sense) that we use for data management. For the former, we will support other custom checksums that could be more relaxed than md5, which depends on the full binary content.

We've been working hard on 3.0 prerequisites for the past half a year or so, and are continually working on those as we speak. We not only want to switch to a different algorithm in 3.0, but also to provide better performance/ui/architecture/ecosystem for data management, and all of that while not ceasing releases with new features (experiments, dvc machine, plots, etc) and bug fixes for 2.0, so we've been gradually rebuilding that and will likely be ready for 3.0 in the upcoming months.

We really truly appreciate your interest 🙏

@da2ce7

da2ce7 commented Oct 4, 2021

@Erotemic

In fact IMHO it is highly professional to be mindful of semantic versioning when making large breaking changes like this, even in the case of security.

In the context of semantic versioning making a breaking change should always be signalled by a major number change. However, I do not feel like that was the core of the statement that @efiop was making. He wasn't referring to the fact that this breaking change will be called 3.0, but that this change will be included in 3.0.

As I mentioned beforehand, the primary use-case of DVC does not require a cryptographically secure hashing algorithm. The attack vectors in the most common cases are niche at best (note: I recognize there are serious concerns in some cases, but my argument is that these are not common). Much of the security community will pounce on anything that would be insecure in some context without really ever considering the real-world threat model in which that thing is applied. I think that is a mistake. It certainly doesn't warrant getting personal.

The problem with MD5 is that it is desperately broken, and somebody could commit to a hash of a good file, and then later silently swap out that file for an evil variant (if they control the server).

@efiop

One thing about blake though, is that it is not FIPS/NIST certified, which is a significant problem for people that need to comply with those, so I doubt that we will choose blake by default, and hence why I think that some version of SHA is more realistic.

Now that is a very niche use case. FIPS/NIST is normally enforced in the core security aspects of a system, such as: when hashing passwords, or for the digest of a cryptographic signature.

In the case for DVC the use case is different: Here we use the cryptographic hash to assure that there is only a single matching blob of data that can be referenced by any single commit within the git repository.

DVC routinely deals with large files. So the performance of the hashing algorithm is of quite some importance:

All three SHA variants are not very good for our case.

  • SHA-1 is quite slow and it is also considered broken.
  • SHA-512-256 is extremely secure and very slow. (SHA-512-256 is faster than SHA-256 for long strings).
  • SHA-3 is also extremely secure and even slower.

In addition, SHA-1 and SHA-2 both are not parallelizable, and are subject to length extension attacks.

On the other hand, BLAKE3 is perfect for our case.

It is extremely fast and very secure, and in particular it is highly parallelizable, so we can get a speedup from multi-core hardware for large files.
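Performance claims like the ones above are easy to check on your own hardware. The following is a rough timing sketch using only algorithms available in Python's standard `hashlib`. BLAKE3 is not in the standard library, so `blake2b` appears here purely as a stand-in from the same family, and the numbers say nothing about BLAKE3's multi-threaded mode. The function names are made up for this example.

```python
import hashlib
import time


def time_hash(name, data, repeats=3):
    """Return the best wall-clock time in seconds to hash `data` once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(name, data).hexdigest()
        best = min(best, time.perf_counter() - start)
    return best


def compare(names=("md5", "sha1", "sha256", "sha512", "sha3_256", "blake2b"),
            size=8 * 1024 * 1024):
    """Time each algorithm on the same in-memory buffer of `size` bytes."""
    data = b"\0" * size
    return {name: time_hash(name, data) for name in names}


if __name__ == "__main__":
    for name, seconds in sorted(compare().items(), key=lambda kv: kv[1]):
        print(f"{name:10s} {seconds * 1000:8.2f} ms")
```

Results vary a lot with the CPU (for example, hardware SHA extensions) and with the OpenSSL build behind `hashlib`, so treat any single run as indicative only.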

We really truly appreciate your interest 🙏

Thanks, our project is considering the use of DVC and the choice of the hashing algorithm naturally is an important consideration for us.

@iesahin

iesahin commented Oct 4, 2021

I also believe this change requires more testing and a major version upgrade.

BLAKE3 has technical advantages, but FIPS/NIST certification is an important point for some organizations. Setting the default to SHA-256, with possible versions that support other algorithms, might be a good solution. A plain pip install dvc could install the SHA-256 version, and we could also offer pip install 'dvc[blake3]' for a faster version.
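One way the optional-extra idea could work in code is an import-time fallback: use BLAKE3 when the third-party `blake3` package is installed and SHA-256 otherwise. This is a sketch under assumptions. The `dvc[blake3]` extra is only a proposal in this thread, and `new_hasher` and `digest` are hypothetical helper names.

```python
import hashlib

try:
    # Third-party package; present only if the optional extra is installed.
    from blake3 import blake3 as _blake3
except ImportError:
    _blake3 = None


def new_hasher():
    """Return (name, hasher): BLAKE3 when available, otherwise SHA-256."""
    if _blake3 is not None:
        return "blake3", _blake3()
    return "sha256", hashlib.sha256()


def digest(data: bytes):
    """Hash `data` and report which algorithm was actually used."""
    name, hasher = new_hasher()
    hasher.update(data)
    return name, hasher.hexdigest()


name, value = digest(b"abc")
print(name, value)
```

Returning the algorithm name with the digest matters here: two installations with different extras would otherwise produce incompatible cache keys with no way to tell them apart.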

@da2ce7

da2ce7 commented Oct 4, 2021

@iesahin

I also believe this change requires more testing and a major version upgrade.

BLAKE3 has technical advantages but having FIPS/NIST certification is an important point for some of the organizations.

Setting the default to SHA-256, with possible versions that support other algorithms, might be a good solution. A plain pip install dvc could install the SHA-256 version, and we could also offer pip install 'dvc[blake3]' for a faster version.

For the very few organisations that are constrained by the FIPS/NIST certification, they could use a different version of dvc. You are suggesting that we should cripple the speed of the default because of a hand-waving concern about FIPS/NIST certification.

There is a dramatic difference in performance between SHA-256 and BLAKE3. I think the default should be to have good performance.

If, and only if, you are forced to use FIPS/NIST-certified hashes for this part of your company infrastructure, then you could use SHA-512-256. (SHA-256 would be silly, as it is slower than SHA-512 for everything other than the smallest of strings.)

Primarily, can you point to any interested organisations that require FIPS/NIST-certified hashes for DVC files that they have in their git-repo? Is this actually a realistic goal? Is this such an important case to make it the default?

Otherwise, I suggest this is just FUD about FIPS/NIST certification requirements.

@iesahin

iesahin commented Oct 4, 2021

can you point to any interested organisations that require FIPS/NIST-certified hashes for DVC files that they have in their git-repo?

I can't because they probably can't use DVC now. :)

For the very few organisations that are constrained by the FIPS/NIST certification, they could use a different version of dvc.

This is a good idea too. I'll leave the decision to the project maintainers.

@Erotemic
Contributor

Erotemic commented Oct 4, 2021

For the very few organisations that are constrained by the FIPS/NIST certification

@da2ce7 I think you underestimate how many organizations that is. I work at an open source company and even I'm impacted by that (as much as it irritates me). Often it's not about actual security in these cases. Someone sees "OMG you're using a hash, it better be one in our passlist!" Then it devolves into pseudo-software hell from there.

The gov funds a lot of companies, and to be eligible you have to jump through their hoops, even if they don't make sense. (I wasn't able to use my ed25519 ssh key to login to a server recently. There was no way I was going to use the NIST curve, so I had to make a new RSA key. It felt a little dirty.)

I would really love something configurable in this case (and IMO it would be best to design in configurability from the ground up). I'd love to see blake3 as the default, and then a sha256 option as a hedge against FIPS/NIST certification issues. My prediction is that the probability they will come up and cause issues is high. But I think you're right that we shouldn't cripple DVC's speed in the default case because of short-sighted policy decisions that have a lot more impact than you might think if you aren't exposed to them.

Now that is a very niche use case. FIPS/NIST is normally enforced in the core security aspects of a system, such as: when hashing passwords, or for the digest of a cryptographic signature.

This is the only thing that gives me doubt. The wording in these certification guidelines is often vague and open to interpretation (quite the opposite of where I like to be). It could be the case that we argue DVC isn't a core security aspect (and based on my earlier arguments about why a crypto hash isn't always necessary for DVC, I would agree with that).

@iesahin your solution is effectively configuration of hashes, which I don't know if that is in the roadmap. I think the arguments here are showing that even if there is one technically superior hash (blake3), there are meatspace tradeoffs that need to be considered. If configuration is not on the roadmap, I suggest re-evaluating that decision. My opinion is that blake3 is the default and sha256 is offered for the poor souls who need it.

@iesahin

iesahin commented Oct 4, 2021

your solution is effectively configuration of hashes, which I don't know if that is in the roadmap

I'm not in a position to make these decisions.

What I'm suggesting is not setting an option in .dvc/config to change the hash algorithm. It should be done as a fork, and the fork may get a lower level of attention in support. This will also let the main branch adopt newer algorithms easily.

@shcheklein
Member

Looks like Linux is using blake now:

https://lore.kernel.org/lkml/[email protected]/

It sounds reasonable to make it configurable and make blake default.

@iesahin

iesahin commented Jan 3, 2022

If it's really required to make it configurable, I think making the digest size fixed (256 bits = 32 bytes) allows abstracting the digest -> path conversion. Instead of MD5, DVC may require the use of "any hash function that outputs 256 bits."

I still think making it configurable won't add much value, and using SHA2-256 is the most obvious option due to NSA blessings. Longer hash function digests may also be problematic for file system paths: 512 bits means 64 bytes, encoded as 128 characters in hexadecimal. In this case .dvc/cache/12/....EF will have about 140 characters. Windows systems traditionally have a 260-character limit for paths. (Encoding hashes in base64 to convert to paths might be possible, but NTFS also doesn't allow case-sensitive file names by default.)
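The digest-to-path arithmetic above can be checked with a few lines of Python. This sketch assumes the two-character prefix layout shown in the comment (`.dvc/cache/12/...`) and a relative root; the helper name `cache_path` is invented for illustration, and an absolute project path would add to every length printed.

```python
import hashlib
from pathlib import PurePosixPath


def cache_path(hex_digest, root=".dvc/cache"):
    """Map a hex digest to `<root>/<first 2 chars>/<remaining chars>`."""
    return PurePosixPath(root) / hex_digest[:2] / hex_digest[2:]


for name in ("md5", "sha256", "sha512"):
    digest = hashlib.new(name, b"example").hexdigest()
    path = cache_path(digest)
    print(f"{name:7s} {len(digest):3d} hex chars -> "
          f"relative path of {len(str(path))} chars")
```

Because the mapping only slices the hex string, any hash with a fixed output size fits the same layout, which is the abstraction the comment is pointing at.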

@bobertlo
Contributor

bobertlo commented Jan 4, 2022

I would like to point out that with hash configuration any future change would then not be a breaking change requiring a major version release.

I've just been watching this thread because the text/binary distinction in the hash has been limiting to my use cases. I don't think the use of MD5 is that bad or in urgent need of change, but it would be nice to be able to use another hash. I would personally be a lot more comfortable with something like SHA-256 (or 512-256) vs blake3 if there was not a choice. I like to be able to access the data externally, and sha256 is generally more available than blake3 (e.g. it is in the Go standard library).

@Erotemic
Contributor

Erotemic commented Jan 4, 2022

Because the discussion doesn't seem to reference this issue when I raised it, I'll note that I think a configurable hash could solve the main issue with immutable storage backends like IPFS as discussed here: #6777 (reply in thread)

If I run: ipfs add -nq setup.py I get the "hash" (really an IPFS CID): QmdsJ5Tn6E78KrGwvkrNRSoA9QvmX2iieo7Zv3GmVUcNMf, which is entirely based on the file contents. If the files in the .dvc/cache were stored with these IPFS "CIDs" then the issue of needing to know what the name of the file is on the remote is solved. The way I envision it, IPFS backend would require the IPFS-hasher, which allows for implementing an IPFS variant of dvc.fs.base.FileSystem.

@Erotemic
Contributor

@efiop Has there been any movement towards any of the ideas discussed here? I have an idea that I'd like to test out. Having an API that specifies the hash is necessary for it. Is a PR that adds a configurable hash API that is hard-coded to MD5 to start with (adding more hash algorithms can be discussed later) something that would be considered?

@robotrapta

robotrapta commented May 25, 2023

I would hope that the migration to a newer hashing algorithm would be done in a backwards-compatible way. Saying it has to wait until 3.0 because it's a breaking change implies something pretty troubling - does that mean dvc3 can only use new hash and won't work with existing stores? @shcheklein @efiop (I sure hope not.)

For organizations that have really invested in using DVC, migrating their existing stores from one hash func to another will be at least a big hassle and perhaps completely intractable. At the very least you need to redo the entire storage - which is certainly expensive but possible. But then you also need to track down every .dvc file which references it, and those could be scattered to the winds in places that can't all be found reliably.

The right way to do this doesn't require waiting for a major version. You just add an option today which is "new hash / old hash" when creating a new repo. In 3.0 you can/should make "new hash" the default. IMHO it would be very reasonable to offer this option for new hash without any migration tool whatsoever. It just means people who start using dvc today are better off. And those who have legacy MD5 data stores can either figure out the migration themselves or stick with what they have.

@brohee

brohee commented Jun 13, 2023

Seeing as I just stumbled upon this and just experimented, let's just be very clear with everyone: if your dataset includes files with the same MD5 sum, only the first one will ever be pushed to the remote; the others will be falsely deduplicated. It's super easy to test with the bins from https://github.com/corkami/collisions. Given how easy the files are to produce and the potential impact, I don't think this issue is treated with nearly the urgency needed. I'd say use Blake3 already, FIPS be damned. (possibly a bit more discussion at https://pouet.chapril.org/@brohee/110537633498422926)
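The false-deduplication effect described above follows directly from how a content-addressed store decides what to upload. The sketch below simulates it with a deliberately tiny toy hash (one hex character) so that collisions are guaranteed; it does not use real MD5 collision files, and `weak_hash` and `push_all` are hypothetical names, not DVC code.

```python
import hashlib


def weak_hash(data: bytes) -> str:
    """Toy stand-in for a broken hash: keep 1 hex char of SHA-256."""
    return hashlib.sha256(data).hexdigest()[:1]


def push_all(blobs, hash_func):
    """Store each blob under its hash, skipping keys the remote has."""
    remote = {}
    for blob in blobs:
        key = hash_func(blob)
        if key not in remote:  # looks "already pushed", so it is skipped
            remote[key] = blob
    return remote


blobs = [f"file-{i}".encode() for i in range(50)]
strong = push_all(blobs, lambda b: hashlib.sha256(b).hexdigest())
weak = push_all(blobs, weak_hash)

print(len(strong), "objects stored with the full digest")
print(len(weak), "objects stored with the colliding toy hash")
```

With only 16 possible keys and 50 distinct blobs, most uploads are silently skipped. Crafted MD5 collisions trigger the same code path, just with deliberately constructed inputs instead of a truncated hash.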

@efiop
Contributor

efiop commented Jun 15, 2023

Hey folks, thanks for your interest in this. We've been working hard towards making the hash configurable in our internals, and at this point we are pretty much only missing an actual config option exposed to the public and some high-level stuff (and obviously some polish around it). We don't have an immediate plan to work on that right now, but it is on our radar and we will be considering it in the near/mid future. Will keep everyone posted on the changes here.

@Erotemic
Contributor

Erotemic commented Jul 4, 2023

The new <cache>/files/<hash> folder in dvc 3.0 has been great so far; @efiop, you and the rest of the iterative team have done great work with that and all of the surrounding revamps!

Having spent the weekend slogging through md5 hashes, I'm so excited for blake3 and/or xxhash


@lefos99

lefos99 commented Jan 8, 2024

Nice this Blake3 cryptographic hash function looks really promising! 🤟

@derdre

derdre commented Jan 11, 2024

I would really like the option to use faster hash algorithms, e.g. Blake3. Especially in a project with a lot of files, this could accelerate my ML pipeline significantly.
Do you think this will be added anytime soon? Or is there a way to overwrite md5 already now?

@JohnB22

JohnB22 commented Oct 24, 2024

Any updates on the progress here? Is there a PR? FIPS compliance is a must for me, unfortunately!

@Erotemic
Contributor

@JohnB22 md5 can be used in a FIPS system as long as it's not part of any security-related component.

From NIST.FIPS.140-2.pdf page iv:

  1. Applicability. This standard is applicable to all Federal agencies that use cryptographic-based security
    systems to protect sensitive information in computer and telecommunication systems (including voice
    systems) as defined in Section 5131 of the Information Technology Management Reform Act of 1996,
    Public Law 104-106. This standard shall be used in designing and implementing cryptographic modules
    that Federal departments and agencies operate or are operated for them under contract. Cryptographic
    modules that have been approved for classified use may be used in lieu of modules that have been validated
    against this standard. The adoption and use of this standard is available to private and commercial
    organizations.

The key phrase being "applicable to all Federal agencies that use cryptographic-based security systems to protect sensitive information in computer and telecommunication systems".

Is DVC being used to protect information? No. It's a storage system. It is not used to control access. It is used for content addressable references. It's not hashing passwords, it's not protecting information, it's merely a lookup index.

You can still use Python dictionaries in a FIPS system even though their hash function is not cryptographic. The same reasoning applies here.

DVC still should move away from md5, but it's a niche case related to data integrity, not security. You can also check whether any data integrity issues have occurred. After you add data to DVC, if you have fewer unique files in your cache than you originally had, then you hit an md5 collision (which in all likelihood won't happen). If you really wanted to be robust to the small likelihood of an md5 collision, you could implement a system on top of DVC that precomputes a sha256 hash for each file and checks whether the number of unique sha256 hashes ever differs from the number of unique md5 hashes. That would guarantee (up to the even fewer limitations of sha256) that no data integrity errors have occurred.
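The cross-check suggested above can be sketched as a small standalone script that runs outside DVC. Rather than only comparing counts, this version groups sha256 digests by md5 digest, so it also reports which md5 value collided. The function names and sample files are made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def digests(path, names=("md5", "sha256"), chunk_size=1024 * 1024):
    """Compute several digests of one file in a single pass."""
    hashers = {name: hashlib.new(name) for name in names}
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def find_md5_collisions(paths):
    """Return md5 digests shared by files whose sha256 digests differ."""
    sha_by_md5 = {}
    for path in paths:
        found = digests(path)
        sha_by_md5.setdefault(found["md5"], set()).add(found["sha256"])
    return {md5: shas for md5, shas in sha_by_md5.items() if len(shas) > 1}


# Demo: two identical files are a legitimate duplicate, not a collision.
with tempfile.TemporaryDirectory() as tmpdir:
    samples = []
    for name, content in [("a.bin", b"one"), ("b.bin", b"two"), ("c.bin", b"one")]:
        sample = Path(tmpdir) / name
        sample.write_bytes(content)
        samples.append(sample)
    collisions = find_md5_collisions(samples)

print(collisions or "no md5 collisions found")
```

An empty result means every md5 value maps to exactly one sha256 value, which is the condition under which md5-based deduplication has not merged distinct files.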

Takeaway: when you are working with FIPS, understand its purpose and the role of hashes in your system, and when FIPS guidelines are or are not applicable.

@JohnB22

JohnB22 commented Oct 25, 2024

@Erotemic I understand that it's not used for security reasons. I should have been more clear.

For reasons out of my control, I am using a "FIPS Compliant" version of Python which disallows the use of hashlib.md5. So, despite its use case, the interpreter will not call the md5 function.

Doing some short research, however, I've noticed that since Python 3.9 hashlib includes a usedforsecurity parameter, which defaults to True. When set to False, it seems to allow the use of those functions in protected environments.

Would there be any difficulty in setting that flag?
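For reference, setting that flag looks like the sketch below. The `usedforsecurity` keyword is part of the standard `hashlib` constructors from Python 3.9 onward; whether a given FIPS-restricted interpreter then permits md5 depends on how it was built, so this is not a guarantee. The wrapper name `md5_checksum` is hypothetical.

```python
import hashlib
import sys


def md5_checksum(data: bytes) -> str:
    """Compute MD5 as a plain checksum, flagged as not security-related."""
    if sys.version_info >= (3, 9):
        # Keyword-only argument added to hashlib constructors in 3.9.
        hasher = hashlib.md5(data, usedforsecurity=False)
    else:
        hasher = hashlib.md5(data)
    return hasher.hexdigest()


print(md5_checksum(b"abc"))
```

The digest itself is unchanged by the flag; it only signals intent to the underlying crypto library, so existing md5-named cache entries would stay valid.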
