Added archive validation for all available algorithms in hashlib. #8851

GMouzourou · 2024-01-05T01:56:33Z

Resolves: #4578

May also contribute to #7881, as it's unclear whether the known hashes being checked are all sha256.

Added tests for changed code.
Updated documentation for changed code.

radoering · 2024-01-05T06:07:21Z

Some thoughts:

Regardless of the changes in this PR, it looks like we are checking that the hash of an archive matches the hash of any known archive. Actually, we should check that it matches the hash of the correct archive since we know the name of the archive, shouldn't we?
With this PR we calculate multiple hashes if we have several files with different hash types. What's the performance impact? (This would be a non-issue if we'd fix the above point first.)

dimbleby · 2024-01-05T08:10:33Z

This certainly has nothing to do with #7778

GMouzourou · 2024-01-05T10:19:12Z

Hi @radoering,

When calculating the known hashes, we filter on f["file"] == archive.name, so we only check hashes for the given file.

What would impact performance is if we had multiple hash types for the same file (e.g. sha256, sha512, ...). This can be seen on the line where we generate archive_hashes. This could be changed to just picking the first hash type from hash_types if required.
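As a rough illustration of the filtering described above (not the actual Poetry code): assuming each entry in a package's file list is a dict with a `file` name and a `hash` of the form `<algorithm>:<hexdigest>`, the known hashes for one archive can be grouped by hash type. The helper name `known_hashes_for` and the sample data are made up for this sketch.

```python
from __future__ import annotations


def known_hashes_for(
    files: list[dict[str, str]], archive_name: str
) -> dict[str, set[str]]:
    """Group the known hex digests of one archive by hash type."""
    by_type: dict[str, set[str]] = {}
    for f in files:
        if f["file"] != archive_name:
            continue  # this hash belongs to a different archive
        hash_type, _, digest = f["hash"].partition(":")
        by_type.setdefault(hash_type, set()).add(digest)
    return by_type


# Hypothetical lock-file entries: one archive listed with two hash types.
files = [
    {"file": "demo-1.0.tar.gz", "hash": "sha256:aaaa"},
    {"file": "demo-1.0.tar.gz", "hash": "sha512:bbbb"},
    {"file": "demo-1.0-py3-none-any.whl", "hash": "sha256:cccc"},
]
hashes = known_hashes_for(files, "demo-1.0.tar.gz")
```

With more than one key in the result, every additional hash type would mean hashing the same archive again, which is the performance concern raised here.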

radoering · 2024-01-05T16:08:15Z

Thanks for the hint, I overlooked this. So in theory, the same archive could appear several times (with different hash types) in the list. (I'm not sure whether this can happen in practice but let's assume it can.)

> This could be changed to just picking the first hash type from hash_types if required.

I don't think picking the first hash type is good enough. We do not want to rely on md5 if there are more secure hashes. I suppose we have to put a list of preferred hash types somewhere (maybe in poetry.utils.helpers). Then we can use the first hash type in the list that is available and just fall back to a "random" choice if no hash type is in our list.

I assume normally there shouldn't be other hashes than these in the lockfile anyway so this might all be a theoretical issue after all:

poetry/src/poetry/repositories/http_repository.py

Lines 227 to 232 in 022308e

    
    if not link.hash or (
        link.hash_name is not None
        and link.hash_name not in ("sha256", "sha384", "sha512")
        and hasattr(hashlib, link.hash_name)
    ):
        file_hash = self.calculate_sha256(link) or file_hash

GMouzourou · 2024-01-06T02:35:44Z

I've created a get_highest_priority_hash_type function in poetry.utils.helpers as suggested. So now only one hash algorithm is used per archive (the most cryptographically secure). If you have an issue with the priority of the algorithms, let me know the priority you'd like and I'll update the list.

I've also added tests for multiple hash types of the same file, multiple files in the same package, and an unsupported hash type.
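To make "one hash algorithm per archive" concrete, here is a minimal sketch of validating a file against a set of known digests for a single, already chosen hash type. The function names `file_hash` and `archive_matches` are hypothetical and not taken from the PR; only `hashlib.new` and the standard file APIs are real.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path, hash_type: str, chunk_size: int = 8192) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    # hashlib.new raises ValueError for unsupported algorithm names.
    # Variable-length digests (shake_*) need a length and are not handled here.
    h = hashlib.new(hash_type)
    with path.open("rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def archive_matches(path: Path, hash_type: str, known_digests: set[str]) -> bool:
    """Check the archive once, using only the chosen hash type."""
    return file_hash(path, hash_type) in known_digests


# Usage with a throwaway file standing in for a downloaded archive.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo-1.0.tar.gz"
    archive.write_bytes(b"not a real archive")
    expected = hashlib.sha256(b"not a real archive").hexdigest()
    ok = archive_matches(archive, "sha256", {expected})
    bad = archive_matches(archive, "sha256", {"0" * 64})
```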

radoering · 2024-01-06T14:04:57Z

I think the approach makes sense in general. We just have to clarify some details.

I don't really like the repetitiveness of get_highest_priority_hash_type(). Further, you might get the log message that the hash is not in the priority list if it is in the list but not in hashlib.algorithms_available. I would probably rewrite it like this:

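from __future__ import annotations

import hashlib
import logging

logger = logging.getLogger(__name__)


def get_preferred_available_hash_type(hash_types: set[str]) -> str | None: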
    if not hash_types:
        return None

    # ordered by preference (most preferred first)
    preferred_hashes = ('sha512', 'sha256', ...)
    for chosen in preferred_hashes:
        if chosen in hash_types:
            if chosen in hashlib.algorithms_available:
                return chosen
            logger.debug("Hash type %s not available in hashlib", chosen)

    # do not consider hashes that are not available in hashlib again
    hash_types = hash_types - set(preferred_hashes)

    if hash_types:
        logger.debug("No hash type of %s in preferred hashes list", hash_types)
        # choose any available hash type
        for chosen in hash_types:
            if chosen in hashlib.algorithms_available:
                return chosen
            logger.debug("Hash type %s not available in hashlib", chosen)

    return None

We definitely want a parameterized unit test for this function that checks the special cases (regardless of the exact implementation).
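A table-driven test along those lines might look like the sketch below. It uses the standard library's `unittest` with `subTest` so that it has no dependencies; in a pytest-based suite the same table would typically go through `pytest.mark.parametrize`. The function is a compact re-implementation written only for this example, and the two-entry `PREFERRED` tuple is an assumption, not the list from the PR.

```python
from __future__ import annotations

import hashlib
import unittest

# Assumption for this sketch: a shortened preference list (most preferred first).
PREFERRED = ("sha512", "sha256")


def get_preferred_available_hash_type(hash_types: set[str]) -> str | None:
    available = hashlib.algorithms_available
    for candidate in PREFERRED:
        if candidate in hash_types and candidate in available:
            return candidate
    # Fall back to any other available type; sorted() keeps the choice deterministic.
    for candidate in sorted(hash_types - set(PREFERRED)):
        if candidate in available:
            return candidate
    return None


class PreferredHashTypeTest(unittest.TestCase):
    # (input hash types, expected choice)
    CASES = [
        (set(), None),                               # nothing to choose from
        ({"sha256"}, "sha256"),                      # single preferred type
        ({"sha256", "sha512"}, "sha512"),            # most preferred wins
        ({"md5", "sha256"}, "sha256"),               # preferred beats non-preferred
        ({"md5"}, "md5"),                            # fallback to a non-preferred type
        ({"not-a-real-hash"}, None),                 # unknown to hashlib
        ({"not-a-real-hash", "sha256"}, "sha256"),   # unknown type is ignored
    ]

    def test_cases(self) -> None:
        for hash_types, expected in self.CASES:
            with self.subTest(hash_types=hash_types):
                self.assertEqual(
                    get_preferred_available_hash_type(hash_types), expected
                )


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PreferredHashTypeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```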

> the most cryptographically secure

That's probably fine. I don't expect performance to be an issue but if it was that would also be a factor to consider...

> If you have an issue with the priority of the algorithms, let me know the priority you'd like and I'll update the list.

LGTM but I'm not an expert. I suppose we can change it later if someone with more expertise comes along. Just one thing: I'd probably omit md5 (and maybe sha1?) because they are known to be insecure. If we omit them from the preferred/priority list, we still have a chance to choose a better algorithm, which is unknown to us. Not something you should rely on but anyway...

GMouzourou · 2024-01-06T19:20:25Z

I didn't particularly like that big block of if statements either. I did originally write it as a match/case, but then realised poetry supports versions of Python older than 3.10.

I think I got the gist of where you were going in your code snippet. I pulled out the creation of the prioritised hash types into a global, as I didn't like the idea of doing all that collection manipulation for every archive and to only get the same answer every time. Also, a Python set is an unordered collection, so you wouldn't have got your intended behaviour.
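The idea of computing the prioritised list once, at module level, could be sketched roughly as follows. The names `_PREFERRED`, `PRIORITISED_AVAILABLE` and `pick`, as well as the order of the preference list, are assumptions made for illustration and are not the ones used in the PR.

```python
from __future__ import annotations

import hashlib

# Assumed preference order for illustration only (most preferred first).
_PREFERRED = ("sha512", "sha384", "sha256")

# Computed once at import time rather than for every archive.
# A tuple keeps the preference order; a set would not guarantee any order.
PRIORITISED_AVAILABLE = tuple(
    h for h in _PREFERRED if h in hashlib.algorithms_available
)


def pick(hash_types: set[str]) -> str | None:
    """Return the most preferred available type among hash_types, if any."""
    return next((h for h in PRIORITISED_AVAILABLE if h in hash_types), None)


best = pick({"sha256", "sha512"})
none_preferred = pick({"md5"})
```

Note that the argument passed in can safely be a set: only the preference list itself has to be ordered.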

As to your MD5 and SHA1 comment, preferring other algorithms is a must (which is what this implementation does). Omitting them from the list will remove the ability to ever use them. I took it that was not your intent, but if you did want to prohibit the use of archives with only an MD5 or SHA1 hash then I will remove them. That does tend to force people to point to git repositories to avoid any checks (or to lesser dependency management tools 😝), which is obviously worse. Is there a good mechanism to warn users if we have had to check an archive using MD5 or SHA1? Or perhaps create a feature to add a configuration option?

radoering · 2024-01-07T09:21:25Z

> I pulled out the creation of the prioritised hash types into a global, as I didn't like the idea of doing all that collection manipulation for every archive and to only get the same answer every time.

Makes sense.

> Also, a Python set is an unordered collection, so you wouldn't have got your intended behaviour.

Not sure what you are referring to. The preferred/prioritized hashes must be ordered, but the hash_types we pass to the function can be a set.

> Omitting them from the list will remove the ability to ever use them.

Not when using my code snippet. 😉 My idea is that we still choose an available hash type even if none is in our preferred list. That part is missing now. I'll add a comment in the code to make it clearer.

> Is there a good mechanism to warn users if we have had to check an archive using MD5 or SHA1?

We could have another list/set of deprecated hash types. However, I'm not sure if it's worth it.
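For what it's worth, such a deprecated set could be as small as the sketch below. Nothing here is taken from the PR: `DEPRECATED_HASH_TYPES` and `warn_if_deprecated` are hypothetical names, and whether a warning or a debug message is the right level is an open question.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)

# Hypothetical set of hash types considered insecure.
DEPRECATED_HASH_TYPES = frozenset({"md5", "sha1"})


def warn_if_deprecated(hash_type: str, archive_name: str) -> bool:
    """Log a warning when an archive had to be validated with a weak hash."""
    if hash_type in DEPRECATED_HASH_TYPES:
        logger.warning(
            "Archive %s was validated with %s, which is considered insecure",
            archive_name,
            hash_type,
        )
        return True
    return False


warned = warn_if_deprecated("md5", "demo-1.0.tar.gz")
not_warned = warn_if_deprecated("sha256", "demo-1.0.tar.gz")
```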

src/poetry/utils/helpers.py

GMouzourou · 2024-01-07T20:27:16Z

So I've:

Removed MD5 and SHA1 from the prioritised list.
Introduced your non_prioritized_available_hash_types and the processing that goes with it.

I didn't include the warning log you had, as this is catered for in executor.py.

GMouzourou · 2024-01-07T20:42:00Z

> > Also, a Python set is an unordered collection, so you wouldn't have got your intended behaviour.

> Not sure what you are referring to. The preferred/prioritized hashes must be ordered, but the hash_types we pass to the function can be a set.

I thought preferred_hashes = ('sha512', 'sha256', ...) was a set not a tuple. My bad.

tests/installation/test_executor.py

src/poetry/utils/helpers.py

radoering

Just some minor nitpicks.

src/poetry/installation/executor.py

src/poetry/utils/helpers.py

github-actions · 2024-03-03T18:46:01Z

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Added archive validation for all available algorithms in hashlib.

42f4138

GMouzourou added 3 commits January 6, 2024 01:40

Only validate archives using highest priority hash type

0957c00

Merge branch 'master' into master

c2e85da

Addressed MyPy issue

200c6f7

Refactored highest priority hash type assessment

e81a60d

radoering reviewed Jan 7, 2024

src/poetry/utils/helpers.py (outdated)

src/poetry/utils/helpers.py (outdated)

Added non prioritized available hash types for archive validation

19560db

radoering requested changes Jan 9, 2024

tests/installation/test_executor.py (outdated)

tests/installation/test_executor.py (outdated)

src/poetry/utils/helpers.py

GMouzourou added 6 commits January 9, 2024 20:27

Parameterised known hash test cases

93d5924

Merge branch 'master' into master

e31a9fc

Updated known hashes test cases

89ce18d

Merge branch 'master' into master

6500c7e

Updated known hashes test cases

c3712f4

Added test for get_highest_priority_hash_type

ae99609

radoering requested changes Jan 14, 2024

src/poetry/installation/executor.py (outdated)

src/poetry/installation/executor.py (outdated)

src/poetry/utils/helpers.py (outdated)

src/poetry/utils/helpers.py

GMouzourou and others added 3 commits January 16, 2024 23:14

Addressed review comments

1d4fa46

Merge branch 'master' into master

d58d22e

Merge branch 'master' into master

4901db1

radoering approved these changes Jan 20, 2024


radoering merged commit a67c0ec into python-poetry:master Jan 20, 2024
20 checks passed

radoering mentioned this pull request Feb 17, 2024

release: bump version to 1.8.0 #8985

Merged

github-actions bot locked as resolved and limited conversation to collaborators Mar 3, 2024
