-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Order of hashes no longer canonical #105
Comments
The original issue was: #93 The reasoning was that when the code that determines whether to edit the list of hashes kick in, it should be agnostic of order. Because, the order of them does not matter to Also, I don't understand why you point out this commit. There's no change of order of hashes there. |
By the way, I couldn't help but to notice... for requirement in $requirements; do
echo -e "\r Hashing $requirement..."
$other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirement}
done Can, in theory, be replaced with: echo -e "\r Hashing $requirements..."
$other_python -m hashin -r $contrib/deterministic-build/requirements${i}.txt ${requirements} Referring to this. |
Yes, there is no change there. I just linked to that as an example of rerunning the script. I have just locally rerun the script again, and noticed this ordering issue now.
If pip is agnostic to it, it might as well be canonical. :) Thanks for the code cleanup suggestion. |
Actually, I think the order is deterministic and it's based on PyPI. What I mean is that it downloads, for example, https://pypi.org/pypi/hashin/json and based on the E.g. when you run
the order is
and that is always the case. So, what you get in your requirements file is something like this: --- Old
+++ New
@@ -0,0 +1,3 @@
+hashin==0.14.1 \
+ --hash=sha256:5dbf06b2adc32e5e438b9b9b352622b787c8555d4a2aa88b38d64764526f81a1 \
+ --hash=sha256:43ae3277fbd937ba1c99bc3fef9e4dd2e68cd5f2eea8f60ebdb53d83983d861b
What #93 was about was being smarter about "merging" the hashes. For example, suppose I take that diff above, but manually, in my editor, rearrange the order so it looks like this:
(note that the only change is the order of the hashes). Now, if I were to run
You get an empty output because the hashes have NOT changed according to Basically, as long as you don't manually "mess" with the order of the hashes, it is deterministic. Always. And even if you did mess with the order of the hashes So, no the order is not canonical but it's certainly deterministic as long as PyPI.org is. |
I think that Hashin should take a stronger approach here than "whatever PyPI.org does". Consistency is useful, both for deterministic builds and for smaller, easier to read diffs. I think that sorting the hashes in the file anytime Hashin writes them goes a long way towards that end. Deterministic build are important because of projects like Debian's ReproducibleBuilds, which strive to increase trust and verifiability of packages. These are some of the same goals that hashing packages in Python is intended to solve. Another tool that is similar to Hashin also uses sorting to help consistency: Yarn. It sorts entries both in |
@di Do you have an opinion on this one? Should the requirements file that I would argue that it's a no-problem because the only time it comes up is if you compare two different requirements files made in two different places when PyPI might have returned files in a different order. E.g. # On machine X
$ hashin Django -r reqs-x.txt # On machine Y
$ hashin Django -r reqs-y.txt
$ rsync reqs-y.txt machineX: # On machine X
# This *will* diff if the order of files
# on https://pypi.org/pypi/Django/json changed over time.
$ diff reqs-x.txt reqs-y.txt Contrast with doing it all and always on the same machine: # Monday
$ hashin Django==2.1.5 -r /path/to/reqs.txt # Tuesday
# This will ONLY diff if the SET of hashes has changed between Monday and Tuesday
$ hashin Django==2.1.5 -r /path/to/reqs.txt --dry-run Essentially, what if set(hashes_from_parsing_existing_file) != set(hashes_from_pypi_download):
amend_requiments_file(hashes_from_pypi_download) |
In Electrum's case, the file is not always generated by the same developer. I think it's unrealistic to assume in general that frozen dependency files are always updated from the same machine. |
@peterbe Thanks for looping me in. I would expect the hashes to be sorted lexicographically. The ordering from PyPI doesn't have much meaning to users: it is ordered by the filename of the distribution artifact. Since |
I get the feeling that all y'all people are saying that it would be good to NOT rely on the sorting of PyPI and instead always write down the hashes lexicographically sorted. Yes, it has the advantage that the files created a more predictable when compared as a whole. I don't think the existing logic that determines whether it should amend the hashes should change but when it does it should sort the hashes before writing them. |
Yes I think that's a good solution. |
Anybody willing to take a look at #110 ? The unit tests plus my extensive manual testing (in the PR description) makes me pretty confident already but, you know, more eyes etc. |
Fixes #105 CC @SomberNight ### Here's how I tested it: First observe that for `requests` the sha256 digests appear in this order: 1. `7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b` 2. `502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e` <details> ▶ curl https://pypi.org/pypi/requests/json | jq '.releases["2.21.0"]' % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 107k 100 107k 0 0 483k 0 --:--:-- --:--:-- --:--:-- 485k [ { "comment_text": "", "digests": { "md5": "ed3af234ffcad0b3c1e521e1dfde19be", "sha256": "7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b" }, "downloads": -1, "filename": "requests-2.21.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ed3af234ffcad0b3c1e521e1dfde19be", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*", "size": 57987, "upload_time": "2018-12-10T15:40:08", "url": "https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1bcd0e0977c3f8db1848ba0e2b7ab904", "sha256": "502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e" }, "downloads": -1, "filename": "requests-2.21.0.tar.gz", "has_sig": false, "md5_digest": "1bcd0e0977c3f8db1848ba0e2b7ab904", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*", "size": 111528, "upload_time": "2018-12-10T15:40:11", "url": "https://files.pythonhosted.org/packages/52/2c/514e4ac25da2b08ca5a464c50463682126385c4272c18193876e91f4bc38/requests-2.21.0.tar.gz" } ] </details> Run it the first time: ``` ▶ python hashin.py -r /tmp/reqs.txt requests ▶ cat /tmp/reqs.txt requests==2.21.0 \ --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e \ --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b ``` Note that `502a824...` comes before `7bf2a778...` which is different from the JSON from Pypi.org. Next, mess with the list of hashes manually: ``` ▶ jed /tmp/reqs.txt ▶ cat /tmp/reqs.txt requests==2.21.0 \ --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b \ --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e ▶ python hashin.py -r /tmp/reqs.txt requests --dry-run ``` In other words, it didn't touch the file. It's because when it compares the previous hashes with the found hashes from pypi, it does it by comparing to `set`s. Now really mess with it: ``` ▶ cat /tmp/reqs.txt requests==2.21.0 \ --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b \ --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e ▶ jed /tmp/reqs.txt ▶ cat /tmp/reqs.txt requests==2.21.0 \ --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b ▶ python hashin.py -r /tmp/reqs.txt requests ▶ cat /tmp/reqs.txt requests==2.21.0 \ --hash=sha256:502a824f31acdacb3a35b6690b5fbf0bc41d63a24a45c4004352b0242707598e \ --hash=sha256:7bf2a778576d825600030a110f3c0e3e8edc51dfaafe1c146e39a2027784957b ``` This time, the sets aren't equal so it goes to write down the hashes and when it does it does so in a sorted order.
This is now out in version 0.14.5. |
Since version 0.14, the order of hashes does not seem to be deterministic.
Before 0.14, the hashes used to be lexicographically ordered.
I am guessing this was changed in #95
What is the reason for this?
We use hashin in electrum for deterministic builds.
The requirements files are generated by this script.
The
freeze_packages.sh
script is manually rerun to update dependencies (see e.g. spesmilo/electrum@ba4af29).If the order of hashes for a given package is not canonical, then rerunning the script can result in changes for packages that have not actually been updated, making the diff considerably less readable.
The text was updated successfully, but these errors were encountered: