-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Only save md5 in lock file if no sha256 is present #764
Conversation
Could you give some statistics on file size or compressed file reductions for: https://github.com/rerun-io/rerun (e.g)? |
@Hofer-Julian Maybe something like this to test the compression? import zlib
# Step 1: Read the file contents
input_file_path = 'input.txt'
with open(input_file_path, 'rb') as f:
file_data = f.read()
# Step 2: Compress the contents
compressed_data = zlib.compress(file_data)
# Step 3: Save the compressed data
output_file_path = 'output.txt.zlib'
with open(output_file_path, 'wb') as f:
f.write(compressed_data)
print(f"Compressed file saved as {output_file_path}") |
By saving only one of those two hashes we reduce the lock file size a bit. This should be especially noticeable in git repos since hashes compress poorly
ceea433
to
3c7fff7
Compare
Here you go: nu ❯ ls --du | sort-by size | reverse
╭───┬────────────────────────────┬──────┬───────────┬───────────────╮
│ # │ name │ type │ size │ modified │
├───┼────────────────────────────┼──────┼───────────┼───────────────┤
│ 0 │ conda-lock_with_md5 │ file │ 1.7 MiB │ 8 minutes ago │
│ 1 │ conda-lock_without_md5 │ file │ 1.6 MiB │ 8 minutes ago │
│ 2 │ conda-lock_with_md5.zip │ file │ 241.0 KiB │ 2 minutes ago │
│ 3 │ conda-lock_without_md5.zip │ file │ 199.0 KiB │ 2 minutes ago │
╰───┴────────────────────────────┴──────┴───────────┴───────────────╯ The fraction between with and without md5 not zipped is 0.94. That indicates that the impact is indeed higher for compressed situations. |
That's pretty significant for the compressed case :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@tdejager @Hofer-Julian The reason we store all the metadata including the md5 and the sha256 is because a matchspec can depend on these values. E.g you can have torch[md5=123456789]. That is also the reason we store the license etc. I know this is an edge case but by removing this we can no longer statitically verify the lock-file correctly. We also break incremental updates. |
Do we have examples of this being used? |
Not from the top of my head no. But if we remove this field we can also remove a bunch of others: e.g. size, file_name, timestamp, etc. But I would be opposed to that because it would potentially break a future incremental update of a lock-file. We have been advocating for storing the entire repodata record in the lock-file for exactly this reason. |
See also the design goals and text about conparison with conda-lock: https://docs.rs/rattler_lock/latest/rattler_lock/#relation-to-conda-lock |
Yes you are right, we might need to come up with alternative solutions though, we could revert this. Because the large lock files are a problem :) |
This reverts commit 4ee9217.
## 🤖 New release * `rattler`: 0.26.4 -> 0.26.5 * `rattler_cache`: 0.1.0 -> 0.1.1 * `rattler_conda_types`: 0.25.2 -> 0.26.0 * `rattler_macros`: 0.19.3 -> 0.19.4 * `rattler_package_streaming`: 0.21.3 -> 0.21.4 * `rattler_networking`: 0.20.8 -> 0.20.9 * `rattler_shell`: 0.20.9 -> 0.21.0 * `rattler_lock`: 0.22.12 -> 0.22.13 * `rattler_repodata_gateway`: 0.20.5 -> 0.21.0 * `rattler_solve`: 0.24.2 -> 0.25.0 * `rattler_libsolv_c`: 0.19.3 -> 0.19.4 * `rattler_virtual_packages`: 0.19.15 -> 0.19.16 * `rattler_index`: 0.19.17 -> 0.19.18 <details><summary><i><b>Changelog</b></i></summary><p> ## `rattler` <blockquote> ## [0.26.5](rattler-v0.26.4...rattler-v0.26.5) - 2024-07-08 ### Added - add direct url repodata building ([#725](#725)) ### Fixed - errors should not contain trailing punctuation ([#763](#763)) - run clippy on all targets ([#762](#762)) </blockquote> ## `rattler_cache` <blockquote> ## [0.1.1](rattler_cache-v0.1.0...rattler_cache-v0.1.1) - 2024-07-08 ### Added - add direct url repodata building ([#725](#725)) ### Fixed - run clippy on all targets ([#762](#762)) </blockquote> ## `rattler_conda_types` <blockquote> ## [0.26.0](rattler_conda_types-v0.25.2...rattler_conda_types-v0.26.0) - 2024-07-08 ### Added - add support for zos-z ([#753](#753)) - return pybytes for sha256 and md5 everywhere and use md5 hash for legacy bz2 md5 ([#752](#752)) - add direct url repodata building ([#725](#725)) - add shards_base_url and write shards atomically ([#747](#747)) ### Fixed - allow version following package in strict mode ([#770](#770)) - Fix doctests and start testing them again ([#767](#767)) - skip over implicit `0` components when copying ([#760](#760)) - allow empty json repodata ([#745](#745)) - lenient and strict parsing of equality signs ([#738](#738)) - This fixes parsing of `ray[default,data] >=2.9.0,<3.0.0` ([#732](#732)) </blockquote> ## `rattler_macros` <blockquote> ## [0.19.4](rattler_macros-v0.19.3...rattler_macros-v0.19.4) - 2024-07-08 ### Other - update README.md </blockquote> ## `rattler_package_streaming` <blockquote> ## [0.21.4](rattler_package_streaming-v0.21.3...rattler_package_streaming-v0.21.4) - 2024-07-08 ### Fixed - run clippy on all targets ([#762](#762)) </blockquote> ## `rattler_networking` <blockquote> ## [0.20.9](rattler_networking-v0.20.8...rattler_networking-v0.20.9) - 2024-07-08 ### Fixed - errors should not contain trailing punctuation ([#763](#763)) </blockquote> ## `rattler_shell` <blockquote> ## [0.21.0](rattler_shell-v0.20.9...rattler_shell-v0.21.0) - 2024-07-08 ### Added - add direct url repodata building ([#725](#725)) - allow passing custom environment to run_activation ([#743](#743)) </blockquote> ## `rattler_lock` <blockquote> ## [0.22.13](rattler_lock-v0.22.12...rattler_lock-v0.22.13) - 2024-07-08 ### Added - Only save md5 in lock file if no sha256 is present ([#764](#764)) - return pybytes for sha256 and md5 everywhere and use md5 hash for legacy bz2 md5 ([#752](#752)) - add direct url repodata building ([#725](#725)) ### Fixed - lock file stability issues with PyPI types ([#761](#761)) - errors should not contain trailing punctuation ([#763](#763)) ### Other - revert only save md5 in lock file if no sha256 is present ([#766](#766)) </blockquote> ## `rattler_repodata_gateway` <blockquote> ## [0.21.0](rattler_repodata_gateway-v0.20.5...rattler_repodata_gateway-v0.21.0) - 2024-07-08 ### Added - improve error message when parsing file name ([#757](#757)) - add direct url repodata building ([#725](#725)) - add shards_base_url and write shards atomically ([#747](#747)) ### Fixed - direct_url query for windows ([#768](#768)) - Fix GatewayQuery.query to filter records based on provided specs ([#756](#756)) - run clippy on all targets ([#762](#762)) - allow empty json repodata ([#745](#745)) ### Other - document gateway features ([#737](#737)) </blockquote> ## `rattler_solve` <blockquote> ## [0.25.0](rattler_solve-v0.24.2...rattler_solve-v0.25.0) - 2024-07-08 ### Added - add direct url repodata building ([#725](#725)) - add tool to generate resolvo snapshots ([#741](#741)) ### Fixed - run clippy on all targets ([#762](#762)) - This fixes parsing of `ray[default,data] >=2.9.0,<3.0.0` ([#732](#732)) ### Other - bump resolvo to 0.6.0 ([#733](#733)) - Document all features on docs.rs ([#734](#734)) </blockquote> ## `rattler_libsolv_c` <blockquote> ## [0.19.4](rattler_libsolv_c-v0.19.3...rattler_libsolv_c-v0.19.4) - 2024-07-08 ### Other - update README.md </blockquote> ## `rattler_virtual_packages` <blockquote> ## [0.19.16](rattler_virtual_packages-v0.19.15...rattler_virtual_packages-v0.19.16) - 2024-07-08 ### Added - add support for zos-z ([#753](#753)) ### Fixed - run clippy on all targets ([#762](#762)) </blockquote> ## `rattler_index` <blockquote> ## [0.19.18](rattler_index-v0.19.17...rattler_index-v0.19.18) - 2024-07-08 ### Other - updated the following local packages: rattler_conda_types, rattler_package_streaming </blockquote> </p></details> --- This PR was generated with [release-plz](https://github.com/MarcoIeni/release-plz/).
By saving only one of those two hashes, we reduce the lock file size a bit. This should be especially noticeable in git repos, since hashes compress poorly.
This PR is a contribution to improving prefix-dev/pixi#1509