Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Only save md5 in lock file if no sha256 is present #764

Merged
merged 1 commit into from
Jul 5, 2024

Conversation

Hofer-Julian
Copy link
Collaborator

By saving only one of those two hashes, we reduce the lock file size a bit. This should be especially noticeable in git repos, since hashes compress poorly.

This PR is a contribution to improving prefix-dev/pixi#1509

@tdejager
Copy link
Collaborator

tdejager commented Jul 5, 2024

Could you give some statistics on file size or compressed file reductions for: https://github.com/rerun-io/rerun (e.g)?

@tdejager
Copy link
Collaborator

tdejager commented Jul 5, 2024

@Hofer-Julian Maybe something like this to test the compression?

import zlib

# Step 1: Read the file contents
input_file_path = 'input.txt'
with open(input_file_path, 'rb') as f:
    file_data = f.read()

# Step 2: Compress the contents
compressed_data = zlib.compress(file_data)

# Step 3: Save the compressed data
output_file_path = 'output.txt.zlib'
with open(output_file_path, 'wb') as f:
    f.write(compressed_data)

print(f"Compressed file saved as {output_file_path}")

By saving only one of those two hashes we reduce the lock file size a bit.
This should be especially noticeable in git repos since hashes compress poorly
@Hofer-Julian
Copy link
Collaborator Author

Could you give some statistics on file size or compressed file reductions for: https://github.com/rerun-io/rerun (e.g)?

Here you go:

nu  ls --du | sort-by size | reverse
╭───┬────────────────────────────┬──────┬───────────┬───────────────╮
 # │            name            │ type │   size    │   modified    │
├───┼────────────────────────────┼──────┼───────────┼───────────────┤
 0  conda-lock_with_md5         file    1.7 MiB  8 minutes ago 
 1  conda-lock_without_md5      file    1.6 MiB  8 minutes ago 
 2  conda-lock_with_md5.zip     file  241.0 KiB  2 minutes ago 
 3  conda-lock_without_md5.zip  file  199.0 KiB  2 minutes ago 
╰───┴────────────────────────────┴──────┴───────────┴───────────────╯

The fraction between with and without md5 not zipped is 0.94.
The fraction between with and without md5 zipped is 0.82.

That indicates that the impact is indeed higher for compressed situations.

@tdejager
Copy link
Collaborator

tdejager commented Jul 5, 2024

That's pretty significant for the compressed case :)

Copy link
Collaborator

@tdejager tdejager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Hofer-Julian Hofer-Julian merged commit 4ee9217 into conda:main Jul 5, 2024
14 checks passed
@Hofer-Julian Hofer-Julian deleted the lock-file-reduction-hash branch July 5, 2024 14:31
@baszalmstra
Copy link
Collaborator

@tdejager @Hofer-Julian The reason we store all the metadata including the md5 and the sha256 is because a matchspec can depend on these values. E.g you can have torch[md5=123456789]. That is also the reason we store the license etc. I know this is an edge case but by removing this we can no longer statitically verify the lock-file correctly. We also break incremental updates.

@tdejager
Copy link
Collaborator

tdejager commented Jul 6, 2024

Do we have examples of this being used?

@baszalmstra
Copy link
Collaborator

Not from the top of my head no. But if we remove this field we can also remove a bunch of others: e.g. size, file_name, timestamp, etc.

But I would be opposed to that because it would potentially break a future incremental update of a lock-file. We have been advocating for storing the entire repodata record in the lock-file for exactly this reason.

@baszalmstra
Copy link
Collaborator

See also the design goals and text about conparison with conda-lock: https://docs.rs/rattler_lock/latest/rattler_lock/#relation-to-conda-lock

@tdejager
Copy link
Collaborator

tdejager commented Jul 7, 2024

Yes you are right, we might need to come up with alternative solutions though, we could revert this.

Because the large lock files are a problem :)

baszalmstra added a commit that referenced this pull request Jul 8, 2024
@baszalmstra baszalmstra mentioned this pull request Jul 8, 2024
baszalmstra added a commit that referenced this pull request Jul 8, 2024
## 🤖 New release
* `rattler`: 0.26.4 -> 0.26.5
* `rattler_cache`: 0.1.0 -> 0.1.1
* `rattler_conda_types`: 0.25.2 -> 0.26.0
* `rattler_macros`: 0.19.3 -> 0.19.4
* `rattler_package_streaming`: 0.21.3 -> 0.21.4
* `rattler_networking`: 0.20.8 -> 0.20.9
* `rattler_shell`: 0.20.9 -> 0.21.0
* `rattler_lock`: 0.22.12 -> 0.22.13
* `rattler_repodata_gateway`: 0.20.5 -> 0.21.0
* `rattler_solve`: 0.24.2 -> 0.25.0
* `rattler_libsolv_c`: 0.19.3 -> 0.19.4
* `rattler_virtual_packages`: 0.19.15 -> 0.19.16
* `rattler_index`: 0.19.17 -> 0.19.18

<details><summary><i><b>Changelog</b></i></summary><p>

## `rattler`
<blockquote>

##
[0.26.5](rattler-v0.26.4...rattler-v0.26.5)
- 2024-07-08

### Added
- add direct url repodata building
([#725](#725))

### Fixed
- errors should not contain trailing punctuation
([#763](#763))
- run clippy on all targets
([#762](#762))
</blockquote>

## `rattler_cache`
<blockquote>

##
[0.1.1](rattler_cache-v0.1.0...rattler_cache-v0.1.1)
- 2024-07-08

### Added
- add direct url repodata building
([#725](#725))

### Fixed
- run clippy on all targets
([#762](#762))
</blockquote>

## `rattler_conda_types`
<blockquote>

##
[0.26.0](rattler_conda_types-v0.25.2...rattler_conda_types-v0.26.0)
- 2024-07-08

### Added
- add support for zos-z
([#753](#753))
- return pybytes for sha256 and md5 everywhere and use md5 hash for
legacy bz2 md5 ([#752](#752))
- add direct url repodata building
([#725](#725))
- add shards_base_url and write shards atomically
([#747](#747))

### Fixed
- allow version following package in strict mode
([#770](#770))
- Fix doctests and start testing them again
([#767](#767))
- skip over implicit `0` components when copying
([#760](#760))
- allow empty json repodata
([#745](#745))
- lenient and strict parsing of equality signs
([#738](#738))
- This fixes parsing of `ray[default,data] >=2.9.0,<3.0.0`
([#732](#732))
</blockquote>

## `rattler_macros`
<blockquote>

##
[0.19.4](rattler_macros-v0.19.3...rattler_macros-v0.19.4)
- 2024-07-08

### Other
- update README.md
</blockquote>

## `rattler_package_streaming`
<blockquote>

##
[0.21.4](rattler_package_streaming-v0.21.3...rattler_package_streaming-v0.21.4)
- 2024-07-08

### Fixed
- run clippy on all targets
([#762](#762))
</blockquote>

## `rattler_networking`
<blockquote>

##
[0.20.9](rattler_networking-v0.20.8...rattler_networking-v0.20.9)
- 2024-07-08

### Fixed
- errors should not contain trailing punctuation
([#763](#763))
</blockquote>

## `rattler_shell`
<blockquote>

##
[0.21.0](rattler_shell-v0.20.9...rattler_shell-v0.21.0)
- 2024-07-08

### Added
- add direct url repodata building
([#725](#725))
- allow passing custom environment to run_activation
([#743](#743))
</blockquote>

## `rattler_lock`
<blockquote>

##
[0.22.13](rattler_lock-v0.22.12...rattler_lock-v0.22.13)
- 2024-07-08

### Added
- Only save md5 in lock file if no sha256 is present
([#764](#764))
- return pybytes for sha256 and md5 everywhere and use md5 hash for
legacy bz2 md5 ([#752](#752))
- add direct url repodata building
([#725](#725))

### Fixed
- lock file stability issues with PyPI types
([#761](#761))
- errors should not contain trailing punctuation
([#763](#763))

### Other
- revert only save md5 in lock file if no sha256 is present
([#766](#766))
</blockquote>

## `rattler_repodata_gateway`
<blockquote>

##
[0.21.0](rattler_repodata_gateway-v0.20.5...rattler_repodata_gateway-v0.21.0)
- 2024-07-08

### Added
- improve error message when parsing file name
([#757](#757))
- add direct url repodata building
([#725](#725))
- add shards_base_url and write shards atomically
([#747](#747))

### Fixed
- direct_url query for windows
([#768](#768))
- Fix GatewayQuery.query to filter records based on provided specs
([#756](#756))
- run clippy on all targets
([#762](#762))
- allow empty json repodata
([#745](#745))

### Other
- document gateway features
([#737](#737))
</blockquote>

## `rattler_solve`
<blockquote>

##
[0.25.0](rattler_solve-v0.24.2...rattler_solve-v0.25.0)
- 2024-07-08

### Added
- add direct url repodata building
([#725](#725))
- add tool to generate resolvo snapshots
([#741](#741))

### Fixed
- run clippy on all targets
([#762](#762))
- This fixes parsing of `ray[default,data] >=2.9.0,<3.0.0`
([#732](#732))

### Other
- bump resolvo to 0.6.0
([#733](#733))
- Document all features on docs.rs
([#734](#734))
</blockquote>

## `rattler_libsolv_c`
<blockquote>

##
[0.19.4](rattler_libsolv_c-v0.19.3...rattler_libsolv_c-v0.19.4)
- 2024-07-08

### Other
- update README.md
</blockquote>

## `rattler_virtual_packages`
<blockquote>

##
[0.19.16](rattler_virtual_packages-v0.19.15...rattler_virtual_packages-v0.19.16)
- 2024-07-08

### Added
- add support for zos-z
([#753](#753))

### Fixed
- run clippy on all targets
([#762](#762))
</blockquote>

## `rattler_index`
<blockquote>

##
[0.19.18](rattler_index-v0.19.17...rattler_index-v0.19.18)
- 2024-07-08

### Other
- updated the following local packages: rattler_conda_types,
rattler_package_streaming
</blockquote>


</p></details>

---
This PR was generated with
[release-plz](https://github.com/MarcoIeni/release-plz/).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants