More size-efficient pixi.lock (with care taken to optimize for the git packfile). #1509

Open
jleibs opened this issue Jun 14, 2024 · 6 comments
Labels
✨ enhancement Feature request

Comments

@jleibs

jleibs commented Jun 14, 2024

Problem description

The pixi.lock file in our repository is now over 1 megabyte (https://github.com/rerun-io/rerun/blob/main/pixi.lock).

It still compresses reasonably well in git object storage within the packfile (taking up about 3 MB of storage across history), but it is fast becoming a meaningful contributor to repository growth.

This is a tricky one to address, as we ultimately care more about the file's contribution to the delta-compressed packfile than about its size on disk. Compression strategies that make the file smaller in a single checkout but harder to delta-compress would still be a net negative.

@jleibs jleibs added the ✨ enhancement Feature request label Jun 14, 2024
@baszalmstra
Contributor

Do you have an idea on how we could change the format to achieve this?

I think one of the biggest issues is the presence of SHA hashes, because those compress terribly. We tried to minimize the places where these occur for that reason.
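
To illustrate the point about hashes, here is a minimal sketch that compares how well zlib (the deflate library git uses for its objects) compresses repetitive metadata versus hex-encoded SHA-256 digests. The sample strings are made up for illustration and are not taken from a real pixi.lock, and the sketch ignores git's delta compression between versions.

```python
import hashlib
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by original size (lower is better)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Assumption: 500 repetitive metadata entries standing in for the fields
# a lockfile repeats for every package.
metadata = "\n".join(
    f"- name: package-{i % 20}\n  channel: https://conda.anaconda.org/conda-forge/"
    for i in range(500)
)

# 500 hex-encoded SHA-256 digests, one per line.
digests = "\n".join(
    "sha256: " + hashlib.sha256(str(i).encode()).hexdigest() for i in range(500)
)

metadata_ratio = compression_ratio(metadata)
digest_ratio = compression_ratio(digests)
print(f"metadata ratio: {metadata_ratio:.2f}")
print(f"digest ratio:   {digest_ratio:.2f}")
```

A hex digest carries 4 bits of entropy per 8-bit character, so the digest lines cannot shrink much below half their size, while the repetitive metadata compresses to a small fraction.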

@jleibs
Author

jleibs commented Jun 14, 2024

Not off the top of my head -- I definitely acknowledge it's a hard problem, so I was at least somewhat relieved to find that it still compressed reasonably well in the packfile.

Looking at the file itself, it seems there is still a lot of metadata that is redundant with the package-management metadata from conda/PyPI.

From an information-theory perspective, does the lockfile need to have more than a table of (feature, platform, package-name, version-number)?

All the information about the kind of package, where to find it, its own transitive dependencies, etc. seems like it could be recomputed from the pixi.toml file.

Maybe there need to be two files here? One would be a strict, minimal .lock that can be included in the repository, while the other would be a materialized file that can be cached somewhere.
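
As a rough sketch of the minimal table described above: the nested dictionary below is a hypothetical stand-in for a fully materialized lock, not the real pixi.lock schema, and `LockRow`, `minimal_rows`, and `render_table` are names invented for this example. The idea is to keep only the four columns and emit them sorted, one row per line, so that a version bump changes a single line.

```python
from typing import NamedTuple


class LockRow(NamedTuple):
    feature: str
    platform: str
    name: str
    version: str


def minimal_rows(lock: dict) -> list[LockRow]:
    """Reduce a materialized lock to sorted (feature, platform, name, version) rows."""
    rows = set()
    for feature, platforms in lock.items():
        for platform, packages in platforms.items():
            for pkg in packages:
                # Everything except name and version is dropped on purpose.
                rows.add(LockRow(feature, platform, pkg["name"], pkg["version"]))
    return sorted(rows)


def render_table(rows: list[LockRow]) -> str:
    """Render one tab-separated row per line, in a stable order."""
    return "\n".join("\t".join(row) for row in rows)


# Hypothetical materialized lock (layout assumed for illustration only).
toy_lock = {
    "default": {
        "linux-64": [
            {"name": "python", "version": "3.12.3", "sha256": "(omitted)"},
            {"name": "numpy", "version": "1.26.4", "sha256": "(omitted)"},
        ],
        "osx-arm64": [
            {"name": "numpy", "version": "1.26.4", "sha256": "(omitted)"},
        ],
    }
}

table = render_table(minimal_rows(toy_lock))
print(table)
```

Whether the dropped fields can really be recomputed deterministically from pixi.toml is the open question in this thread; the sketch only shows what the committed file could look like if they can.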

@tdejager
Contributor

There has been some talk in the Python packaging space along these lines, where the lock file would still need to be rendered into a suitable format. Could be food for thought :)

@glemaitre
Contributor

We are facing much the same issue as we look at using pixi to manage our CI/CD.

> From an information theory perspective does the lockfile need to have more than a table of: (feature, platform, package-name, version-number)

In scikit-learn, for the workflow where we would like to use pixi, we currently store conda-lock files (e.g. https://github.com/scikit-learn/scikit-learn/blob/main/build_tools/azure/pylatest_conda_forge_mkl_osx-64_conda.lock) that carry less metadata.

In skrub, we went with pixi, and the automatic weekly update of some dependencies brings a significant number of changes.

So I think the "environment" part of pixi.lock plus the checksums would be almost enough in this particular CI setting.

I can clearly understand that in other cases you might need more information to get a fully reproducible and secure environment.

Maybe there is a way to let users choose where to set that trade-off for their use case.

> seem like they could be re-computed from the pixi.toml file again

If this is possible, it could be one such trade-off: less information is stored, at the small cost of recomputing some of it when the CI is triggered.

@tdejager
Contributor

As discussed on Discord, I'm also curious what we could gain by deduplicating some of the common prefixes.
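
One possible reading of prefix deduplication, sketched under assumptions: package URLs are grouped by everything up to the last `/`, so each shared prefix is written once. The URLs and file names below are invented examples, and `dedupe_prefixes` and `expand` are hypothetical helpers, not part of pixi or rattler.

```python
from collections import defaultdict


def dedupe_prefixes(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by their directory prefix: {prefix: [file names]}."""
    grouped = defaultdict(list)
    for url in urls:
        prefix, _, filename = url.rpartition("/")
        grouped[prefix + "/"].append(filename)
    return dict(grouped)


def expand(grouped: dict[str, list[str]]) -> list[str]:
    """Inverse of dedupe_prefixes: rebuild the full URLs."""
    return [prefix + name for prefix, names in grouped.items() for name in names]


# Invented example URLs (file names are illustrative only).
urls = [
    "https://conda.anaconda.org/conda-forge/linux-64/numpy-1.26.4-example_0.conda",
    "https://conda.anaconda.org/conda-forge/linux-64/python-3.12.3-example_0.conda",
    "https://conda.anaconda.org/conda-forge/noarch/pip-24.0-example_0.conda",
]

grouped = dedupe_prefixes(urls)
for prefix, names in grouped.items():
    print(prefix, names)
```

This shrinks the checked-out file, but deflate already exploits repeated strings within its window, so the gain inside the packfile may be smaller than the on-disk saving and would need to be measured.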

@tdejager
Contributor

tdejager commented Jul 5, 2024

> Not off the top of my head -- I definitely acknowledge it's a hard problem, so I was at least somewhat relieved to find that it still compressed reasonably well in the packfile.

BTW @jleibs, what commands are you using to measure the size in the packfile?
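
For reference, one way such a measurement could be done (not necessarily what was used for the numbers above) is to list every historical blob of the file with `git rev-list --all --objects -- <path>` and ask `git cat-file --batch-check` for `%(objectsize)` and `%(objectsize:disk)`. The wrapper below is a sketch; `blob_sizes` and `parse_batch_check` are names made up for this example.

```python
import subprocess

# Fields: object id, type, uncompressed size, on-disk size, and the path
# that rev-list printed after the object id.
FORMAT = "%(objectname) %(objecttype) %(objectsize) %(objectsize:disk) %(rest)"


def parse_batch_check(output: str, path: str) -> list[tuple[str, int, int]]:
    """Keep (object id, size, size on disk) for blobs recorded at `path`."""
    rows = []
    for line in output.splitlines():
        parts = line.split(" ", 4)
        if len(parts) == 5 and parts[1] == "blob" and parts[4] == path:
            rows.append((parts[0], int(parts[2]), int(parts[3])))
    return rows


def blob_sizes(path: str, repo: str = ".") -> list[tuple[str, int, int]]:
    """Run git in `repo` and return the sizes of every version of `path`."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--all", "--objects", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file", f"--batch-check={FORMAT}"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked, path)


# Parsing a made-up sample of batch-check output (no repository needed).
sample = (
    "aaa commit 250 180 \n"
    "bbb tree 100 90 \n"
    "ccc blob 1048576 2048 pixi.lock\n"
    "ddd blob 1040000 512 pixi.lock\n"
)
sample_rows = parse_batch_check(sample, "pixi.lock")
print(sum(disk for _, _, disk in sample_rows), "bytes on disk in the sample")
```

Calling `blob_sizes("pixi.lock")` in a repository and summing the third column gives an estimate of the file's share of the packfile. The on-disk sizes depend on which delta bases git happened to choose, so they are most meaningful after a repack (for example `git gc`).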

Hofer-Julian added a commit to conda/rattler that referenced this issue Jul 5, 2024
By saving only one of those two hashes, we reduce the lock file size a
bit. This should be especially noticeable in git repos, since hashes
compress poorly.

This PR is a contribution to improving
prefix-dev/pixi#1509