[integration] Shallow clones for cargo #449

Open
4 of 9 tasks
Tracked by #303
Byron opened this issue Jul 1, 2022 · 4 comments
Labels
C-integrate-gitoxide "Oxidize" crates even more by replacing git2 with gitoxide

Comments

@Byron
Member

Byron commented Jul 1, 2022

This issue collects thoughts and facts about the state of shallow clones for git repositories when used by cargo.

Here is a list of steps to take in cargo to support step-wise integration of gitoxide.

Terminology

Let's be sure we are on the same page, so I repeat here this comment by @Eh2406 to set a baseline.

  • The "crates.io index": The index that backs crates.io. It lives at https://github.com/rust-lang/crates.io-index .
  • An "alternative git index": A git repo with the same structure as the "crates.io index", but for a different set of crates.
  • A "git dependency": A git repo cargo clones because of a git = "<url>" dependency in a Cargo.toml.

Another source of miscommunication is that there are two interconnected potential changes.

  • Switching from libgit2 -> gitoxide
  • Adding new functionality that is only available in gitoxide (Specifically "shallow clones")

Of course, the second depends on the first.

Tracking issues

Cloning crates.io + crates (non-shallow)

It would be most straightforward to implement git::fetch(…) using gitoxide. This includes all transports and all credentials options that git2 supports for maximum usability.

Note that checkouts would still be performed by git2.

Requirements

All requirements are to be validated with the cargo team, and a checkmark means it is indeed a requirement.

Cloning crates.io + crates (shallow)

Add a parameter to support shallow fetches that maintain shallow-ness.
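As a minimal sketch of what "maintaining shallow-ness" means, the following uses the plain git CLI (not gitoxide or cargo) against a throwaway local repository. The paths, the `commit` helper and the commit messages are made up for illustration.

```shell
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical helper: commit with a fixed identity so no global git config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

# Stand-in for a remote: a local repository with three commits.
git init -q "$tmp/upstream"
cd "$tmp/upstream"
for i in 1 2 3; do commit --allow-empty -m "commit $i"; done

# Shallow bare clone. file:// is required because --depth is ignored for plain local paths.
git clone -q --bare --depth 1 "file://$tmp/upstream" "$tmp/shallow.git"

# Upstream moves on; the follow-up fetch asks for depth 1 again so history stays truncated.
commit --allow-empty -m "commit 4"
git -C "$tmp/shallow.git" fetch -q --depth 1 origin

is_shallow=$(git -C "$tmp/shallow.git" rev-parse --is-shallow-repository)
commit_count=$(git -C "$tmp/shallow.git" rev-list --count FETCH_HEAD)
echo "shallow: $is_shallow, commits reachable from FETCH_HEAD: $commit_count"
```

The point of the sketch is that the depth has to be passed on every fetch, which is why a parameter on the fetch path is needed rather than a one-time clone option.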

Issues

Don't forget about the general considerations of shallow clones for database-like repositories by ehuss in a comment, which might make this option unusable. It's something to validate first. If it truly is an issue, shallow can be turned off for crates.io but can be used for crates clones.

Assumptions

These should be validated to see if they may indeed be considered issues or risks one day in case they are proven true.

  • Shallow clones are no longer inherently slower to serve and are thus desirable without increasing the risk of being throttled by GitHub. Work on this started about 6 years ago and is likely mature by now.
  • ✅ Shallow clones actually save time and bandwidth: a maximally shallow clone is ~5.3x faster and uses about 1/4 of the disk space. See the clone transcripts further down for the source data.
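The two ratios can be re-derived from the clone transcripts in this issue. The sketch below only redoes that arithmetic; it treats the received pack size as a proxy for disk usage, which is an assumption.

```shell
export LC_ALL=C

# Numbers copied from the two `git clone --bare` transcripts in this issue.
full_secs=$((2 * 60 + 59))   # full-history clone took 2m59s
shallow_secs=34              # --depth 1 clone took 34s
full_mib=209.38              # received objects, full history
shallow_mib=53.77            # received objects, depth 1

speedup=$(awk -v a="$full_secs" -v b="$shallow_secs" 'BEGIN { printf "%.2f", a / b }')
size_ratio=$(awk -v a="$shallow_mib" -v b="$full_mib" 'BEGIN { printf "%.2f", a / b }')

echo "speedup: ${speedup}x"        # 179s / 34s
echo "size ratio: ${size_ratio}"   # 53.77 MiB / 209.38 MiB
```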

Questions

  • How to 'unshallow' a crates index? Some might want it for research. In any case, there should be a known path for this, so probably there must be an option for this in the cargo config no matter what will be the default.
    • No need, it's a special case and those who need it can always recreate the index from scratch. The index is an implementation detail.
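For anyone who does want the history, the git CLI already offers a path outside of cargo. The following is a sketch against a throwaway local repository standing in for an index; all paths and the `commit` helper are hypothetical.

```shell
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical helper: commit with a fixed identity so no global git config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

# Stand-in for an index repository with three commits of history.
git init -q "$tmp/upstream"
cd "$tmp/upstream"
for i in 1 2 3; do commit --allow-empty -m "commit $i"; done

# Start from a depth-1 bare clone, as a shallow index clone would.
git clone -q --bare --depth 1 "file://$tmp/upstream" "$tmp/index.git"
before=$(git -C "$tmp/index.git" rev-list --count HEAD)

# Convert it into a full clone by fetching the missing history.
git -C "$tmp/index.git" fetch -q --unshallow origin
after=$(git -C "$tmp/index.git" rev-list --count HEAD)
shallow_after=$(git -C "$tmp/index.git" rev-parse --is-shallow-repository)

echo "commits before: $before, after: $after, still shallow: $shallow_after"
```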

Requirements

  • a cargo config setting to turn shallow clones on or off - maybe enabled by default for crates clones and disabled for the crates index.
  • documentation on ways to change shallow-ness of crates.io clones (purposefully omitting such documentation for crates clones merely because I consider them private to cargo)
  • validate that older cargo versions can still work with such an index. It's likely they can as git2 can open them (and we only access a single tree which has complete objects)
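The parenthetical in the last item relies on the tip tree of a shallow clone being complete. A small sketch of that property with the git CLI follows; it does not exercise git2 or any cargo version, and the `config.json` content and paths are invented for the demo.

```shell
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Hypothetical helper: commit with a fixed identity so no global git config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

# Stand-in index with one file and two commits of history.
git init -q "$tmp/index"
cd "$tmp/index"
printf '{"dl":"https://example.invalid/api/v1/crates"}\n' > config.json
git add config.json
commit -m "add config"
commit --allow-empty -m "later update"

# Depth-1 bare clone: history is cut off, but the tree of the tip commit is fully present.
git clone -q --bare --depth 1 "file://$tmp/index" "$tmp/index-shallow.git"

# Read a file straight from the tip tree, without any worktree checkout.
content=$(git -C "$tmp/index-shallow.git" show HEAD:config.json)
echo "$content"
```

Reading succeeds here, but whether an older, git2-based cargo can also fetch into such a repository is exactly what the requirement asks to validate.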

Notes by @Eh2406

  • "shallow clones" of the "crates.io index" we could experiment with. But stabilizing requires careful communication with GitHub to make sure we don't abuse their generosity. With sparse indexes coming along, I don't know that the coordination is worth setting up.
  • "shallow clones" of the "alternative git index"s we could experiment with. However, it's not very motivating as I suspect a lot of alternative indexes will switch to sparse indexes.

Interesting reading

Checkout worktrees (without submodules)

This effectively is an implementation of git reset --hard as used in GitCheckout::reset(…).

Questions

  • Does cargo manipulate existing checkouts to match different versions as needed, or is each version of a clone in its own worktree, along with a git repository copy? It's probably the latter, but let's validate that. - YES, with hard-links if available.
    • Cargo separates checkouts (git/checkouts) from their sources (git/db, bare repos) and creates each checkout as a full clone of its source. Worktrees should help here, saving quite a bit of space.
  • Does cargo update these db clones or always create a new one? This is the question of how to update worktrees with submodules properly after changes were pulled. I have a feeling the current setup works around this.

Notes by @Eh2406

  • "shallow clones" of "git dependency"s is definitely worth striving for. I don't think it needs an opt-in, unless there are practical use cases where people might need the full history.

Checkout submodules

Update submodules as in GitCheckout::update_submodules(…).

Out of scope

Reducing the local size of the .cargo directory seems very doable even without great effort, but we chose to tackle these separately.

  • optimize crates clones by using worktree checkouts instead of local file://… clones.
  • use a bare clone of the crates.io index and extract file contents directly from git.
    • cargo is doing that already
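To illustrate the first item, here is a sketch of a worktree checkout created from a bare repository with the git CLI. The directory names only mimic the roles of cargo's git/db and git/checkouts and are otherwise made up; cargo itself does not do this today.

```shell
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Source repository with a single commit.
git init -q "$tmp/src"
git -C "$tmp/src" -c user.name=demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "initial"

# Bare "db" repository, comparable in role to an entry in git/db.
git clone -q --bare "$tmp/src" "$tmp/db.git"
rev=$(git -C "$tmp/db.git" rev-parse HEAD)

# A detached worktree reuses db.git's object database instead of cloning it again.
git -C "$tmp/db.git" worktree add -q --detach "$tmp/checkout" "$rev"

checked_out=$(git -C "$tmp/checkout" rev-parse HEAD)
# In a linked worktree, .git is a small pointer file rather than a repository copy.
if [ -f "$tmp/checkout/.git" ]; then git_entry=file; else git_entry=directory; fi
echo "checked out $checked_out, .git is a $git_entry"
```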

bare shallow clones vs non-shallow ones

❯ git clone --bare https://github.com/rust-lang/crates.io-index index-full-history.git
Cloning into bare repository 'index-full-history.git'...
remote: Total 457133 (delta 151), reused 69 (delta 0), pack-reused 456913
Receiving objects: 100% (457133/457133), 209.38 MiB | 1.21 MiB/s, done.
Resolving deltas: 100% (319566/319566), done.

~/.cargo/registry took 2m59s
❯ git clone --depth 1 --bare https://github.com/rust-lang/crates.io-index index-shallow-depth-1.git
Cloning into bare repository 'index-shallow-depth-1.git'...
remote: Total 108481 (delta 57698), reused 92572 (delta 47615), pack-reused 0
Receiving objects: 100% (108481/108481), 53.77 MiB | 2.05 MiB/s, done.
Resolving deltas: 100% (57698/57698), done.

~/.cargo/registry took 34s

worktree checkout sizes (compressed, uncompressed)

.cargo/registry/index-shallow-depth-1.git ( master)
❯ l
.rw-r--r-- 703Mi byron staff  1 Jul 11:40 archive.tar
.rw-r--r--  44Mi byron staff  1 Jul 11:40 archive.tar.gz
@Byron Byron added the C-integrate-gitoxide "Oxidize" crates even more by replacing git2 with gitoxide label Jul 1, 2022
@Byron Byron mentioned this issue Jul 1, 2022
48 tasks
@wezm

wezm commented Jul 6, 2022

Just as another reference for projects moving away from shallow clones at GitHub's request, there is also Homebrew in 2020:

https://github.com/Homebrew/brew/blob/17a7e71d909de4d09bc2cb479b1ccf975648fbd2/Library/Homebrew/cmd/update.sh#L448-L454

Homebrew/brew#8883

@Byron
Member Author

Byron commented Jul 6, 2022

Thanks for posting!

What I understand from this is that:

  • unshallow operations are expensive and should be avoided (i.e. don't shallow clone just to unshallow afterwards)
    • this should imply that shallow fetches are fine as they are the same amount of work as normal fetches
  • brew still supports shallow clones on CI and does so by default

This should mean that shallow clones for cargo are pretty much the way to go on CI, and probably locally as well, since no unshallow operation is planned at all. Users might still choose to unshallow if they want to access the history of the crates index, for research for instance; cargo itself doesn't care about the history.

Please let me know if I am missing something.

@Byron Byron changed the title from "[research] Shallow clones for cargo" to "[integration] Shallow clones for cargo" Aug 4, 2022
@Eh2406

Eh2406 commented Aug 9, 2022

Thank you for your work organizing this complicated topic. I am going to put 2 nits here, and the rest in the zulip thread.
Just my comments, not speaking officially for the team.

> How to 'unshallow' a crates index? Some might want it for research. In any case, there should be a known path for this, so probably there must be an option for this in the cargo config no matter what will be the default.

I don't think there needs to be an option for unshallowing. The clone that Cargo makes is an implementation detail, that entirely belongs to Cargo. If someone wants a copy of the data with different configuration, they can clone the index themselves. That being said, we do have to be careful about how different versions of Cargo interact. If one version of cargo does a shallow clone, and then an older version of cargo is used to do an index update, things have to at least work.

> use a bare clones of the crates.io index and extract files content directly from git

I believe this is done already.

@Byron
Member Author

Byron commented Aug 10, 2022

> Thank you for your work organizing this complicated topic. I am going to put 2 nits here, and the rest in the zulip thread. Just my comments, not speaking officially for the team.
>
> > How to 'unshallow' a crates index? Some might want it for research. In any case, there should be a known path for this, so probably there must be an option for this in the cargo config no matter what will be the default.
>
> I don't think there needs to be an option for unshallowing. The clone that Cargo makes is an implementation detail, that entirely belongs to Cargo. If someone wants a copy of the data with different configuration, they can clone the index themselves. That being said, we do have to be careful about how different versions of Cargo interact. If one version of cargo does a shallow clone, and then an older version of cargo is used to do an index update, things have to at least work.

Thank you, I took note of this and specifically mentioned the need for backwards compatibility and to validate it. Since older versions of cargo use git2, and recent versions fall back to git2 when gitoxide isn't turned on, I wouldn't be surprised if fetches failed. I am looking forward to trying it out.

> > use a bare clones of the crates.io index and extract files content directly from git
>
> I believe this is done already.

A good catch! I updated that passage to reflect the status quo.

I also started off this issue with the terminology you provided over on zulip, thanks for taking the time to help people discuss this complex topic.


3 participants