Use GitHubs commit API to check out the tree directly #11176
r? @epage (rust-highfive has picked a reviewer for you, use r? to override) |
I find this method downright genius, as it uses the GitHub API in conjunction with standard git protocol features (namely, fetching objects specified by hash) to emulate a shallow clone that is even more space efficient than an actual shallow clone, because it skips commits entirely. Let's call it a 'super-shallow' clone. Thus, even once shallow cloning is available, and assuming there are no blockers on GitHub's side, using this fast path would be beneficial either way. From a performance perspective on the server side, judging from my own limited experience, I'd be surprised if GitHub had an issue with it. It will be good to hear that from them though. Something I was always interested in was keeping the local crates-index clone as standard as possible, to allow users to interact with it like they normally would. With this PR we get a step further away from that as I hope to get to integrating cloning and fetching the crates index using |
One thing I forgot to do in this PR, and forgot to mention: we can easily change the hash used by Cargo in the file path where the git db is stored. This can prevent old and new versions of Cargo from interacting with the same git dbs. Hopefully then the only time we transition back to a normal clone is on a failed API call. And if updates are the expensive part, then we can make it opt-in, so people can use it where updates are not likely (on CI). |
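As a rough illustration of that idea, the sketch below mixes a layout marker into the hash that names the git db directory, so a Cargo using the fast path and one using full clones never share a directory. The function name, the constants, and the use of `DefaultHasher` are assumptions made for this example; Cargo's real hashing scheme is different.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical layout markers: one per on-disk format of the git db.
const LAYOUT_FULL_CLONE: u32 = 1;
const LAYOUT_FAST_PATH: u32 = 2;

/// Build a directory name for a git db from the remote URL and a layout
/// marker. Assumption: this only illustrates the idea and is not the
/// hashing scheme Cargo actually uses.
fn db_dir_name(ident: &str, url: &str, layout: u32) -> String {
    let mut hasher = DefaultHasher::new();
    // Hashing the layout marker first means the same URL maps to a
    // different directory for each layout.
    layout.hash(&mut hasher);
    url.hash(&mut hasher);
    format!("{}-{:016x}", ident, hasher.finish())
}
```

With this shape, a version that falls back to a normal clone would simply ask for the `LAYOUT_FULL_CLONE` directory and leave the fast-path directory untouched.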
Overall, looks good. As I wasn't as familiar with this, the area I was most concerned about was that invalidation still works correctly and it seems so.
src/cargo/sources/git/utils.rs
Outdated
        let id = if tree {
            shalow_data.tree
        } else {
            shalow_data.etag
        };
        Some(id.parse::<Oid>().ok()?)
    })
{
    return Ok(reference);
I'm concerned over the brittleness of this API.
If `tree` is true, it will resolve to a tree if and only if we are in the fast path. All (transitive) callers need to know this is just a generic object id and not specifically a commit id (which is an easy assumption to make).

In my ideal world
- We would have separate `resolve_to_object` and `resolve_to_commit` functions.
  - `object` is more accurate than `tree` as we can't assume it's a tree
  - I feel like we should make this more obvious in the caller than just using a `true` or `false`
  - separate functions, as the difference in which to use is a design-time decision and not a runtime decision
- `resolve_to_object` would not do any `peel`s, making it more obvious when someone misuses the API

My main concern with this ideal is how much code would be duplicated. I guess it's not so bad if
- We have a `_resolve_to_object` function that is just the old code path without peels. `resolve_to_commit` would call it and do a peel.
- We have a `_find_shallow_blob` function that just gets the blob, and each of the top-level functions extracts the relevant field from that.
Thoughts?
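To make the proposal concrete, here is a minimal sketch of that split over a toy in-memory repository. Everything in it (`Repo`, `Object`, `ShallowData`, and the helper names without the leading underscore) is a hypothetical stand-in for illustration, not the code in `utils.rs` and not the `git2` API.

```rust
use std::collections::HashMap;

/// Toy object model, just enough to show peeling.
enum Object {
    Commit,
    Tree,
    Tag { target: String },
}

/// Stand-in for the data recorded by the fast path.
struct ShallowData {
    etag: String, // commit id reported by the API
    tree: String, // id of that commit's tree
}

struct Repo {
    refs: HashMap<String, String>,
    objects: HashMap<String, Object>,
    shallow: Option<ShallowData>,
}

impl Repo {
    /// Shared helper: fetch the fast-path record, if any.
    fn find_shallow_blob(&self) -> Option<&ShallowData> {
        self.shallow.as_ref()
    }

    /// Shared helper: the normal lookup, with no peeling at all.
    fn resolve_unpeeled(&self, reference: &str) -> Option<String> {
        self.refs.get(reference).cloned()
    }

    /// Follow tags until a commit is reached; a tree cannot be peeled to one.
    fn peel_to_commit(&self, mut id: String) -> Option<String> {
        loop {
            match self.objects.get(&id)? {
                Object::Commit => return Some(id),
                Object::Tag { target } => id = target.clone(),
                Object::Tree => return None,
            }
        }
    }

    /// Returns a generic object id: the tree on the fast path, otherwise
    /// whatever the reference names. Never peels.
    pub fn resolve_to_object(&self, reference: &str) -> Option<String> {
        match self.find_shallow_blob() {
            Some(data) => Some(data.tree.clone()),
            None => self.resolve_unpeeled(reference),
        }
    }

    /// Always returns a commit id, which on the fast path may not exist on disk.
    pub fn resolve_to_commit(&self, reference: &str) -> Option<String> {
        match self.find_shallow_blob() {
            Some(data) => Some(data.etag.clone()),
            None => self.peel_to_commit(self.resolve_unpeeled(reference)?),
        }
    }
}
```

The caller now picks the function at design time, and the duplication is limited to the two small `match` expressions.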
I completely agree with your point that this is a compile-time distinction. I will rework the interface to match that (when I next have time to code). I think the two relevant functions are `resolve_to_tree` (which always points to a tree that git has on disk) and `resolve_to_commit` (which will always point to a commit, but may not be something git has on disk).
Ready for the next review |
#[test]
fn check_with_git_hub() {
    panic!(
        r#"nonstandard shallow clone may be worse than a full check out.
        This test is here to make sure we do not merge until we have official signoff from GitHub"#
    )
}
Raising visibility of this
👋 Hello from GitHub! Using the commits API certainly works. However, if you are requesting a lot of files then you might run into our API rate limits. If you are only interested in the content of the HEAD commit of a particular reference, then the most efficient way to get this data is our archive API. Here is an example which downloads a tarball of the HEAD of the
$ curl -L https://api.github.com/repos/rust-lang/crates.io-index/tarball/master | tar zxf -
Switching to using tarballs is a larger structural change, one I will definitely continue to look into. For some of Cargo's needs we do in fact want to extract the output, so it might be a very good fit there. However, the index is the place where Cargo puts the most load on GitHub's infrastructure. We do not extract the index on disk; instead we only extract and parse the files we happen to need. I don't think tarballs are compatible with random access. We could get a zipball, but this is also significantly larger. The result would be a whole new kind of index (git/sparse and now ZIP), with significant testing and implementation overhead, which would probably not merge before we stabilize sparse indexes. |
I realize the problem with this plan, |
I built a separate project that creates a new git repository and then fetches into it twice. The goal was to see the first one download a pack file of 64 MB, and the second one only downloading the changes. I used head: Things I tried:
Was helpful in figuring out what's going on. Unless we can figure out how to get deltas between two trees, I think this PR is probably dead in the water. |
Does it have to be between two trees? I see two distinct stages
For This narrows the problem down to: How can we make a commit object (that we know, but don't have) so that a ref points to it and it is picked up as A look at the code reveals that the way
The idea is to manufacture a commit that:
Thus we create a commit graph made up of fake commits that is 'one-commit-per-fetch'. That way the server will get enough information, even without mentioning the intermediate commits that we don't know about. I know this is fine as This sounds like a lot of workaround'ing to have another stab at getting this to work, but I think it could indeed work. The question is if it's worth the effort (and maintenance) in the light of Edit: It's worth noting that these 'fake' commits would be loose objects which have been renamed to match the required hash. During normal operation, |
Unfortunately there are two problems with that suggestion. One problem is that libgit2 does appear to check. In my experiments, if the right-hand side of a refspec points at a loose object that has been renamed to have the hash of head, then libgit2 gives an "object hash mismatch" error. The second issue is more subtle: what happens if the new tree contains a file whose contents existed in an old commit, but not in the tree associated with the old head? What would happen is:
|
So this is how it feels to be sad and happy at the same time 😅. |
Seems like the low-level control needed to make this work will have to wait on gitoxide. Which will probably also support more normal/standard operations, making this scheme unnecessary. Also hopefully sparse indexes will be stable by then, which also makes such extraordinary measures unjustified. |
Cargo often needs to clone or update repos that have a lot of history, when all it needs is the contents of one particular reference. The most dramatic example is the index, which we clone or update regularly. sparse-registry will move us away from using git for this particular case entirely, but in the meantime it would be nice to reduce the transfer costs.
The GitHub Blog post "Get up to speed with partial clone and shallow clone" describes how to use "shallow clone" (`--depth=1`) or "treeless clone" (`--filter=`) to make this less expensive. Unfortunately, `libgit2` does not support either of these features. When GitOxide is ready, we should talk to GitHub about whether these features make sense in our use case.

It turns out, without using any fancy features, you can fetch an arbitrary git object ID and everything it links to. If we knew the ID of the Tree of the commit of the reference we were trying to check out, we could fetch everything we needed without any of the history! GitHub provides an API endpoint for retrieving information about a commit. One of the included pieces of information is the ID of the Tree. Even better, we already hit that API as a fast path for discovering that we are up-to-date. So instead of that fast path returning the `sha` we return the `commit.tree.sha`, and we no longer clone the history!

In fact there are more minor implementation details:
`sha` and `commit.tree.sha` somewhere so that we can use it as an etag, and so that we know what to check out, as git no longer has Commits to refer to. I have chosen to make a blob and a ref point to that blob.

There is one failing test, specifically added because this cannot be merged until we talk to GitHub about:
Fundamentally this change reduces disk footprint and network bandwidth by an enormous amount. Different repos will have different ratios. But the index is quite significant, in the 80% range, depending on how recently it was squashed.
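For readers wondering what the blob holding `sha` and `commit.tree.sha` might look like, the following is a minimal sketch of one possible encoding. The two-line text format, the `to_blob`/`from_blob` names, and the validation are assumptions made for this example; the PR does not specify the format it actually writes.

```rust
/// Hypothetical record of what the commit API returned.
#[derive(Debug, PartialEq, Eq)]
struct ShallowData {
    etag: String, // the commit `sha`, reused as an etag
    tree: String, // `commit.tree.sha`, the object that gets fetched
}

/// True for a 40-character hexadecimal SHA-1 object id.
fn is_sha1_hex(s: &str) -> bool {
    s.len() == 40 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

impl ShallowData {
    /// Serialize to the (assumed) two-line blob contents.
    fn to_blob(&self) -> String {
        format!("etag {}\ntree {}\n", self.etag, self.tree)
    }

    /// Parse blob contents, rejecting unknown keys and malformed ids so that
    /// a corrupted blob falls back to the normal clone path.
    fn from_blob(blob: &str) -> Option<ShallowData> {
        let mut etag = None;
        let mut tree = None;
        for line in blob.lines() {
            match line.split_once(' ')? {
                ("etag", v) if is_sha1_hex(v) => etag = Some(v.to_string()),
                ("tree", v) if is_sha1_hex(v) => tree = Some(v.to_string()),
                _ => return None,
            }
        }
        Some(ShallowData { etag: etag?, tree: tree? })
    }
}
```

Returning `None` on any parse failure keeps invalidation simple: anything unexpected is treated the same as having no fast-path data at all.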