Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Serve crates-io registry over HTTP as static files #2789

Merged
merged 23 commits into from
Jan 12, 2021

Conversation

kornelski
Copy link
Contributor

@kornelski kornelski commented Oct 18, 2019

The existing crate index format is good enough to be served over HTTP as-is, and still fetched and updated incrementally.

This can be faster than a git clone of the whole index ahead of time, and will spare clients from unbounded growth of the index.

Originating forum thread.

Rendered RFC

@djc
Copy link
Contributor

djc commented Oct 18, 2019

It would be good to touch on offline use scenarios and how these (as well as slow links) would be affected by these changes.

text/0000-http-index.md Outdated Show resolved Hide resolved
@kornelski
Copy link
Contributor Author

kornelski commented Oct 19, 2019

I've added more info about network speed. To sum it up:

  • It's a significant saving of bandwidth compared to a full clone, even if it ends up downloading all crates anyway. For CI that runs once and doesn't keep the index, it'll be a huge reduction in bandwidth.

  • For a first run with no cache and no lockfile, the worst case is as slow as the max depth of the dependency tree (but in practice less than that thanks to crates appearing in multiple places in the tree). In my projects I've seen deps 12 levels deep, so on a high-latency connection it may take several seconds, but that still is likely way faster than a full index clone.

  • For subsequent runs, with some cache or a lockfile, the update can be speculatively parallelized to reduce latency to one or two roundtrips, because Cargo will already know about all or almost all dependencies-of-dependencies that it needs to check.

  • In terms of number of update requests, the worst case scenario without incremental changelog is a few hundred requests. It may seem like a lot, but thanks to web bloat, web servers regularly handle this much for typical page load without breaking a sweat :) These requests can be made conditional (using HTTP status 304), they should use very little bandwidth for updates. The incremental changelog can efficiently mark crates as fresh and not needed updates, so that in the best case an update will be just one HTTP request for the log itself.

  • Cargo has a fast path for GitHub that checks if the index is fresh in one HTTP request. This can be supported too for the HTTP index, so the fast path is just as fast.

Note that cost of the git index is proportional to total size of crates-io, but cost of this solution is proportional to the project size. At some point git clone will become too large to be usable, but this solution will remain functional (even large Rust projects don't use all crates-io dependencies at the same time, so they won't grow as big as fast).

@Nemo157
Copy link
Member

Nemo157 commented Oct 19, 2019

One third-party tool that could not support a feature with this design is cargo add's - vs _ automatic distinction:

> cargo add serde-json
WARN: Added `serde_json` instead of `serde-json`

well, it still could be implemented by fetching every permutation, that would be O(n^2), but since n is going to generally be 1-2 that could be ok.

@matklad
Copy link
Member

matklad commented Oct 21, 2019

I think it might make sense to spell out how Cargo would use a new index in some more details. Currently, it works as follows.

For every command that can change lockfile (most commonly, first cargo build after Cargo.toml change or cargo update) the first thing that Cargo does is updating the registry (that is, pulling new changes from the git repo). After that, Cargo has a fresh offline copy of the registry in a bare git checkout, and use it for actual dependency resolution.

The two interesting properties of status quo are:

  • if you change lockfile, you always use the most recent version of the index, so that security patches and other minor updates propagate as fast as possible
  • index changes are globally atomic: Cargo works with a set of packages, identified by a commit hash

What would we do in the new implementation? I think there's no explicit registry update anymore. Instead, we update registry on a per-crate basis lazily, when we do version resolution?

@kornelski
Copy link
Contributor Author

kornelski commented Oct 21, 2019

@matklad This is covered in "greedy fetch" section. It works as follows:

  1. Cargo still "updates the registry" as a separate step, but instead of fetching everything, it fetches a subset of the registry that is relevant for a specific workspace. At this stage it's not an actual precise dependency resolution, but a "greedy" dependency resolution in order to discover and update all dependencies that may be needed. The greedy fetch guarantees that all of the crates that the actual dependency resolution algorithm might look at will be up to date.

  2. After the incremental update, Cargo has a fresh offline copy of the registry, except crates that were not relevant for the update. It can run the regular dependency resolution as usual, and it won't know the difference.

The registry storage format would change from a bare git repo to a directory structure similar to a git checkout. Theoretically it'd be possible to mix the two and add "fake" commits to a local git repo to keep using it as the storage format, but I don't think that's worth the complexity.

@carols10cents carols10cents added T-cargo Relevant to the Cargo team, which will review and decide on the RFC. T-crates-io Relevant to the crates.io team, which will review and decide on the RFC. labels Oct 21, 2019
@the8472
Copy link
Member

the8472 commented Oct 23, 2019

Possible alternative that wouldn't require http2 support: Download index as zip, which could utilize github's download as zip functionality.

@kornelski
Copy link
Contributor Author

kornelski commented Oct 24, 2019

The zip would be good only for initial download (you wouldn't want to download 20MB+ on every cargo update), so it sounds like a variation of "Initial index from rustup" alternative in the RFC.

@the8472
Copy link
Member

the8472 commented Oct 24, 2019

Zips don't have to be compressed (store "compression") and if github supports range requests one could only fetch the headers and the parts that changed, i.e. do an incremental download.

@kornelski
Copy link
Contributor Author

kornelski commented Oct 24, 2019

Oh, that's clever! So yes, a ZIP could be used to combine multiple requests into a two/three range requests (to find offsets of the ZIP central directory, to get the directory, to get ranges of desired files).

  • GitHub unfortunately doesn't support Range on their ZIPs, so crates-io would have to store the ZIP elsewhere.
  • Care would have to be taken to recover from race conditions (in case ZIP changed after client read the central directory)
  • If crates-io ever moved to storing crate metadata in a database, it'd be difficult to maintain a ZIP-like "API" on top of that. Plain HTTP/2 requests don't dictate how the index is stored server-side.

Currently the index requires registries to list all of their crates ahead of time. In a model where individual crates are requested directly by their name, a registry could even create crates on demand (lazily). That may be interesting for setups where Cargo registry is a facade for other code repositories.

@remram44
Copy link

remram44 commented Oct 24, 2019

You are describing something akin to zsync (or librsync), which interestingly I'm in the process of re-implementing in Rust 🤔

@gilescope
Copy link

If cargo hits http static files rather than a git repo wouldn't that make life a lot easier for private repositories like artifactory for example to support Rust? Seems like this would lower the barrier to supporting Rust in an enterprise setting so a thumps up from me.

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 20, 2019

Personal opinions, not speaking for the team.

This feels like a direction that could actually work! I have participated in many "just use HTTP" conversations in the past, this is the first time it has sounded plausible. (Sorry if you explained it to me before and I was to dance to get it.)

At some point between now and NPM this will be faster. That being said my sense is that the tipping point is far off. Additionally it is very important to read the line "Since alternative registries are stable, the git-based protocol is stable, and can't be removed." As such I will be spending my time on ways to improve the git-based protocol. This RFC is convincing enough that I would not object to someone spending there time on developing it.

On Performance:
This RFC assumes that a robust parallel HTTP/2 client will be maintainable and faster then a git fetch. I am skeptical by default. I would love to see examples showing that to be true. Async and await make thinks much ezere then it was last time I tried, and things continue to improve.

For Crates.io:
On CDN:
The Index is generously hosted by GitHub with a good CDN. We do not do things that will get it kicked out (aka shallow clone). Which side of that line is using raw.githubusercontent.com? I would like proactive assurance that switching will be ok. (I miss lolbench.rs and don't want that to happen to the Index.) Do we know someone at GitHub that can provide such assurance? If not can we get an estimate of the CDN bill? Switching may have to wait for that size bill to feel small.

For Alternative Registries:
I really like that this reduces the set of infrastructure required to set up an Alternative Registry to a static file server!

@the8472
Copy link
Member

the8472 commented Nov 20, 2019

Should compression support be mandatory for clients? i.e. they must send an Accept-encoding: ... header? gzip, br, zstd, ...?

@alexcrichton
Copy link
Member

The Cargo team discussed this RFC over the past few weeks, and we wanted to leave some thoughts of what we discussed and what we're thinking about this RFC.

Overall none of us are opposed to the principles of this RFC or the idea it's chasing after. If there's a better solution for hosting and managing the index, that'd be great to implement! If it's better after all then it's better :)

Our main discussion point, though, was that implementing this style of fetching the index is likely to be a very significant undertaking from a technical perspective in Cargo. There's a ton of concerns to take care of and it's not quite as simple as "just throw HTTP/2 at the problem and let it solve everything". We'd ideally like to see a prototype implementation to prove out the possible gains that an architecture like this would bring us, but unfortunately a prototype implementation is likely such a significant undertaking that there needs to be more buy-in before even that happens.

The conclusion we reached is that we'd be willing to accept this under the idea that it's an "eRFC" or "experimental RFC". It would basically be a blessing that the Cargo team is interested in seeing research and development around this area of hosting the index and fetching it on the client. We aren't wed to any particular details yet, and the actual details of this RFC as-written are highly likely to change while a solution is being developed.

@kornelski
Copy link
Contributor Author

kornelski commented Dec 8, 2019

I've implemented a proof-of-concept with greedy resolution that takes into account features and version ranges.

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept/blob/master/src/main.rs

Tested with rocket framework, which required 122 requests:

  • If every request had added a whole 1 second of latency, over HTTP/1.1, it fetched all deps in 11 seconds
  • If every request had added 200ms latency, over HTTP/1.1: 3.5 seconds.
  • Without extra latency, over HTTP/2 on a fast connection: 1.6-3 seconds.

Tested on a fast connection with 144 crates.io dependencies used by servo. It required 344 requests:

  • over HTTP/1.1: 1.2-2s.
  • over HTTP/2: 0.7-1.3s.

As predicted, more dependencies make lookup faster, because lookup starts with more parallelism.

Just [email protected] - 8 requests:

  • over HTTP/1.1: 0.3s
  • over HTTP/2: 0.2s

All these tests were simulation of the worst case of a blank state with no cache.

@kornelski
Copy link
Contributor Author

kornelski commented Dec 8, 2019

I've added cache and revalidation support. Cache flattens the check's critical path to a depth of 1, and revalidation makes bandwidth use minimal, so this makes all checks (regardless of crate(s) checked) almost equally fast.

With disk cache and a fast connection:

  • HTTP/1: 0.8s
  • HTTP/2: 0.4s

With disk cache and a 3G connection:

  • HTTP/1: 1.6-2.4s
  • HTTP/2: 0.7s-2s

@pietroalbini
Copy link
Member

A problem this RFC doesn't seem to address with its local cache is deleting entries from the index: it's something we do very rarely, but sometimes we're legally required to do so (DMCA, GDPR...), and in those caches we can't leave the deleted entries in the users' local caches. The git index solves this cleanly (it's just yet another commit).

The Index is generously hosted by GitHub with a good CDN. We do not do things that will get it kicked out (aka shallow clone). Which side of that line is using raw.githubusercontent.com? I would like proactive assurance that switching will be ok. (I miss lolbench.rs and don't want that to happen to the Index.) Do we know someone at GitHub that can provide such assurance? If not can we get an estimate of the CDN bill? Switching may have to wait for that size bill to feel small.

If we do this we should just host the content on Rust infrastructure. We should still be on the 0.04$/GB CloudFront pricing tier, which is sustainable for those small files.

@kornelski
Copy link
Contributor Author

Note that Cargo currently does not delete crate tarballs and checked out copies of crates even after they have been removed from the index. Deleted metadata files continue to be distributed as part of git history until the index branch is squashed.

With a static file solution, when a client checks freshness of a deleted crate, it will make a request to the server and notice a 404/410/451 HTTP status. It can then be made to act accordingly, and clean up local data (even tarball and source checkout).

If the client is not interested in deleted crate, it won't check it, but chances are it never did, and didn't download it. If ability to immediately erase deleted data is important, then the "incremental changelog" feature can be extended to notify about deletions proactively.

@rfcbot rfcbot removed the proposed-final-comment-period Currently awaiting signoff of all team members in order to enter the final comment period. label Dec 19, 2020
@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this RFC. and removed final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. labels Dec 29, 2020
@rfcbot
Copy link
Collaborator

rfcbot commented Dec 29, 2020

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.

@ehuss ehuss merged commit 17b10d7 into rust-lang:master Jan 12, 2021
@ehuss
Copy link
Contributor

ehuss commented Jan 12, 2021

Huzzah! The @rust-lang/cargo team has decided to accept this RFC!

To track further discussion, subscribe to the tracking issue here: rust-lang/cargo#9069

@kornelski kornelski deleted the http-index branch January 13, 2021 00:08
@glandium
Copy link

I don't see git clone --depth=1 being discussed in the RFC. It reports transferring 29.97MiB, and allows for incremental updates after that.

@ssokolow
Copy link

ssokolow commented Jan 13, 2021

@glandium It was mentioned in the discussion. Expand all the collapsed bits and Ctrl+F for "shallow clone".

@glandium
Copy link

Oh, the RFC also has something about shallow clone, but I was interpreting it as narrow/partial clone for some reason. Sorry for the noise.

bors added a commit to rust-lang/cargo that referenced this pull request Mar 24, 2022
HTTP registry implementation

Implement HTTP registry support described in [RFC 2789](rust-lang/rfcs#2789).

Adds a new unstable flag `-Z http-registry` which allows cargo to interact with remote registries served over http rather than git. These registries can be identified by urls starting with `sparse+http://` or `sparse+https://`.

When fetching index metadata over http, cargo only downloads the metadata for needed crates, which can save significant time and bandwidth over git.

The format of the http index is identical to a checkout of a git-based index.

This change is based on `@jonhoo's` PR #8890.

cc `@Eh2406`

Remaining items:
- [x] Performance measurements
- [x] Make unstable only
- [x] Investigate unification of download system. Probably best done in separate change.
- [x] Unify registry tests (code duplication in `http_registry.rs`)
- [x] Use existing on-disk cache, rather than adding a new one.
@Eh2406 Eh2406 added the call-for-testing Add this label + test notes in comments to be included in TWiR `Call for Testing` section. label Jun 25, 2022
@Eh2406
Copy link
Contributor

Eh2406 commented Jun 25, 2022

This is ready for testing. Instructions are on the blog post, feedback is welcome on internals or as issues on the Cargo repo.

@U007D
Copy link

U007D commented Jun 28, 2022

Please note I have removed the call-for-testing label as this RFC will appear in the Call for Testing section of the next issue (# 449) of TWiR.

Please feel free to re-add the call-for-testing label if you would like this RFC to appear again in another issue of TWiR.

@U007D U007D removed the call-for-testing Add this label + test notes in comments to be included in TWiR `Call for Testing` section. label Jun 28, 2022
@thomcc thomcc added the A-registry Proposals relating to the registries of cargo. label Jul 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-registry Proposals relating to the registries of cargo. disposition-merge This RFC is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this RFC. T-cargo Relevant to the Cargo team, which will review and decide on the RFC. to-announce
Projects
No open projects
Status: Done (Stabilized)
Development

Successfully merging this pull request may close these issues.