
Add delays to network retries. #11881

Merged

merged 6 commits on Apr 1, 2023

Conversation

@ehuss (Contributor) commented Mar 23, 2023

This adds a delay to network retries to help guard against short-lived transient errors.

The overview of how this works is: Requests that fail are placed into a queue with a timestamp of when they should be retried. The download code then checks for any requests that need to be retried, and re-injects them back into curl.

Care must be taken in the download loops to track which requests are currently in flight plus which requests are waiting to be retried.
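
A rough sketch of that bookkeeping (hypothetical types and names, not the PR's exact API):

use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Download; // stand-in for the real per-request state

struct Downloads {
    in_flight: HashMap<usize, Download>,       // registered with curl
    sleeping: Vec<(Instant, usize, Download)>, // waiting for their retry time
}

impl Downloads {
    // On a retryable error, park the request with its retry deadline.
    fn schedule_retry(&mut self, token: usize, dl: Download, delay: Duration) {
        self.sleeping.push((Instant::now() + delay, token, dl));
    }

    // Each pass of the download loop: anything whose deadline has passed
    // goes back into the in-flight set (and, in the real code, its curl
    // handle is re-added to the multi handle).
    fn wake_ready(&mut self, now: Instant) {
        for (deadline, token, dl) in std::mem::take(&mut self.sleeping) {
            if deadline <= now {
                self.in_flight.insert(token, dl);
            } else {
                self.sleeping.push((deadline, token, dl));
            }
        }
    }

    // The loop finishes only when nothing is in flight *and* nothing is
    // still waiting to be retried.
    fn is_done(&self) -> bool {
        self.in_flight.is_empty() && self.sleeping.is_empty()
    }
}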

This also adds jitter to the retry times so that multiple failed requests don't all flood the server when they are retried. This is a very primitive form of jitter which suffers from clumping, but I think it should be fine.

The max number of retries is raised to 3 in order to bring the total retry time to around 10 seconds. This is intended to address Cloudfront's default negative TTL of 10 seconds. The retry schedule looks like 0.5 seconds plus up to 1 second of jitter, then 3.5 seconds, then 6.5 seconds, for a total of 10.5 to 11.5 seconds. If the user configures additional retries, each attempt afterwards has a max delay of 10 seconds.
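
As an illustration, a minimal sketch of that schedule (assuming the rand crate; constants and names are illustrative, not necessarily the PR's exact code):

use rand::Rng;

/// Delay before the nth retry (1-based): 0.5s plus up to 1s of jitter for
/// the first retry, then 3.5s, then 6.5s, capped at 10s thereafter.
fn retry_delay_ms(retry: u64) -> u64 {
    const MAX_DELAY_MS: u64 = 10_000;
    if retry == 1 {
        500 + rand::thread_rng().gen_range(0..1000)
    } else {
        ((retry - 1) * 3_000 + 500).min(MAX_DELAY_MS)
    }
}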

The retry times are currently not user-configurable. This could potentially be added in the future, but I don't think it is particularly necessary for now.

There is an unfortunate amount of duplication between the crate download and HTTP index code. I think in the near future we should work on consolidating that code. That might be challenging, since their use patterns are different, but I think it is feasible.

@rustbot (Collaborator) commented Mar 23, 2023

r? @epage

(rustbot has picked a reviewer for you, use r? to override)

@rustbot added labels on Mar 23, 2023: A-documenting-cargo-itself (Area: Cargo's documentation), A-git (Area: anything dealing with git), A-registries (Area: registries), A-sparse-registry (Area: http sparse registries), A-testing-cargo-itself (Area: cargo's tests), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties)
} else {
let min_timeout = Duration::new(1, 0);
let timeout = self.set.multi.get_timeout()?.unwrap_or(min_timeout);
let timeout = timeout.min(min_timeout);
Contributor:

Could this end up waiting longer than delay? Do we care?

Contributor Author:

It is possible to wait here past the deadline for retrying a sleeping request. However, this is at most 1 second, so it shouldn't matter too much, as the exact time something is retried isn't that important. This should usually return within much less than a second (usually around 200ms), and I believe even that long of a delay only happens when there is no network traffic.

use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

pub struct SleepTracker<T> {
heap: BinaryHeap<Sleeper<T>>,
Contributor:

Does this need to be a min heap?

Contributor Author:

I believe this is correct because the custom PartialEq implementation reverses the comparison of the timestamps. Another option is to use Reverse, but since it already needed a custom impl (to pick just one field), I figured it wasn't necessary.

Contributor:

That all seems good. Should we have a comment on the BinaryHeap and the PartialEq pointing out this important detail?
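
For reference, a sketch of the reversed ordering being discussed, with the kind of comment suggested here (field names are assumed from the excerpt above and may not match the PR exactly):

use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::time::Instant;

// Entry in the sleep queue: `wakeup` is when the request should be retried.
struct Sleeper<T> {
    wakeup: Instant,
    data: T,
}

impl<T> PartialEq for Sleeper<T> {
    fn eq(&self, other: &Self) -> bool {
        self.wakeup == other.wakeup
    }
}

impl<T> Eq for Sleeper<T> {}

impl<T> PartialOrd for Sleeper<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<T> Ord for Sleeper<T> {
    // NOTE: the comparison is intentionally reversed so that BinaryHeap
    // (a max-heap) pops the entry with the *earliest* wakeup time first,
    // i.e. it behaves as a min-heap.
    fn cmp(&self, other: &Self) -> Ordering {
        other.wakeup.cmp(&self.wakeup)
    }
}

// The earliest-waking entry is always at the top of the heap.
fn next_to_wake<T>(heap: &BinaryHeap<Sleeper<T>>) -> Option<&Sleeper<T>> {
    heap.peek()
}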

@Eh2406 (Contributor) commented Mar 24, 2023

Also, did we want to add some limit on the total number of retries system-wide? Like if 20 requests need a retry, then the registry is probably down/overloaded and retrying is just making it harder for the registry to come back up.

@ehuss (Contributor, Author) commented Mar 25, 2023

Also, did we want to add some limit on the total number of retries system-wide? Like if 20 requests need a retry, then the registry is probably down/overloaded and retrying is just making it harder for the registry to come back up.

Can you say more about how you would envision that working?

I'm not sure how that would work beyond what is implemented here. If the registry is currently being overwhelmed, cargo will back off and smear the requests over some period of time.

If there is some hard limit for proactively giving up, I think that could be difficult to tune. Cargo inherently sends hundreds of requests all at once. In a situation where the server is having a temporary upset, it may very well recover within the existing retry schedule, even with hundreds of requests.

The current schedule is geared around the behaviors that I'm familiar with. Most recoverable transient errors can recover within a very short period of time. The second cliff is the set of errors that are recoverable within a moderately short period of time. Beyond that, most recovery won't happen for minutes/hours/days, which is longer than I think Cargo should bother waiting.

One of the hypothetical scenarios this is intended for is a temporary spike that triggers a rate limit. After waiting for Cloudfront's 10s negative TTL, things should be cleared up if there isn't a sustained barrage of traffic.

There could be a slower resume, with some intelligence of "ok, requests are succeeding again, let them go through faster" vs "requests are still failing, let's slow down even more". This would be similar to a slow-start algorithm, but only engaging when errors are encountered. However, I'm concerned about the complexity of implementing such a solution.

@Eh2406 (Contributor) commented Mar 26, 2023

Can you say more about how you would envision that working?

All I had in mind was a global static RETRY_COUNT = AtomicU8::new(20), and then subtracting one from it before doing each retry. When it hits zero, cargo errors out instead of continuing to retry.

Cargo inherently sends hundreds of requests all at once. In a situation where the server is having a temporary upset, it may very well recover within the existing retry schedule, even with hundreds of requests.

That is a problem for the "simple" thing I had in mind. It's entirely possible that cargo has started 200 requests, all of which will fail and all of which will succeed on their retry. :-/

Perhaps every time we make a "first request" for a foreign asset we increase the RETRY_COUNT. Now we have a system that lets us retry everything once, but only 20 things more than that. I don't know if this gets us the traffic shaping benefits I discuss below.
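
A rough sketch of that combined idea (hypothetical names; the suggestion above uses an AtomicU8, but an AtomicUsize is used here so the budget can also be bumped on every first request):

use std::sync::atomic::{AtomicUsize, Ordering};

// Global retry budget, shared by all in-flight downloads.
static RETRY_BUDGET: AtomicUsize = AtomicUsize::new(20);

// Called once per fresh (non-retry) request, so every request gets at
// least one retry while the fixed 20 bounds how many get more than that.
fn on_first_request() {
    RETRY_BUDGET.fetch_add(1, Ordering::SeqCst);
}

// Returns true if a retry may proceed; false means the budget is spent
// and the caller should surface the error instead of retrying.
fn try_take_retry() -> bool {
    RETRY_BUDGET
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1))
        .is_ok()
}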

Most recoverable transient errors can recover within a very short period of time. The second cliff is the set of errors that are recoverable within a moderately short period of time. Beyond that, most recovery won't happen for minutes/hours/days, which is longer than I think Cargo should bother waiting.

Crates.io is set up with infrastructure that tends to either be up or down, not infrastructure that might be running at degraded service. Not all registries will work that way. For example a private registry might need to do a database call for every request that comes in.

One common pattern with systems is that they run entirely smoothly while close to capacity. Then some small hiccup happens, say a small number of requests take longer to process than they should. Connections start timing out before a response is ready, and by the time the server is ready to start processing the next request, it's even further into that connection's timeout. This isn't entirely tragic; there is excess capacity that the server can use to catch up. It was running at, say, 90% of theoretical capacity, but thanks to the backlog it is now running flat out and only getting 99% of requests in on time. The 1% of clients that received connection timeout errors retry. The server is now receiving 1% more traffic, and it was already effectively overloaded, so now 2% of requests time out and 2% of clients retry. The server is even more overloaded and 4% retry. This pattern continues until no actual work is getting done, because as the queue builds up, the requests the server picks up have been waiting longer and are closer to timing out.

The recurrence relation is something like traffic = base traffic + retries, with retries ~= max(0, traffic - capacity), which tends towards infinity. If the clients limit retries to < 10% of base traffic, then a server running with 10% of headroom over base traffic will be able to catch up even if all requests start hitting retries.
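
A toy model of that recurrence (purely illustrative numbers, nothing cargo actually implements):

// traffic = base + retries; whatever exceeds capacity becomes next
// tick's retries, optionally capped at a fraction of base traffic.
fn step(base: f64, capacity: f64, retries: f64, retry_cap: Option<f64>) -> f64 {
    let traffic = base + retries;
    let mut next = (traffic - capacity).max(0.0);
    if let Some(cap) = retry_cap {
        next = next.min(cap * base);
    }
    next
}

fn main() {
    // A hiccup temporarily drops effective capacity below base traffic.
    let (base, degraded, recovered) = (100.0, 95.0, 110.0);
    let (mut uncapped, mut capped) = (0.0, 0.0);
    for tick in 0..5 {
        uncapped = step(base, degraded, uncapped, None);
        capped = step(base, degraded, capped, Some(0.10));
        println!("degraded tick {tick}: uncapped={uncapped:.0} capped={capped:.0}");
    }
    // The uncapped backlog grows by 5 each tick; the capped one stops at 10.
    // Once capacity recovers, the capped client's traffic (110) fits inside
    // the 10% headroom and clears immediately, while the uncapped backlog
    // takes a few more ticks to drain.
    println!(
        "after recovery: uncapped={:.0} capped={:.0}",
        step(base, recovered, uncapped, None),
        step(base, recovered, capped, Some(0.10))
    );
}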

There is also a problem with nested retries. If cargo will retry all of its requests three times, but cargo is run in a CI environment that will also retry three times, then there is nine times as much traffic hitting the server when it's down than when it's running smoothly. Lots of systems would consider 9x normal traffic to be DDoS-level. If cargo limits the total number of retries to 10% of base, then even with the CI retries there is only a 3.3x increase.

One of the hypothetical scenarios this is intended for is a temporary spike that triggers a rate limit. After waiting for Cloudfront's 10s negative TTL, things should be cleared up if there isn't a sustained barrage of traffic.

This got me wondering: when we have this 503 issue, is Cloudfront returning 503 for all endpoints or only for specific packages? If most packages are still cached correctly and working, but a handful are being retried, then even my simple suggestion should work well.

I just generated a bunch of words, but I don't know what we should do next. Sorry. Did anything I say make any sense?

@ehuss (Contributor, Author) commented Mar 26, 2023

This got me wondering: when we have this 503 issue, is Cloudfront returning 503 for all endpoints or only for specific packages?

Based on my reading of the documentation, I think it is not a simple answer. I believe the rate limit on S3 is per prefix. The prefix is everything up to the last slash. But the docs also say that the partitioning is adjusted dynamically based on the traffic patterns. So if everything is partitioned completely, then I think the rate limit would be based on the first 4 letters (that is, /ca/rg/* would all have the same rate limit). However, if it is not fully partitioned, then the rate limit could be broader (perhaps the entire bucket?).

I'm pretty sure Cloudfront's cache will be per endpoint.

But these are just guesses based on the available docs.

@Eh2406 (Contributor) commented Mar 27, 2023

Let's not delay this desperately needed PR for the "cap on retries" functionality. We can move it to an issue and come back to it later.

@ehuss (Contributor, Author) commented Mar 31, 2023

@epage I was wondering if you've had a chance to look at this? I think this is a relatively high priority change that I would like to see get into beta to help mitigate some risk with the registry change.

@epage (Contributor) left a comment

Sorry, I saw @Eh2406 was looking at this and was leaving it to him.

I have some nits, but I'm ok with merging as-is.

src/cargo/sources/registry/http_remote.rs (outdated review thread, resolved)
src/cargo/util/network/retry.rs (outdated review thread, resolved)
@epage (Contributor) commented Mar 31, 2023

@bors r+

@bors (Collaborator) commented Mar 31, 2023

📌 Commit 4fdea65 has been approved by epage

It is now in the queue for this repository.

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion) and removed the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Mar 31, 2023
@bors (Collaborator) commented Mar 31, 2023

⌛ Testing commit 4fdea65 with merge 0e474cf...

@bors (Collaborator) commented Apr 1, 2023

☀️ Test successful - checks-actions
Approved by: epage
Pushing 0e474cf to master...

@bors merged commit 0e474cf into rust-lang:master on Apr 1, 2023
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 1, 2023
Update cargo

9 commits in 145219a9f089f8b57c09f40525374fbade1e34ae..0e474cfd7b16b018cf46e95da3f6a5b2f1f6a9e7
2023-03-27 01:56:36 +0000 to 2023-03-31 23:15:58 +0000
- Add delays to network retries. (rust-lang/cargo#11881)
- Add a note to `cargo logout` that it does not revoke the token. (rust-lang/cargo#11919)
- Sync external-tools JSON docs. (rust-lang/cargo#11918)
- Drop derive feature from serde in cargo-platform (rust-lang/cargo#11915)
- Disable test_profile test on windows-gnu (rust-lang/cargo#11916)
- src/doc/src/reference/build-scripts.md: a{n =>} benchmark target (rust-lang/cargo#11908)
- Documented working directory behaviour for `cargo test`, `cargo bench` and `cargo run` (rust-lang/cargo#11901)
- docs(contrib): Link to office hours doc (rust-lang/cargo#11903)
- chore: Upgrade to clap v4.2 (rust-lang/cargo#11904)
ehuss pushed a commit to ehuss/cargo that referenced this pull request Apr 1, 2023
@ehuss added this to the 1.70.0 milestone on Apr 3, 2023
ehuss pushed a commit to ehuss/cargo that referenced this pull request Apr 4, 2023