
Add delays to network retries. #11881

Merged

merged 6 commits on Apr 1, 2023

Conversation

@ehuss (Contributor) commented Mar 23, 2023

This adds a delay to network retries to help guard against short-lived transient errors.

The overview of how this works is: Requests that fail are placed into a queue with a timestamp of when they should be retried. The download code then checks for any requests that need to be retried, and re-injects them back into curl.

Care must be taken in the download loops to track which requests are currently in flight plus which requests are waiting to be retried.
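
A rough sketch of that bookkeeping (hypothetical types and names, not the PR's exact API):

use std::collections::HashMap;
use std::time::{Duration, Instant};

struct Download; // stand-in for the real per-request state

struct Downloads {
    in_flight: HashMap<usize, Download>,       // registered with curl
    sleeping: Vec<(Instant, usize, Download)>, // waiting for their retry time
}

impl Downloads {
    // On a retryable error, park the request with its retry deadline.
    fn schedule_retry(&mut self, token: usize, dl: Download, delay: Duration) {
        self.sleeping.push((Instant::now() + delay, token, dl));
    }

    // Each pass of the download loop: anything whose deadline has passed
    // goes back into the in-flight set (and, in the real code, its curl
    // handle is re-added to the multi handle).
    fn wake_ready(&mut self, now: Instant) {
        for (deadline, token, dl) in std::mem::take(&mut self.sleeping) {
            if deadline <= now {
                self.in_flight.insert(token, dl);
            } else {
                self.sleeping.push((deadline, token, dl));
            }
        }
    }

    // The loop finishes only when nothing is in flight *and* nothing is
    // still waiting to be retried.
    fn is_done(&self) -> bool {
        self.in_flight.is_empty() && self.sleeping.is_empty()
    }
}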

This also adds jitter to the retry times so that multiple failed requests don't all flood the server when they are retried. This is a very primitive form of jitter which suffers from clumping, but I think it should be fine.

The max number of retries is raised to 3 in order to bring the total retry time to around 10 seconds. This is intended to address Cloudfront's default negative TTL of 10 seconds. The retry schedule looks like 0.5 seconds plus up to 1 second of jitter, then 3.5 seconds, then 6.5 seconds, for a total of 10.5 to 11.5 seconds. If the user configures additional retries, each attempt afterwards has a max delay of 10 seconds.
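
As an illustration, a minimal sketch of that schedule (assuming the rand crate; constants and names are illustrative, not necessarily the PR's exact code):

use rand::Rng;

/// Delay before the nth retry (1-based): 0.5s plus up to 1s of jitter for
/// the first retry, then 3.5s, then 6.5s, capped at 10s thereafter.
fn retry_delay_ms(retry: u64) -> u64 {
    const MAX_DELAY_MS: u64 = 10_000;
    if retry == 1 {
        500 + rand::thread_rng().gen_range(0..1000)
    } else {
        ((retry - 1) * 3_000 + 500).min(MAX_DELAY_MS)
    }
}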

The retry times are currently not user-configurable. This could potentially be added in the future, but I don't think it is particularly necessary for now.

There is an unfortunate amount of duplication between the crate download and HTTP index code. I think in the near future we should work on consolidating that code. That might be challenging, since their use patterns are different, but I think it is feasible.

@rustbot (Collaborator) commented Mar 23, 2023

r? @epage

(rustbot has picked a reviewer for you, use r? to override)

@rustbot added labels on Mar 23, 2023: A-documenting-cargo-itself (Area: Cargo's documentation), A-git (Area: anything dealing with git), A-registries (Area: registries), A-sparse-registry (Area: http sparse registries), A-testing-cargo-itself (Area: cargo's tests), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties)
} else {
let min_timeout = Duration::new(1, 0);
let timeout = self.set.multi.get_timeout()?.unwrap_or(min_timeout);
let timeout = timeout.min(min_timeout);
Contributor:

Could this end up waiting longer than delay? Do we care?

Contributor Author:

It is possible to wait here past the deadline for retrying a sleeping request. However, this is at most 1 second, so it shouldn't matter too much, as the exact time something is retried isn't that important. This should usually return within much less than a second (usually around 200ms), and I believe even that long of a delay only happens when there is no network traffic.

use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

pub struct SleepTracker<T> {
heap: BinaryHeap<Sleeper<T>>,
Contributor:

Does this need to be a min heap?

Contributor Author:

I believe this is correct because the custom PartialEq implementation reverses the comparison of the timestamps. Another option is to use Reverse, but since it already needed a custom impl (to pick just one field), I figured it wasn't necessary.

Contributor:

That all seems good. Should we have a comment on the BinaryHeap and the PartialEq pointing out this important detail?
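
For reference, a sketch of the reversed ordering being discussed, with the kind of comment suggested here (field names are assumed from the excerpt above and may not match the PR exactly):

use std::cmp::Ordering;
use std::collections::BinaryHeap;
use std::time::Instant;

// Entry in the sleep queue: `wakeup` is when the request should be retried.
struct Sleeper<T> {
    wakeup: Instant,
    data: T,
}

impl<T> PartialEq for Sleeper<T> {
    fn eq(&self, other: &Self) -> bool {
        self.wakeup == other.wakeup
    }
}

impl<T> Eq for Sleeper<T> {}

impl<T> PartialOrd for Sleeper<T> {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl<T> Ord for Sleeper<T> {
    // NOTE: the comparison is intentionally reversed so that BinaryHeap
    // (a max-heap) pops the entry with the *earliest* wakeup time first,
    // i.e. it behaves as a min-heap.
    fn cmp(&self, other: &Self) -> Ordering {
        other.wakeup.cmp(&self.wakeup)
    }
}

// The earliest-waking entry is always at the top of the heap.
fn next_to_wake<T>(heap: &BinaryHeap<Sleeper<T>>) -> Option<&Sleeper<T>> {
    heap.peek()
}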

@Eh2406 (Contributor) commented Mar 24, 2023

Also, did we want to add some limit on the total number of retries system-wide? Like if 20 requests need a retry, then the registry is probably down/overloaded and retrying is just making it harder for the registry to come back up.

@ehuss (Contributor, Author) commented Mar 25, 2023

Also, did we want to add some limit on the total number of retries system-wide? Like if 20 requests need a retry, then the registry is probably down/overloaded and retrying is just making it harder for the registry to come back up.

Can you say more about how you would envision that working?

I'm not sure how that would work beyond what is implemented here. If the registry is currently being overwhelmed, cargo will back off and smear the requests over some period of time.

If there is some hard limit for proactively giving up, I think that could be difficult to tune. Cargo inherently sends hundreds of requests all at once. In a situation where the server is having a temporary upset, it may very well recover within the existing retry schedule, even with hundreds of requests.

The current schedule is geared around the behaviors that I'm familiar with. Most recoverable transient errors can recover within a very short period of time. The second cliff is the set of errors that are recoverable within a moderately short period of time. Beyond that, most recovery won't happen for minutes/hours/days, which is longer than I think Cargo should bother waiting.

One of the hypothetical scenarios this is intended for is a temporary spike that triggers a rate limit. After waiting for Cloudfront's 10s negative TTL, things should be cleared up if there isn't a sustained barrage of traffic.

There could be a slower resume, with some intelligence of "ok, requests are succeeding again, let them go through faster" vs "requests are still failing, let's slow down even more". This would be similar to a slow-start algorithm, but only engaging when errors are encountered. However, I'm concerned about the complexity of implementing such a solution.

@Eh2406 (Contributor) commented Mar 26, 2023

Can you say more about how you would envision that working?

All I had in mind was a global static RETRY_COUNT = AtomicU8::new(20), and then subtracting one from it before doing each retry. When it hits zero, cargo errors out instead of continuing to retry.

Cargo inherently sends hundreds of requests all at once. In a situation where the server is having a temporary upset, it may very well recover within the existing retry schedule, even with hundreds of requests.

That is a problem for the "simple" thing I had in mind. It's entirely possible that cargo has started 200 requests, all of which will fail and all of which will succeed on their retry. :-/

Perhaps every time we make a "first request" for a foreign asset we increase the RETRY_COUNT. Now we have a system that lets us retry everything once, but only 20 things more than that. I don't know if this gets us the traffic shaping benefits I discuss below.
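
A rough sketch of that combined idea (hypothetical names; the suggestion above uses an AtomicU8, but an AtomicUsize is used here so the budget can also be bumped on every first request):

use std::sync::atomic::{AtomicUsize, Ordering};

// Global retry budget, shared by all in-flight downloads.
static RETRY_BUDGET: AtomicUsize = AtomicUsize::new(20);

// Called once per fresh (non-retry) request, so every request gets at
// least one retry while the fixed 20 bounds how many get more than that.
fn on_first_request() {
    RETRY_BUDGET.fetch_add(1, Ordering::SeqCst);
}

// Returns true if a retry may proceed; false means the budget is spent
// and the caller should surface the error instead of retrying.
fn try_take_retry() -> bool {
    RETRY_BUDGET
        .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1))
        .is_ok()
}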

Most recoverable transient errors can recover within a very short period of time. The second cliff is the set of errors that are recoverable within a moderately short period of time. Beyond that, most recovery won't happen for minutes/hours/days, which is longer than I think Cargo should bother waiting.

Crates.io is set up with infrastructure that tends to either be up or down, not infrastructure that might be running at degraded service. Not all registries will work that way. For example a private registry might need to do a database call for every request that comes in.

One common pattern with systems is that they run entirely smoothly while close to capacity. Then some small hiccup happens, say a small number of requests take longer to process than they should. Connections start timing out before a response is ready, and by the time the server is ready to start processing the next request, it's even further into that connection's timeout. This isn't entirely tragic; there is excess capacity that the server can use to catch up. It was running at, say, 90% of theoretical capacity, but thanks to the backlog it is now running flat out and only getting 99% of requests in on time. The 1% of clients that received connection timeout errors retry. The server is now receiving 1% more traffic, and it was already effectively overloaded, so now 2% of requests time out and 2% of clients retry. The server is even more overloaded and 4% retry. This pattern continues until no actual work is getting done, because as the queue builds up, the requests the server picks up have been waiting longer and are closer to timing out.

The recurrence relation is something like traffic = base traffic + retries, with retries ~= max(0, traffic - capacity), which tends towards infinity. If the clients limit retries to < 10% of base traffic, then a server running with 10% of headroom over base traffic will be able to catch up even if all requests start hitting retries.
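
A toy model of that recurrence (purely illustrative numbers, nothing cargo actually implements):

// traffic = base + retries; whatever exceeds capacity becomes next
// tick's retries, optionally capped at a fraction of base traffic.
fn step(base: f64, capacity: f64, retries: f64, retry_cap: Option<f64>) -> f64 {
    let traffic = base + retries;
    let mut next = (traffic - capacity).max(0.0);
    if let Some(cap) = retry_cap {
        next = next.min(cap * base);
    }
    next
}

fn main() {
    // A hiccup temporarily drops effective capacity below base traffic.
    let (base, degraded, recovered) = (100.0, 95.0, 110.0);
    let (mut uncapped, mut capped) = (0.0, 0.0);
    for tick in 0..5 {
        uncapped = step(base, degraded, uncapped, None);
        capped = step(base, degraded, capped, Some(0.10));
        println!("degraded tick {tick}: uncapped={uncapped:.0} capped={capped:.0}");
    }
    // The uncapped backlog grows by 5 each tick; the capped one stops at 10.
    // Once capacity recovers, the capped client's traffic (110) fits inside
    // the 10% headroom and clears immediately, while the uncapped backlog
    // takes a few more ticks to drain.
    println!(
        "after recovery: uncapped={:.0} capped={:.0}",
        step(base, recovered, uncapped, None),
        step(base, recovered, capped, Some(0.10))
    );
}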

There is also a problem with nested retries. If cargo will retry all of its requests three times, but cargo is run in a CI environment that will also retry three times, then there is nine times as much traffic hitting the server when it's down than when it's running smoothly. Lots of systems would consider 9x normal traffic to be DDoS-level. If cargo limits the total number of retries to 10% of base, then even with the CI retries there is only a 3.3x increase.

One of the hypothetical scenarios this is intended for is a temporary spike that triggers a rate limit. After waiting for Cloudfront's 10s negative TTL, things should be cleared up if there isn't a sustained barrage of traffic.

This got me wondering: when we have this 503 issue, is Cloudfront returning 503 for all endpoints or only for specific packages? If most packages are still cached correctly and working, but a handful are being retried, then even my simple suggestion should work well.

I just generated a bunch of words, but I don't know what we should do next. Sorry. Did anything I say make any sense?

@ehuss (Contributor, Author) commented Mar 26, 2023

This got me wondering: when we have this 503 issue, is Cloudfront returning 503 for all endpoints or only for specific packages?

Based on my reading of the documentation, I think it is not a simple answer. I believe the rate limit on S3 is per prefix. The prefix is everything up to the last slash. But the docs also say that the partitioning is adjusted dynamically based on the traffic patterns. So if everything is partitioned completely, then I think the rate limit would be based on the first 4 letters (that is, /ca/rg/* would all have the same rate limit). However, if it is not fully partitioned, then the rate limit could be broader (perhaps the entire bucket?).

I'm pretty sure Cloudfront's cache will be per endpoint.

But these are just guesses based on the available docs.

@Eh2406 (Contributor) commented Mar 27, 2023

Let's not delay this desperately needed PR for the "cap on retries" functionality. We can move it to an issue and come back to it later.

@ehuss (Contributor, Author) commented Mar 31, 2023

@epage I was wondering if you've had a chance to look at this? I think this is a relatively high priority change that I would like to see get into beta to help mitigate some risk with the registry change.

@epage (Contributor) left a comment

Sorry, I saw @Eh2406 was looking at this and was leaving it to him.

I have some nits, but I'm ok with merging as-is.

src/cargo/sources/registry/http_remote.rs (outdated review thread, resolved)
src/cargo/util/network/retry.rs (outdated review thread, resolved)
@epage (Contributor) commented Mar 31, 2023

@bors r+

@bors (Collaborator) commented Mar 31, 2023

📌 Commit 4fdea65 has been approved by epage

It is now in the queue for this repository.

@bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests. Bors will change the label on completion) and removed the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) on Mar 31, 2023
@bors (Collaborator) commented Mar 31, 2023

⌛ Testing commit 4fdea65 with merge 0e474cf...

@bors (Collaborator) commented Apr 1, 2023

☀️ Test successful - checks-actions
Approved by: epage
Pushing 0e474cf to master...

@bors merged commit 0e474cf into rust-lang:master on Apr 1, 2023
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 1, 2023
Update cargo

9 commits in 145219a9f089f8b57c09f40525374fbade1e34ae..0e474cfd7b16b018cf46e95da3f6a5b2f1f6a9e7
2023-03-27 01:56:36 +0000 to 2023-03-31 23:15:58 +0000
- Add delays to network retries. (rust-lang/cargo#11881)
- Add a note to `cargo logout` that it does not revoke the token. (rust-lang/cargo#11919)
- Sync external-tools JSON docs. (rust-lang/cargo#11918)
- Drop derive feature from serde in cargo-platform (rust-lang/cargo#11915)
- Disable test_profile test on windows-gnu (rust-lang/cargo#11916)
- src/doc/src/reference/build-scripts.md: a{n =>} benchmark target (rust-lang/cargo#11908)
- Documented working directory behaviour for `cargo test`, `cargo bench` and `cargo run` (rust-lang/cargo#11901)
- docs(contrib): Link to office hours doc (rust-lang/cargo#11903)
- chore: Upgrade to clap v4.2 (rust-lang/cargo#11904)
ehuss pushed a commit to ehuss/cargo that referenced this pull request Apr 1, 2023
@ehuss added this to the 1.70.0 milestone on Apr 3, 2023
ehuss pushed a commit to ehuss/cargo that referenced this pull request Apr 4, 2023