peering: retry establishing connection more quickly on certain errors#13938
Conversation
kisunji
left a comment
Overall looks good but I left a few questions
Force-pushed acf7c98 to 1a95284
agent/consul/leader_peering.go
Log these errors differently so as not to spam the logs with entries that aren't actual errors.
agent/consul/leader_peering.go
Just made up this backoff algo. Open to feedback on the params cc @ndhanushkodi
It starts with a constant-time backoff of 8ms for the first 5 attempts, then switches to exponential backoff: 16ms, 32ms, etc., maxing out at 8192ms (about 8s).
No longer swallowing this EOF error because if the stream is disconnected at this early stage then we definitely want to retry connecting. We should only ever return nil if the stream is disconnected gracefully and we don't want to reconnect.
This function is refactored to send errors through errCh. Since we're sending errors through, we don't need to close recvChan as a signal that this goroutine has exited. Instead in our main for{} we can receive the error and then know that this goroutine has exited.
We don't close recvChan anymore, so there's no point checking whether it's open.
When we receive a FailedPrecondition error, retry more quickly because we expect it to resolve shortly. This is particularly important in the context of Consul servers behind a load balancer, because when establishing a connection we have to retry until we randomly land on the leader node. The default retry backoff goes 2s, 4s, 8s, etc., which can result in very long delays quite quickly. Instead, this backoff retries at 8ms five times, then grows exponentially from there: 16ms, 32ms, ... up to a max of 8192ms.
Force-pushed 1a95284 to 4d5373c
```go
func retryLoopBackoffPeering(ctx context.Context, logger hclog.Logger, loopFn func() error, errFn func(error),
	retryTimeFn func(failedAttempts uint, loopErr error) time.Duration) {
	var failedAttempts uint
	var err error
```
err doesn't need to be here
kisunji
left a comment
Looks reasonable! Approving and hopefully we can tweak as needed during beta
When running with an LB in front of the Consul servers, dialers need to retry their requests until they land on a leader server. The default retry backoff was too long because it started at 2s and went exponentially from there: 4s, 8s, etc. This change adds new retry logic for specific errors to retry quickly.
I opted to treat all `FailedPrecondition` errors as quick-retry errors because they should all be resolved quickly.

Changes:

- Replace `retryLoopBackoff` with `retryLoopBackoffPeering`, which implements the new retry mechanic
- Change the `FailedPrecondition` error to not log as an error, because these errors aren't indicative of an actual problem
- Refactor the `HandleStream()` loop to add a new channel for errors. Previously, any error received from `Recv()` would be swallowed and then we'd close the `recvCh` and use that closing to trigger the loop to exit. Now we're sending that error through the new channel and handling it within the main `for{}` loop.