
peering: retry establishing connection more quickly on certain errors (#13938)

Merged
lkysow merged 6 commits into main from lkysow/peering-non-leader-backoff
Jul 29, 2022

Conversation

@lkysow (Contributor) commented Jul 28, 2022

When running with an LB in front of the Consul servers, dialers need to retry their requests until they land on a leader server. The default retry backoff was too long because it started at 2s and went exponentially from there: 4s, 8s, etc. This change adds new retry logic for specific errors to retry quickly.

I opted to treat all FailedPrecondition errors as quick-retry errors because they should all be resolved quickly.

Changes:

  1. replace retryLoopBackoff with retryLoopBackoffPeering that implements the new retry mechanic
  2. change logging everywhere we receive a FailedPrecondition error so it isn't logged at the error level, since these errors aren't indicative of an actual problem
  3. refactor the main HandleStream() loop to add a new channel for errors. Previously, any error received from Recv() would be swallowed and then we'd close the recvCh and use that closing to trigger the loop to exit. Now we're sending that error through the new channel and handling it within the main for{} loop.

@lkysow lkysow added the pr/no-changelog PR does not need a corresponding .changelog entry label Jul 28, 2022
@kisunji (Contributor) left a comment


Overall looks good but I left a few questions

@lkysow lkysow force-pushed the lkysow/peering-non-leader-backoff branch 3 times, most recently from acf7c98 to 1a95284 on July 28, 2022 21:25
@lkysow lkysow changed the title from "WIP: backoff on non-leader err" to "peering: retry establishing connection more quickly on certain errors" on Jul 28, 2022
@lkysow (Contributor, Author)

Log these errors differently so we don't spam the logs with errors that aren't actual problems.

@lkysow (Contributor, Author) Jul 28, 2022

Just made up this backoff algo. Open to feedback on the params cc @ndhanushkodi

It starts with a constant-time backoff of 8ms for the first 5 attempts, then switches to exponential backoff: 16ms, 32ms, etc., maxing out at 8192ms => ~8s.

@lkysow (Contributor, Author)

No longer swallowing this EOF error because if the stream is disconnected at this early stage then we definitely want to retry connecting. We should only ever return nil if the stream is disconnected gracefully and we don't want to reconnect.

@lkysow (Contributor, Author)

This function is refactored to send errors through errCh. Since we're sending errors through, we don't need to close recvChan as a signal that this goroutine has exited. Instead in our main for{} we can receive the error and then know that this goroutine has exited.

@lkysow (Contributor, Author)

We don't close recvChan anymore, so there's no point checking whether it's open.

When we receive a FailedPrecondition error, retry that more quickly
because we expect it will resolve shortly. This is particularly
important in the context of Consul servers behind a load balancer
because when establishing a connection we have to retry until we
randomly land on a leader node.

The default retry backoff goes from 2s, 4s, 8s, etc. which can result in
very long delays quite quickly. Instead, this backoff retries in 8ms
five times, then goes exponentially from there: 16ms, 32ms, ... up to a
max of 8192ms.
@lkysow lkysow force-pushed the lkysow/peering-non-leader-backoff branch from 1a95284 to 4d5373c on July 28, 2022 21:43
@lkysow lkysow marked this pull request as ready for review July 29, 2022 17:23
func retryLoopBackoffPeering(ctx context.Context, logger hclog.Logger, loopFn func() error, errFn func(error),
	retryTimeFn func(failedAttempts uint, loopErr error) time.Duration) {
	var failedAttempts uint
	var err error
@lkysow (Contributor, Author)

err doesn't need to be declared here.

@kisunji (Contributor) left a comment


Looks reasonable! Approving and hopefully we can tweak as needed during beta
