
peering: retry establishing connection more quickly on certain errors (#13938)

Merged
lkysow merged 6 commits into main from lkysow/peering-non-leader-backoff
Jul 29, 2022

Conversation

@lkysow (Contributor) commented Jul 28, 2022

When running with an LB in front of the Consul servers, dialers need to retry their requests until they land on a leader server. The default retry backoff was too long because it started at 2s and went exponentially from there: 4s, 8s, etc. This change adds new retry logic for specific errors to retry quickly.

I opted to treat all FailedPrecondition errors as quick-retry errors because they should all be resolved quickly.

Changes:

  1. replace retryLoopBackoff with retryLoopBackoffPeering that implements the new retry mechanic
  2. change logging everywhere we receive a FailedPrecondition error so it isn't logged at the error level, since these errors aren't indicative of an actual problem
  3. refactor the main HandleStream() loop to add a new channel for errors. Previously, any error received from Recv() would be swallowed and then we'd close the recvCh and use that closing to trigger the loop to exit. Now we're sending that error through the new channel and handling it within the main for{} loop.

@lkysow lkysow added the pr/no-changelog PR does not need a corresponding .changelog entry label Jul 28, 2022
@kisunji (Contributor) left a comment


Overall looks good but I left a few questions

@lkysow lkysow force-pushed the lkysow/peering-non-leader-backoff branch 3 times, most recently from acf7c98 to 1a95284 on July 28, 2022 21:25
@lkysow lkysow changed the title from "WIP: backoff on non-leader err" to "peering: retry establishing connection more quickly on certain errors" on Jul 28, 2022
@lkysow (Contributor, Author)

Log these errors differently so we don't spam the logs with errors that aren't actual problems.

@lkysow (Contributor, Author) Jul 28, 2022

Just made up this backoff algo. Open to feedback on the params cc @ndhanushkodi

It starts with a constant-time backoff of 8ms for the first 5 attempts, then switches to exponential backoff: 16ms, 32ms, etc., maxing out at 8192ms => ~8s.

@lkysow (Contributor, Author)

No longer swallowing this EOF error because if the stream is disconnected at this early stage then we definitely want to retry connecting. We should only ever return nil if the stream is disconnected gracefully and we don't want to reconnect.

@lkysow (Contributor, Author)

This function is refactored to send errors through errCh. Since we're sending errors through, we don't need to close recvChan as a signal that this goroutine has exited. Instead in our main for{} we can receive the error and then know that this goroutine has exited.

@lkysow (Contributor, Author)

We don't close recvChan anymore, so there's no point checking whether it's open.

When we receive a FailedPrecondition error, retry that more quickly
because we expect it will resolve shortly. This is particularly
important in the context of Consul servers behind a load balancer
because when establishing a connection we have to retry until we
randomly land on a leader node.

The default retry backoff goes from 2s, 4s, 8s, etc. which can result in
very long delays quite quickly. Instead, this backoff retries in 8ms
five times, then goes exponentially from there: 16ms, 32ms, ... up to a
max of 8192ms.
@lkysow lkysow force-pushed the lkysow/peering-non-leader-backoff branch from 1a95284 to 4d5373c on July 28, 2022 21:43
@lkysow lkysow marked this pull request as ready for review July 29, 2022 17:23
func retryLoopBackoffPeering(ctx context.Context, logger hclog.Logger, loopFn func() error, errFn func(error),
	retryTimeFn func(failedAttempts uint, loopErr error) time.Duration) {
	var failedAttempts uint
	var err error
@lkysow (Contributor, Author)

err doesn't need to be declared here.

@kisunji (Contributor) left a comment


Looks reasonable! Approving and hopefully we can tweak as needed during beta
