Retry service-busy errors after a delay #1174
Conversation
Pull Request Test Coverage Report for Build 0183905f-6dc6-40b7-861c-6866b8ff66be
💛 - Coveralls
#1167: Part 1 of 2 for solving retry storms, particularly around incorrectly-categorized errors (e.g. limit exceeded) and service-busy. This PR moves us to `errors.As` to support wrapped errors in the future, and re-categorizes some incorrectly-retried errors. This is both useful on its own, and it makes #1174 a smaller and clearer change. Service-busy behavior is actually changed in #1174; this commit intentionally maintains its current (flawed) behavior.
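As a rough illustration of the `errors.As` pattern described above (not the client's actual code), the sketch below uses placeholder error types and a hypothetical `isRetryableError` helper; unlike a direct type assertion, `errors.As` still matches when the error has been wrapped:

```go
package main

import (
	"errors"
	"fmt"
)

// ServiceBusyError and LimitExceededError are placeholder types standing in
// for the client's generated error types; the names here are illustrative.
type ServiceBusyError struct{ Message string }

func (e *ServiceBusyError) Error() string { return e.Message }

type LimitExceededError struct{ Message string }

func (e *LimitExceededError) Error() string { return e.Message }

// isRetryableError is a hypothetical helper showing the errors.As pattern.
func isRetryableError(err error) bool {
	var busy *ServiceBusyError
	if errors.As(err, &busy) {
		// Retryable, though #1174 additionally delays these retries.
		return true
	}
	var limit *LimitExceededError
	if errors.As(err, &limit) {
		// Previously mis-categorized as retryable; not transient.
		return false
	}
	return false
}

func main() {
	wrapped := fmt.Errorf("request failed: %w", &ServiceBusyError{Message: "busy"})
	fmt.Println(isRetryableError(wrapped)) // true, even though the error is wrapped
}
```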
@@ -103,16 +106,40 @@ Retry_Loop:
			return err
		}

		// Check if the error is retryable
		if isRetryable != nil && !isRetryable(err) {
`isRetryable == nil` was only true in tests, so that was changed, and now it's just assumed to exist in all cases.
//
// note that this is only a minimum, however. longer delays are assumed to
// be equally valid.
func ErrRetryableAfter(err error) (retryAfter time.Duration) {
Decided to move it here because it's tightly related to retry logic, and one thing needs it externally, so it's exposed. And I kinda like the `backoff.ErrRetryableAfter` package/name pairing; it keeps it clear that it's retry-backoff-related.
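A minimal sketch of what such a helper could look like, assuming a placeholder `ServiceBusyError` type; the real implementation in the `backoff` package may differ:

```go
package backoff

import (
	"errors"
	"time"
)

// ServiceBusyError stands in for the client's service-busy error type;
// the name and shape here are illustrative.
type ServiceBusyError struct{ Message string }

func (e *ServiceBusyError) Error() string { return e.Message }

// ErrRetryableAfter returns the minimum delay to wait before retrying err.
// Because server-side RPS quotas are computed per second, service-busy
// errors get at least a 1-second delay; other errors can retry immediately.
// Note this is only a minimum; longer delays are equally valid.
func ErrRetryableAfter(err error) (retryAfter time.Duration) {
	var busy *ServiceBusyError
	if errors.As(err, &busy) {
		return time.Second
	}
	return 0
}
```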
nice!
Merging, will try to follow up this week with a cleanup (if feasible, given the custom behavior I remember... I suspect it won't be, but worth checking on anyway).
Builds on #1167, but adds delay before retrying service-busy errors. For now, since our server-side RPS quotas are calculated per second, this delays at least 1 second per service-busy error.

This is in contrast to the previous behavior, which would have retried up to about a dozen times in the same period, which is the cause of service-busy-based retry storms that cause lots more service-busy errors.

This also gives us an easy way to make use of "retry after" information in errors we return to the caller, though currently our errors do not contain that. Eventually this should probably come from the server, which has a global view of how many requests this service has sent, and can provide a more precise delay to individual callers.

E.g. currently our server-side ratelimiter works in 1-second slices... but that isn't something that's guaranteed to stay true. The server could also detect truly large floods of requests, and return jittered values larger than 1 second to more powerfully stop the storm, or to allow prioritizing some requests (like activity responses) over others simply by returning a lower delay.
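A sketch of how a retry loop could enforce that minimum delay, assuming hypothetical `operation`, `nextBackoff`, and `errRetryableAfter` hooks rather than the client's real API:

```go
package retryexample

import (
	"context"
	"time"
)

// retryWithMinimumDelay shows how a retry loop can combine the backoff
// policy's computed delay with a per-error minimum, so a service-busy error
// is never retried within the same one-second quota window. It retries until
// the operation succeeds or the context is cancelled; retryability and
// expiration checks are omitted for brevity.
func retryWithMinimumDelay(
	ctx context.Context,
	operation func() error,
	nextBackoff func(attempt int) time.Duration,
	errRetryableAfter func(error) time.Duration,
) error {
	for attempt := 0; ; attempt++ {
		err := operation()
		if err == nil {
			return nil
		}

		delay := nextBackoff(attempt)
		if min := errRetryableAfter(err); delay < min {
			delay = min // e.g. at least 1s after a service-busy error
		}

		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

If the server later starts returning explicit "retry after" hints, only the `errRetryableAfter` hook would need to change; the loop structure stays the same.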