Retry service-busy errors after a delay #1174
Conversation
Pull Request Test Coverage Report for Build 0183905f-6dc6-40b7-861c-6866b8ff66be
💛 - Coveralls
#1167: Part 1 of 2 for solving retry storms, particularly around incorrectly-categorized errors (e.g. limit exceeded) and service-busy. This PR moves us to `errors.As` to support wrapped errors in the future, and re-categorizes some incorrectly-retried errors. This is both useful on its own, and it makes #1174 a smaller and clearer change. Service-busy behavior is actually changed in #1174; this commit intentionally maintains its current (flawed) behavior.
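As a rough illustration of the `errors.As` pattern described above (not the client's actual code), the sketch below uses placeholder error types and a hypothetical `isRetryableError` helper; unlike a direct type assertion, `errors.As` still matches when the error has been wrapped:

```go
package main

import (
	"errors"
	"fmt"
)

// ServiceBusyError and LimitExceededError are placeholder types standing in
// for the client's generated error types; the names here are illustrative.
type ServiceBusyError struct{ Message string }

func (e *ServiceBusyError) Error() string { return e.Message }

type LimitExceededError struct{ Message string }

func (e *LimitExceededError) Error() string { return e.Message }

// isRetryableError is a hypothetical helper showing the errors.As pattern.
func isRetryableError(err error) bool {
	var busy *ServiceBusyError
	if errors.As(err, &busy) {
		// Retryable, though #1174 additionally delays these retries.
		return true
	}
	var limit *LimitExceededError
	if errors.As(err, &limit) {
		// Previously mis-categorized as retryable; not transient.
		return false
	}
	return false
}

func main() {
	wrapped := fmt.Errorf("request failed: %w", &ServiceBusyError{Message: "busy"})
	fmt.Println(isRetryableError(wrapped)) // true, even though the error is wrapped
}
```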
@@ -103,16 +106,40 @@ Retry_Loop:
			return err
		}

		// Check if the error is retryable
		if isRetryable != nil && !isRetryable(err) {
`isRetryable == nil` was only true in tests, so that was changed, and now it's just assumed to exist in all cases.
//
// note that this is only a minimum, however. longer delays are assumed to
// be equally valid.
func ErrRetryableAfter(err error) (retryAfter time.Duration) {
Decided to move it here because it's tightly related to retry logic, and one thing needs it externally, so it's exposed. And I kinda like the `backoff.ErrRetryableAfter` package/name pairing; it keeps it clear that it's retry-backoff-related.
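A minimal sketch of what such a helper could look like, assuming a placeholder `ServiceBusyError` type; the real implementation in the `backoff` package may differ:

```go
package backoff

import (
	"errors"
	"time"
)

// ServiceBusyError stands in for the client's service-busy error type;
// the name and shape here are illustrative.
type ServiceBusyError struct{ Message string }

func (e *ServiceBusyError) Error() string { return e.Message }

// ErrRetryableAfter returns the minimum delay to wait before retrying err.
// Because server-side RPS quotas are computed per second, service-busy
// errors get at least a 1-second delay; other errors can retry immediately.
// Note this is only a minimum; longer delays are equally valid.
func ErrRetryableAfter(err error) (retryAfter time.Duration) {
	var busy *ServiceBusyError
	if errors.As(err, &busy) {
		return time.Second
	}
	return 0
}
```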
nice!
Merging, will try to follow up this week with a cleanup (if feasible, given the custom behavior I remember... I suspect it won't be, but worth checking on anyway).
Builds on #1167, but adds delay before retrying service-busy errors. For now, since our server-side RPS quotas are calculated per second, this delays at least 1 second per service-busy error.

This is in contrast to the previous behavior, which would have retried up to about a dozen times in the same period, which is the cause of service-busy-based retry storms that cause lots more service-busy errors.

This also gives us an easy way to make use of "retry after" information in errors we return to the caller, though currently our errors do not contain that. Eventually this should probably come from the server, which has a global view of how many requests this service has sent, and can provide a more precise delay to individual callers.

E.g. currently our server-side ratelimiter works in 1-second slices... but that isn't something that's guaranteed to stay true. The server could also detect truly large floods of requests, and return jittered values larger than 1 second to more powerfully stop the storm, or to allow prioritizing some requests (like activity responses) over others simply by returning a lower delay.
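A sketch of how a retry loop could enforce that minimum delay, assuming hypothetical `operation`, `nextBackoff`, and `errRetryableAfter` hooks rather than the client's real API:

```go
package retryexample

import (
	"context"
	"time"
)

// retryWithMinimumDelay shows how a retry loop can combine the backoff
// policy's computed delay with a per-error minimum, so a service-busy error
// is never retried within the same one-second quota window. It retries until
// the operation succeeds or the context is cancelled; retryability and
// expiration checks are omitted for brevity.
func retryWithMinimumDelay(
	ctx context.Context,
	operation func() error,
	nextBackoff func(attempt int) time.Duration,
	errRetryableAfter func(error) time.Duration,
) error {
	for attempt := 0; ; attempt++ {
		err := operation()
		if err == nil {
			return nil
		}

		delay := nextBackoff(attempt)
		if min := errRetryableAfter(err); delay < min {
			delay = min // e.g. at least 1s after a service-busy error
		}

		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

If the server later starts returning explicit "retry after" hints, only the `errRetryableAfter` hook would need to change; the loop structure stays the same.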