Call renew slots before final retry #1443
Conversation
@ljrmorgan thanks for the change. It might take some time until I can review. @HeartSaVioR @Spikhalskiy, can you help with this please?
@ljrmorgan
Thanks for the patch. The code looks great except for a style issue. Please remove the author tag.
@Spikhalskiy Could you review this? If you're not available I'll just merge this in.
import static org.mockito.Mockito.when;

/**
 * @author Louis Morgan
Please remove the author tag.
I'm not sure that's the reason. OK, we initiate a cluster renew after the first bunch of requests, and this first bunch would fail, but we did the renew, and at some point all new requests should get a fresh state. So the original code shouldn't stay stuck in the old state indefinitely. Or I don't understand the semantics of this issue.
I personally don't like this change. It would mean that we never actually schedule a slot renew with maxAttempts=2, and only with that value; we continue to schedule it with other values. As a result, Jedis behaves quite differently with maxAttempts=2 versus maxAttempts>2, which could be misleading and unexpected.
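To make the asymmetry described here concrete, below is a tiny self-contained illustration. The renew rule it encodes (renew only before the final retry, and never after just a single failure) is an assumption taken from this discussion, not the actual PR logic, and RenewScheduleDemo and renewEverFires are made-up names.

```java
// Illustration of the maxAttempts=2 vs maxAttempts>2 asymmetry being
// discussed. Under a rule of "renew the slot cache only before the final
// retry, and never after a single failure", maxAttempts = 2 never
// refreshes the cache while every larger value does.
public class RenewScheduleDemo {

  // true if a slot-cache renew would fire at some point for this setting
  static boolean renewEverFires(int maxAttempts) {
    for (int attemptsLeft = maxAttempts; attemptsLeft > 1; attemptsLeft--) {
      boolean beforeFinalRetry = (attemptsLeft == 2);
      boolean afterSingleFailure = (attemptsLeft == maxAttempts);
      if (beforeFinalRetry && !afterSingleFailure) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    for (int maxAttempts = 1; maxAttempts <= 5; maxAttempts++) {
      System.out.println("maxAttempts=" + maxAttempts
          + " -> renew fires: " + renewEverFires(maxAttempts));
    }
    // prints false for maxAttempts 1 and 2, true for 3, 4 and 5
  }
}
```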
Thanks for looking at this @HeartSaVioR and @Spikhalskiy, and apologies for the late response. I've been looking through our Splunk logs from when we hit this, and I think my understanding of the issue was a bit off before. We saw several hundred JedisConnectionExceptions each time a master node was lost, during a fairly quiet time for us traffic-wise. We would see these exceptions over a period of a minute or two after each master loss, so it does seem like Jedis recovers. I say "seem" there because we're using Jedis via Spring Data Redis, and I'm not sure whether the exceptions stop because Jedis recovers itself or because Spring Data Redis is doing something to recover. I'm not particularly familiar with the internals of Spring Data Redis, unfortunately.

I think there might still be some value in this change. In particular, calls that would otherwise fail with a JedisConnectionException could instead recover, which would have avoided the few hundred errors that we saw on each master loss. That feels more robust to me, and we'd call

I agree that the way I'm handling the

Does that seem like a reasonable approach? Please let me know and I can make the corresponding change to my PR. Thanks!
Incidentally, immediately after this we also saw a lot of calls to Spring Data Redis'
Notifying @mp911de as he has experience with spring-data-redis.
@marcosnils thanks, I'll try to reproduce on my end and isolate the hang; that should help determine whether it's the Jedis issue I linked to or a Spring Data Redis issue (or an issue in our code).
Spring Data Redis isn't doing much in terms of recovery. Certain commands perform a topology lookup, but the majority of commands simply use
@marcosnils @mp911de thanks guys, the hang seems to be #1158, so it's not a Spring Data Redis issue. I'll add a comment and stack trace to that issue; I don't want to derail this one too much. I still think calling
Since #2358, there will be more
We lost one of our cluster's master nodes and started seeing a lot of JedisConnectionExceptions being thrown. The cluster itself was working, but Jedis seemed to be hanging on to the old state. Searching the issues we found #1439, which seems to point out the problem: the call to cluster slots to refresh the state of the cluster happens after the final retry, so it has no effect.

With this change we call renewSlotCache() before the final retry. As before, renewSlotCache() will never be called if maxAttempts is two, since that would mean calling cluster slots after a single failure.

I've added a failing (well, now passing!) unit test to validate the fix. I'm using Mockito in that test, which isn't used elsewhere in the project, so I added it as a separate commit to make it easier to remove if you don't want to add a dependency just for this bug fix.
Fixes #1439
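For readers skimming the thread, here is a minimal, self-contained sketch of the behaviour the PR description outlines: renew the slot cache before the final retry, but never after only a single failure. It is an illustration under assumptions, not the actual JedisClusterCommand code or the exact diff in this PR; SlotConnectionProvider, getConnectionForKey and RetryingClusterCommand are made-up names.

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;

// Hypothetical stand-in for Jedis' internal connection handler; the real
// class, method names and visibility in Jedis differ.
interface SlotConnectionProvider {
  Jedis getConnectionForKey(String key);

  void renewSlotCache();
}

// Minimal sketch of the retry shape this PR describes: refresh the slot
// cache before the final retry (instead of after it), but never after only
// a single failure. NOT the actual JedisClusterCommand implementation.
abstract class RetryingClusterCommand<T> {

  private final SlotConnectionProvider provider;
  private final int maxAttempts;

  RetryingClusterCommand(SlotConnectionProvider provider, int maxAttempts) {
    this.provider = provider;
    this.maxAttempts = maxAttempts;
  }

  /** The actual Redis call, e.g. connection.get(key). */
  abstract T execute(Jedis connection, String key);

  T run(String key) {
    return runWithRetries(key, maxAttempts);
  }

  private T runWithRetries(String key, int attemptsLeft) {
    try (Jedis connection = provider.getConnectionForKey(key)) {
      return execute(connection, key);
    } catch (JedisConnectionException jce) {
      // Renew before the final retry so the last attempt can see the newly
      // promoted master; skip the renew when this is only the first
      // failure, which is all you ever get with maxAttempts = 2.
      if (attemptsLeft == 2 && attemptsLeft < maxAttempts) {
        provider.renewSlotCache();
      }
      if (attemptsLeft <= 1) {
        throw jce; // out of attempts: surface the original connection failure
      }
      return runWithRetries(key, attemptsLeft - 1);
    }
  }
}
```

Renewing one step before the last attempt gives the final retry a chance to land on the newly promoted master, while the attemptsLeft < maxAttempts guard avoids hitting cluster slots after every single timed-out request when maxAttempts is small.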