
Conversation

@ljrmorgan

We lost one of our cluster's master nodes and started seeing a lot of JedisConnectionExceptions being thrown. The cluster itself was working, but Jedis seemed to be hanging on to the old state. Searching the issues we found #1439, which seems to point out the problem: the cluster slots call that refreshes the cluster state happens after the final retry, so it has no effect.

With this change we call renewSlotCache() before the final retry. As before, renewSlotCache() will never be called if maxAttempts is two, since that would mean calling cluster slots after a single failure.
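To make the change concrete, here is roughly where the renew moves. This is a simplified sketch rather than the actual JedisClusterCommand code; execute and renewSlotCache below just stand in for the real Jedis internals (running the command and calling cluster slots).

import java.util.function.Supplier;

// Simplified sketch of the retry loop this change touches; not the actual Jedis source.
public class RenewBeforeFinalRetrySketch {

  static <T> T runWithRetries(int maxAttempts, Supplier<T> execute, Runnable renewSlotCache) {
    RuntimeException lastError = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return execute.get();
      } catch (RuntimeException e) {
        lastError = e;
        // Previously the slot cache was only renewed after the last attempt had
        // already failed, so the failing call never saw the refreshed topology.
        // Here the renew happens just before the final retry instead, and is
        // skipped when maxAttempts is two so that a single failure doesn't
        // trigger a cluster slots call.
        if (maxAttempts > 2 && attempt == maxAttempts - 1) {
          renewSlotCache.run();
        }
      }
    }
    throw lastError;
  }
}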

I've added a failing (well, now passing!) unit test to validate the fix. I'm using Mockito in that test, which isn't used elsewhere in the project, so I added it in a separate commit to make it easier to remove if you don't want to add a dependency just for this bug fix.

Fixes #1439

@marcosnils
Contributor

@ljrmorgan thx for the change. It might take some time until I can review. @HeartSaVioR @Spikhalskiy can you help with this please?

@HeartSaVioR
Contributor

@ljrmorgan
Thanks for the patch. The code looks great except for a style issue: please remove the author tag.

@Spikhalskiy Could you review this? If you're not available I'll just merge this in.

import static org.mockito.Mockito.when;

/**
* @author Louis Morgan

Please remove the author tag.

@Spikhalskiy
Contributor

Spikhalskiy commented Dec 18, 2016

the call to cluster slots to refresh the state of the cluster is called after the final retry, so has no effect.

I'm not sure that's the reason. Yes, we initiate the cluster renew only after the first batch of requests, and that first batch would fail, but once the renew has happened all new requests should get a fresh state. So the original code shouldn't get stuck in the old state indefinitely. Or maybe I don't understand the semantics of this issue.

if maxAttempts is two, since that would mean calling cluster slots after a single failure.

I personally don't like this change. It would mean that we never schedule the slots renew when maxAttempts=2, and only for that value; we continue to schedule it for all other values. As a result, we get quite different Jedis behavior for maxAttempts=2 versus maxAttempts>2, which could be misleading and unexpected.

@ljrmorgan
Author

Thanks for looking at this @HeartSaVioR and @Spikhalskiy, and apologies for the late response.

I've been looking through our splunk logs from when we hit this, and I think my understanding of the issue was a bit off before.

We saw several hundred JedisConnectionExceptions each time a master node was lost, during a fairly quiet time for us traffic-wise. We would see these exceptions over a period of a minute or two after each master loss, so it does seem like Jedis recovers. I say seem because we're using Jedis via Spring Data Redis, and I'm not sure whether the exceptions stop because Jedis recovers itself or because Spring Data Redis is doing something to recover. I'm not particularly familiar with the internals of Spring Data Redis, unfortunately.

I think there might still be some value in this change. In particular, calls that would otherwise fail with a JedisConnectionException could instead recover, which would have avoided the few hundred errors that we saw on each master loss. That feels more robust to me, and we'd call cluster slots anyway, so we aren't making additional Redis calls. Perhaps just letting the call fail is a less surprising thing to do though?

I agree that the way I'm handling the maxAttempts=2 case is incorrect. I hadn't appreciated that Jedis recovers by calling refresh after the final attempt, and I think I misunderstood the comment about calling cluster slots too aggressively. I'd propose (roughly sketched after the list below):

  • Calling cluster slots before the final retry if maxAttempts > 2
  • Calling cluster slots after the final retry otherwise, to avoid calling it after a single connection error when maxAttempts == 2.
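To make that concrete, the decision would look roughly like this; shouldRenewSlotCache is a hypothetical helper for illustration, not the actual Jedis code.

// Hypothetical helper illustrating the proposal above; not the actual Jedis code.
// Returns true when a failed attempt should trigger a slot cache refresh.
static boolean shouldRenewSlotCache(int attempt, int maxAttempts) {
  if (maxAttempts > 2) {
    return attempt == maxAttempts - 1; // refresh before the final retry so it can use fresh slots
  }
  return attempt == maxAttempts; // maxAttempts == 2: keep refreshing only after the final failure
}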

Does that seem like a reasonable approach? Please let me know and I can make the corresponding change to my PR.

Thanks!

@ljrmorgan
Author

Incidentally, immediately after this we also saw a lot of calls to Spring Data Redis' RedisCache::get() which never returned. Unfortunately we had to kill the machines before we could investigate further, and I haven't been able to reproduce it since, so I'm not sure whether it was caused or aggravated by these exceptions. Looking through your issues it sounds like it could be #1158 that caused this, so I'll try setting setMaxWaitMillis on our JedisPoolConfig, which we're currently just default constructing. This subsequent issue was the real show stopper for us, rather than the JedisConnectionExceptions themselves.
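For reference, the change I'm planning to try looks roughly like this; the timeout value is just an illustrative example, not something recommended in this thread.

import redis.clients.jedis.JedisPoolConfig;

public class PoolConfigSketch {
  public static void main(String[] args) {
    // A default-constructed config blocks indefinitely when the pool is exhausted;
    // bounding the wait should at least turn a hang into a fast failure.
    JedisPoolConfig poolConfig = new JedisPoolConfig();
    poolConfig.setMaxWaitMillis(2000); // example value: give up after 2 seconds
  }
}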

@marcosnils
Contributor

notifying @mp911de as he has experience in spring-data-redis

@ljrmorgan
Author

@marcosnils thanks, I'll try to reproduce on my end and isolate the hang; that should help determine whether it's the Jedis issue I linked to, a Spring Data Redis issue, or an issue in our code.

@mp911de
Contributor

mp911de commented Jan 6, 2017

Spring Data Redis isn't doing much in terms of recovery. Certain commands perform a topology lookup, but the majority of commands simply use JedisCluster.

@ljrmorgan
Author

@marcosnils @mp911de thanks guys, the hang seems to be #1158, so it's not a Spring Data Redis issue. I'll add a comment and stack trace to that issue; I don't want to derail this one too much.

I still think calling cluster slots before the final retry rather than after is worth doing, to avoid some JedisConnectionExceptions. What do you think?

@sazzad16
Contributor

sazzad16 commented Apr 7, 2021

Since #2358, there will be more renewSlotCache() calls internally. IMO, we can close this in favor of that.

@sazzad16 sazzad16 closed this Apr 7, 2021
@sazzad16 sazzad16 added this to the 3.6.0 milestone Apr 7, 2021