
Conversation

@ljrmorgan

We lost one of our cluster's master nodes and started seeing a lot of JedisConnectionExceptions being thrown. The cluster itself was working, but Jedis seemed to be hanging on to the old state. Searching the issues we found #1439, which seems to point out the problem: the cluster slots call that refreshes the cluster state happens after the final retry, so it has no effect.

With this change we call renewSlotCache() before the final retry. As before, renewSlotCache() will never be called if maxAttempts is two, since that would mean calling cluster slots after a single failure.
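To make the change concrete, here is roughly where the renew moves. This is a simplified sketch rather than the actual JedisClusterCommand code; execute and renewSlotCache below just stand in for the real Jedis internals (running the command and calling cluster slots).

import java.util.function.Supplier;

// Simplified sketch of the retry loop this change touches; not the actual Jedis source.
public class RenewBeforeFinalRetrySketch {

  static <T> T runWithRetries(int maxAttempts, Supplier<T> execute, Runnable renewSlotCache) {
    RuntimeException lastError = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return execute.get();
      } catch (RuntimeException e) {
        lastError = e;
        // Previously the slot cache was only renewed after the last attempt had
        // already failed, so the failing call never saw the refreshed topology.
        // Here the renew happens just before the final retry instead, and is
        // skipped when maxAttempts is two so that a single failure doesn't
        // trigger a cluster slots call.
        if (maxAttempts > 2 && attempt == maxAttempts - 1) {
          renewSlotCache.run();
        }
      }
    }
    throw lastError;
  }
}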

I've added a failing (well, now passing!) unit test to validate the fix. I'm using Mockito in that test, which isn't used elsewhere in the project, so I added it in a separate commit to make it easier to remove if you don't want to add a dependency just for this bug fix.

Fixes #1439

@marcosnils
Contributor

@ljrmorgan thx for the change. It might take some time until I can review. @HeartSaVioR @Spikhalskiy can you help with this please?

@HeartSaVioR
Contributor

@ljrmorgan
Thanks for the patch. The code looks great except for a style issue: please remove the author tag.

@Spikhalskiy Could you review this? If you're not available I'll just merge this in.

import static org.mockito.Mockito.when;

/**
* @author Louis Morgan

Please remove the author tag.

@Spikhalskiy
Contributor

Spikhalskiy commented Dec 18, 2016

the call to cluster slots to refresh the state of the cluster is called after the final retry, so has no effect.

I'm not sure that's the reason. Yes, we initiate the cluster renew only after the first batch of requests, and that first batch would fail, but once the renew has happened all new requests should get a fresh state. So the original code shouldn't get stuck in the old state indefinitely. Or maybe I don't understand the semantics of this issue.

if maxAttempts is two, since that would mean calling cluster slots after a single failure.

I personally don't like this change. It would mean that we never schedule the slots renew when maxAttempts=2, and only for that value; we continue to schedule it for all other values. As a result, we get quite different Jedis behavior for maxAttempts=2 versus maxAttempts>2, which could be misleading and unexpected.

@ljrmorgan
Author

Thanks for looking at this @HeartSaVioR and @Spikhalskiy, and apologies for the late response.

I've been looking through our splunk logs from when we hit this, and I think my understanding of the issue was a bit off before.

We saw several hundred JedisConnectionExceptions each time a master node was lost, during a fairly quiet time for us traffic-wise. We would see these exceptions over a period of a minute or two after each master loss, so it does seem like Jedis recovers. I say seem because we're using Jedis via Spring Data Redis, and I'm not sure whether the exceptions stop because Jedis recovers itself or because Spring Data Redis is doing something to recover. I'm not particularly familiar with the internals of Spring Data Redis, unfortunately.

I think there might still be some value in this change. In particular, calls that would otherwise fail with a JedisConnectionException could instead recover, which would have avoided the few hundred errors that we saw on each master loss. That feels more robust to me, and we'd call cluster slots anyway, so we aren't making additional Redis calls. Perhaps just letting the call fail is a less surprising thing to do though?

I agree that the way I'm handling the maxAttempts=2 case is incorrect. I hadn't appreciated that Jedis recovers by calling refresh after the final attempt, and I think I misunderstood the comment about calling cluster slots too aggressively. I'd propose (roughly sketched after the list below):

  • Calling cluster slots before the final retry if maxAttempts > 2
  • Calling cluster slots after the final retry otherwise, to avoid calling it after a single connection error when maxAttempts == 2.
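To make that concrete, the decision would look roughly like this; shouldRenewSlotCache is a hypothetical helper for illustration, not the actual Jedis code.

// Hypothetical helper illustrating the proposal above; not the actual Jedis code.
// Returns true when a failed attempt should trigger a slot cache refresh.
static boolean shouldRenewSlotCache(int attempt, int maxAttempts) {
  if (maxAttempts > 2) {
    return attempt == maxAttempts - 1; // refresh before the final retry so it can use fresh slots
  }
  return attempt == maxAttempts; // maxAttempts == 2: keep refreshing only after the final failure
}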

Does that seem like a reasonable approach? Please let me know and I can make the corresponding change to my PR.

Thanks!

@ljrmorgan
Author

Incidentally, immediately after this we also saw a lot of calls to Spring Data Redis' RedisCache::get() which never returned. Unfortunately we had to kill the machines before we could investigate further, and I haven't been able to reproduce it since, so I'm not sure whether it was caused or aggravated by these exceptions. Looking through your issues it sounds like it could be #1158 that caused this, so I'll try setting setMaxWaitMillis on our JedisPoolConfig, which we're currently just default constructing. This subsequent issue was the real show stopper for us, rather than the JedisConnectionExceptions themselves.
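For reference, the change I'm planning to try looks roughly like this; the timeout value is just an illustrative example, not something recommended in this thread.

import redis.clients.jedis.JedisPoolConfig;

public class PoolConfigSketch {
  public static void main(String[] args) {
    // A default-constructed config blocks indefinitely when the pool is exhausted;
    // bounding the wait should at least turn a hang into a fast failure.
    JedisPoolConfig poolConfig = new JedisPoolConfig();
    poolConfig.setMaxWaitMillis(2000); // example value: give up after 2 seconds
  }
}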

@marcosnils
Contributor

notifying @mp911de as he has experience in spring-data-redis

@ljrmorgan
Author

@marcosnils thanks, I'll try to reproduce on my end and isolate the hang; that should help determine whether it's the Jedis issue I linked to, a Spring Data Redis issue, or an issue in our code.

@mp911de
Contributor

mp911de commented Jan 6, 2017

Spring Data Redis isn't doing much in terms of recovery. Certain commands perform a topology lookup, but the majority of commands simply use JedisCluster.

@ljrmorgan
Author

@marcosnils @mp911de thanks guys, the hang seems to be #1158, so it's not a Spring Data Redis issue. I'll add a comment and stack trace to that issue; I don't want to derail this one too much.

I still think calling cluster slots before the final retry rather than after is worth doing, to avoid some JedisConnectionExceptions. What do you think?

@sazzad16
Contributor

sazzad16 commented Apr 7, 2021

Since #2358, there will be more renewSlotCache() calls internally. IMO, we can close this in favor of that.

@sazzad16 sazzad16 closed this Apr 7, 2021
@sazzad16 sazzad16 added this to the 3.6.0 milestone Apr 7, 2021