Description
This is more a question and discussion than an issue!
Observed behavior
I am running my JedisCluster test client against my Redis Cluster. The connection uses the default settings (5 maxAttempts, 2000 ms timeouts).
JedisCluster is passed the three known masters for the initial connection and correctly determines the complete cluster topology (six nodes: 3 masters and 3 slaves).
For my test, I am generating 1,000,000 random KV-pairs. The keys are the numbers 1 to 1,000,000 converted to Strings, and the values are random strings with a length of 100 characters.
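For reference, the client is created roughly like this (a simplified sketch; the host names are placeholders for my actual Swarm service names):

```java
import java.util.HashSet;
import java.util.Set;

import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class ClusterClientFactory {

    // Seed JedisCluster with the three known masters; it discovers the rest of
    // the topology (including the slaves) on its own.
    public static JedisCluster create() {
        Set<HostAndPort> seedNodes = new HashSet<>();
        seedNodes.add(new HostAndPort("redis-node-1", 6379)); // placeholder host names
        seedNodes.add(new HostAndPort("redis-node-2", 6379));
        seedNodes.add(new HostAndPort("redis-node-3", 6379));

        // Nodes-only constructor: the defaults of 2000 ms timeout and 5 maxAttempts apply.
        return new JedisCluster(seedNodes);
    }
}
```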
I then write those 1,000,000 KV-pairs to my Redis cluster.
In the second round, I read back those 1,000,000 KV-pairs and then repeat that read cycle a second time.
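The load-generation part of the test client is essentially the following (simplified sketch of my test loop):

```java
import java.util.concurrent.ThreadLocalRandom;

import redis.clients.jedis.JedisCluster;

public class ClusterLoadTest {

    private static final int KEY_COUNT = 1_000_000;
    private static final int VALUE_LENGTH = 100;

    // Random lower-case string of VALUE_LENGTH characters.
    private static String randomValue() {
        StringBuilder sb = new StringBuilder(VALUE_LENGTH);
        for (int i = 0; i < VALUE_LENGTH; i++) {
            sb.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return sb.toString();
    }

    public static void run(JedisCluster jedisCluster) {
        // Write phase: keys "1" .. "1000000" with random 100-character values.
        for (int i = 1; i <= KEY_COUNT; i++) {
            jedisCluster.set(String.valueOf(i), randomValue());
        }
        // Two read phases over the same key range.
        for (int round = 0; round < 2; round++) {
            for (int i = 1; i <= KEY_COUNT; i++) {
                jedisCluster.get(String.valueOf(i));
            }
        }
    }
}
```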
When I fail the cluster during the write or read phase by "killing" one of the Docker containers hosting a master node, I see that the Redis cluster goes into the failed state. After about 700 ms the corresponding slave gets elected as the new master, and the cluster state changes back to "OK" after about 900 ms.
During the failed state, my jedisCluster.set() or jedisCluster.get() operations throw a JedisException (a JedisConnectionException, to be exact), which I catch so that my application does not fail. I then go into a retry loop, retrying the last set() or get() operation until it succeeds (or until I reach the maximum number of retries, which causes my test client to abort), roughly as sketched below.
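```java
import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.exceptions.JedisConnectionException;

public class RetryingWriter {

    // Retry a single SET until it succeeds or the retry budget is exhausted.
    // (Sketch only: the helper name and back-off value are illustrative;
    // my real client does the same for get().)
    public static void setWithRetry(JedisCluster jedisCluster, String key, String value,
                                    int maxRetries, long backoffMillis) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                jedisCluster.set(key, value);
                return; // success
            } catch (JedisConnectionException e) { // cluster is in failed state / failing over
                if (attempt == maxRetries) {
                    throw e; // give up and let the test client abort
                }
                try {
                    Thread.sleep(backoffMillis); // brief pause before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```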
With the default "maxAttempts" setting, the set() or get() operations resume successfully after about 2,000 - 3,000 ms.
Here is my question: why does Jedis take at least twice as long to detect that the cluster state is OK again? Shouldn't this happen immediately after the approx. 900 ms, once the cluster state changes back to "OK"?
If I restart the failed node immediately after the failure (while my client is still reporting failures), the set() / get() operations resume immediately. Is this because the Redis Cluster has not yet elected the failed node's slave as the new master, so the restarted node resumes as master?
What I would like to observe
Ideally, the fail-over and the resumption of operations would happen transparently to the client. In other words, my application using Jedis should never see a JedisException before a configured timeout has expired. JedisCluster should silently retry the current operation (set(), get(), etc.) until it either succeeds or fails "permanently" (i.e. the configured timeout is reached).
But, maybe that is what it can do already and I simply have not configured my JedisCluster correctly.
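If the answer is simply "tune the constructor arguments", I assume it would look something like the sketch below, using the Jedis 2.8 constructor that takes a connection timeout, socket timeout, maxAttempts/maxRedirections and a pool config (the concrete values are made up):

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class PatientClusterClientFactory {

    // A larger maxAttempts value should let JedisCluster keep retrying internally
    // for longer before an exception reaches the application code.
    public static JedisCluster create() {
        Set<HostAndPort> seedNodes = new HashSet<>();
        seedNodes.add(new HostAndPort("redis-node-1", 6379)); // placeholder host name

        int connectionTimeout = 2000; // ms, same as the default
        int soTimeout = 2000;         // ms, same as the default
        int maxAttempts = 20;         // default is 5

        return new JedisCluster(seedNodes, connectionTimeout, soTimeout, maxAttempts,
                new GenericObjectPoolConfig());
    }
}
```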
Steps to reproduce:
- start your redis cluster
- start the Jedis test client
- kill one of the Docker containers running a Redis master
Redis / Jedis Configuration
Redis Cluster with 6 nodes: 3 masters and 3 slaves
Redis Cluster is running inside Docker containers in Docker Swarm. The Jedis client program is also running in a Docker container inside the same Swarm cluster.
The Jedis client is a simple Spring Boot application.
Jedis version:
2.8.2
Redis version:
3.2.3
Java version:
Java 1.8.0_101 (Oracle)