
Conversation

@liujg commented Nov 22, 2015

There is a bug under the following conditions:
The Redis cluster contains three masters and three slaves, and the Jedis client uses JedisCluster with the default GenericObjectPoolConfig. This means the JedisPool blocks waiting for idle objects once the default of eight concurrent borrowers is exceeded. If a master then goes down and its slave is promoted in its place, the blocked threads keep waiting for a notification tied to the dead master's pool, which never arrives, so they stay blocked even after the new master takes over.
What this PR does: when discoverClusterSlots runs, it closes the Jedis pool of the failed node, so the blocked threads are released.
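
For illustration, here is a minimal sketch of that idea. SlotCacheSketch and its members are hypothetical, simplified names; the actual change in this PR targets JedisClusterInfoCache and may differ in detail. Closing a pool destroys it, and commons-pool then fails threads parked in its borrowObject() with an exception, so callers can retry against the refreshed topology.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Hedged sketch only: close the pools of nodes that vanished from the
// cluster topology during slot rediscovery, so that threads parked in
// borrowObject() fail fast instead of waiting forever.
class SlotCacheSketch {
    private final Map<String, JedisPool> nodes = new HashMap<String, JedisPool>();

    void discoverClusterSlots(Jedis jedis) {
        Set<String> liveNodes = new HashSet<String>();
        List<Object> slots = jedis.clusterSlots();
        // ... parse `slots`, rebuild the slot-to-pool mapping, and record
        // every reachable node as a "host:port" key in liveNodes ...

        for (Map.Entry<String, JedisPool> entry : nodes.entrySet()) {
            if (!liveNodes.contains(entry.getKey())) {
                entry.getValue().close(); // releases blocked borrowers
            }
        }
    }
}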

@liujg (Author) commented Nov 22, 2015

@marcosnils Why have all the checks failed? Is there an error on Oracle JDK 7?
I've tried many times.

@marcosnils (Contributor)

@liujg The PR seems better now. Tests sometimes fail because of timeout issues. I've re-run them, so everything looks green now. I'll look into it tomorrow and hopefully merge.

@liujg (Author) commented Nov 23, 2015

@marcosnils Great, please check this blocking bug I found. Thank you.

@liujg (Author) commented Nov 26, 2015

@marcosnils, have you checked this blocking problem?

@marcosnils (Contributor)

@liujg Replicating this scenario to test your patch takes some time. I'll try to set aside some bandwidth over the weekend for this.

@marcosnils marcosnils added this to the 2.8.1 milestone Dec 4, 2015
@liujg (Author) commented Dec 24, 2015

@marcosnils Has the 2.8.1 milestone test passed? It's been a month...

@marcosnils (Contributor)

@liujg Sorry, I couldn't merge this yet. I haven't had the time to set up the scenario to reproduce the bug and check your fix.

Can you please enumerate the necessary steps to reproduce the bug? That would allow me to move a lot faster in validating this issue.

@liujg (Author) commented Dec 31, 2015

@marcosnils
1. Create a cluster with N masters and N slaves, e.g. N = 3.
2. Create a client: initialize Jedis with new JedisCluster(nodes), passing a Set<HostAndPort> of nodes and using the default pool config.
3. Create more than 8 threads (the default pool size), e.g. 50 threads, each just doing an operation such as get(key).
4. Now, if you shut down a master, you will see some threads block forever, even after a slave replaces the master.

What this PR does: when discoverClusterSlots runs, it closes the Jedis pool of the failed node, so the blocked threads are released.

@HeartSaVioR (Contributor)

@liujg
I believe this may be related to the Jedis instance leak, which should be fixed by #1178.
I recommend you apply that patch, try to reproduce, and let us know the result.

Unfortunately, @antirez said the output of 'cluster nodes' will change frequently.
We already moved to 'cluster slots' via #1204 and don't use 'cluster nodes' anymore.

Since we can't rely on 'cluster nodes', this approach is unfortunately invalid.

If the issue still persists after #1178, we should try to find an alternative way.

@HeartSaVioR (Contributor)

@liujg @marcosnils
It's always reproducible, and not related to #1178.

Skimming the source code of GenericObjectPool, it seems that GenericObjectPool doesn't behave properly when GenericObjectPool.create(), which calls PooledObjectFactory.makeObject(), throws an exception.

Please refer to GenericObjectPool here:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.commons/commons-pool2/2.4.2/org/apache/commons/pool2/impl/GenericObjectPool.java

If we set blockWhenExhausted to true and borrowMaxWaitMillis to less than 0, the borrowing thread calls LinkedBlockingDeque.takeFirst(), which waits for another thread to signal the availability of a new idle object.
(The thread stays blocked unless it is notified or receives an InterruptedException.)

However, when we return an object to GenericObjectPool and that object can't go back into the idle object list, GenericObjectPool destroys the object and calls ensureIdle(1, false) to ensure a new object is created and added to the idle objects when at least one thread is waiting.

The problem is that when GenericObjectPool.create() throws an exception, ensureIdle() cannot add a new idle object to the idle object list, which means the blocked thread never wakes up.

In other words, PooledObjectFactory.makeObject() should guarantee that a new object is created without an exception under any circumstances, but JedisFactory.makeObject() does not.
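
To make this failure path concrete, here is a small, self-contained simulation against the commons-pool2 2.4.x API (no Jedis involved); the failing factory below is a stand-in written to mimic a JedisFactory pointed at a dead master, not the real one:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

public class PoolStarvationSketch {
    static volatile boolean nodeDown = false;

    public static void main(String[] args) throws Exception {
        GenericObjectPoolConfig config = new GenericObjectPoolConfig();
        config.setMaxTotal(1);
        config.setBlockWhenExhausted(true);
        config.setMaxWaitMillis(-1); // wait forever, like the default JedisPoolConfig

        final GenericObjectPool<Object> pool = new GenericObjectPool<Object>(
                new BasePooledObjectFactory<Object>() {
                    @Override
                    public Object create() throws Exception {
                        // Stand-in for JedisFactory.makeObject() connecting to a dead master.
                        if (nodeDown) throw new Exception("connection refused");
                        return new Object();
                    }

                    @Override
                    public PooledObject<Object> wrap(Object obj) {
                        return new DefaultPooledObject<Object>(obj);
                    }

                    @Override
                    public boolean validateObject(PooledObject<Object> p) {
                        // Returned objects become invalid once the node is down,
                        // so the pool destroys them on return.
                        return !nodeDown;
                    }
                }, config);
        pool.setTestOnReturn(true);

        Object borrowed = pool.borrowObject(); // pool is now exhausted (maxTotal = 1)

        Thread waiter = new Thread(new Runnable() {
            public void run() {
                try {
                    pool.borrowObject(); // parks in LinkedBlockingDeque.takeFirst()
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        waiter.start();
        Thread.sleep(500); // let the waiter park

        nodeDown = true;
        // The return fails validation -> destroy -> ensureIdle(1, false) -> create()
        // throws (and is swallowed) -> no idle object is added -> the waiter never wakes.
        pool.returnObject(borrowed);

        waiter.join(5000);
        System.out.println("waiter state after 5s: " + waiter.getState()); // likely WAITING
    }
}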

@HeartSaVioR (Contributor)

@marcosnils
How about this approach?

  • JedisFactory.makeObject() just creates the Jedis instance, which normally does not throw an exception.
  • When JedisFactory.activateObject() is called, let the pooled Jedis instance connect to Redis, check auth, select the db, and so on.
    • Note that activateObject() will be called multiple times, so it must be idempotent.

As a result, creating an idle object never fails, but activation may fail in the borrowing thread, which is what we want. A sketch of this split follows below.
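
Here is a minimal sketch of what that split could look like, written against the commons-pool2 PooledObjectFactory interface. LazyConnectJedisFactory and its fields are hypothetical names, not the real JedisFactory:

import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.PooledObjectFactory;
import org.apache.commons.pool2.impl.DefaultPooledObject;

import redis.clients.jedis.Jedis;

// Hedged sketch of the proposed split: makeObject() only instantiates,
// activateObject() does the connect/auth work in the borrowing thread.
class LazyConnectJedisFactory implements PooledObjectFactory<Jedis> {
    private final String host;
    private final int port;

    LazyConnectJedisFactory(String host, int port) {
        this.host = host;
        this.port = port;
    }

    @Override
    public PooledObject<Jedis> makeObject() {
        // No network I/O here, so the pool's create()/ensureIdle() path cannot fail.
        return new DefaultPooledObject<Jedis>(new Jedis(host, port));
    }

    @Override
    public void activateObject(PooledObject<Jedis> p) {
        Jedis jedis = p.getObject();
        // Idempotent: only connect if not already connected.
        if (!jedis.isConnected()) {
            jedis.connect(); // may throw JedisConnectionException in the borrowing thread
            // auth / select db would go here as well
        }
    }

    @Override
    public void passivateObject(PooledObject<Jedis> p) { /* no-op */ }

    @Override
    public boolean validateObject(PooledObject<Jedis> p) {
        return p.getObject().isConnected();
    }

    @Override
    public void destroyObject(PooledObject<Jedis> p) {
        p.getObject().close();
    }
}

With this split, a connection failure surfaces as an activation failure in the borrowing thread, which borrowObject() turns into an exception instead of an indefinite park.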

@HeartSaVioR (Contributor)

A related issue has been found: https://issues.apache.org/jira/browse/POOL-303

@HeartSaVioR (Contributor)

@marcosnils
It seems the new approach cannot resolve the blocked threads. I need to dig more.

@marcosnils (Contributor)

@HeartSaVioR Seems like this is a non-blocking issue that doesn't impact the release. Can we move it to 2.8.2 so we can release 2.8.1?

@HeartSaVioR (Contributor)

@marcosnils This issue seems major to me, so it would be better to resolve it ASAP, but I also agree that it is not a blocking issue. We can move it to 2.8.2.

@marcosnils (Contributor)

@HeartSaVioR I agree that the issue is major but non-blocking. I'll move it to 2.8.2 so we can release 2.8.1 and 2.9.0.

@liujg (Author) commented Mar 13, 2016

@HeartSaVioR
I agree with your explanation of GenericObjectPool. Until the GenericObjectPool bug is fixed, can Jedis do something to avoid this? The code I submitted is meant to avoid exactly that.
@marcosnils Do you mean 2.8.2 could fix this problem?

@marcosnils (Contributor)

@liujg @HeartSaVioR Seems like https://issues.apache.org/jira/browse/POOL-303 has already been fixed, which should fix this problem. Is there any chance you can try with the master version of commons-pool to validate it?

@Anisotrop (Contributor)

@marcosnils I have tried to look at this problem and created the following unit test:

package redis.clients.jedis.tests.cluster;

import org.junit.Test;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ClusterThreadsTest {
    @Test
    public void testMultiThreaded() throws InterruptedException {
        Set<HostAndPort> nodes = new HashSet<HostAndPort>();
        nodes.add(new HostAndPort("localhost", 30001));
        nodes.add(new HostAndPort("localhost", 30002));
        nodes.add(new HostAndPort("localhost", 30003));
        nodes.add(new HostAndPort("localhost", 30004));
        nodes.add(new HostAndPort("localhost", 30005));
        nodes.add(new HostAndPort("localhost", 30006));
        final JedisCluster cluster = new JedisCluster(nodes);
        cluster.set("a", "b");

        ExecutorService executorService = Executors.newFixedThreadPool(100);
        List<Callable<Object>> tasks = new ArrayList<Callable<Object>>(100);

        for (int i = 0; i < 200; i++) {
            Callable<Object> readTask = new Callable<Object>() {
                @Override
                public Object call() throws Exception {
                    System.out.println(cluster.get("a-" + Thread.currentThread().getName()) + "-" + Thread.currentThread().getName());
                    Thread.sleep(100);
                    cluster.set("a-" + Thread.currentThread().getName(), "b-" + Thread.currentThread().getName());
                    return null;
                }
            };
            tasks.add(readTask);
        }

        // Run the task batch repeatedly; if a master is killed meanwhile,
        // some worker threads end up blocked in the pool forever.
        CountDownLatch stop = new CountDownLatch(1000);
        while (stop.getCount() > 0L) {
            executorService.invokeAll(tasks);
            stop.countDown();
        }
        executorService.shutdown();
        cluster.close();
    }
}

As a prerequisite, a cluster should be started. The method I used was to execute:

./create-cluster create
./create-cluster start

from redis-3.2.0/utils/create-cluster with default values.

Then, in order to reproduce the problem, kill processes (with kill -9) that are part of the cluster:

21321  0.2  0.1 141028  7764 ?        Ssl  13:35   0:00 ../../src/redis-server *:30001 [cluster]
21325  0.2  0.1 141028  7776 ?        Ssl  13:35   0:00 ../../src/redis-server *:30002 [cluster]
21327  0.2  0.1 141028  7780 ?        Ssl  13:35   0:00 ../../src/redis-server *:30003 [cluster]
21331  0.5  0.2 141028  9804 ?        Ssl  13:35   0:00 ../../src/redis-server *:30004 [cluster]
21337  0.7  0.2 141028  9812 ?        Ssl  13:35   0:00 ../../src/redis-server *:30005 [cluster]
21341  0.8  0.2 141028  9812 ?        Ssl  13:35   0:00 ../../src/redis-server *:30006 [cluster]

Generally, killing the first and the fourth processes should cause the threads to block:

Name: pool-1-thread-58
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@288ff99f
Total blocked: 5  Total waited: 1,394

Stack trace: 
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

I have also checked out the latest commons-pool2 and built it (mvn clean install -DskipTests=true). When referencing it from the Jedis project, the issue still seems to reproduce.

@marcosnils (Contributor) commented Jun 1, 2016

@Anisotrop You're right. I just tested with the latest commons-pool version, and it seems the problem persists even though POOL-303 has been included; threads are still parked at borrowObject in commons-pool.

In the meantime, a workaround is to set setMaxWaitMillis in JedisPoolConfig so that commons-pool times out and retrieves a new instance. This allows Jedis to keep working while the cluster is recovering.

Here's a sample snippet to configure JedisCluster:

        Set<HostAndPort> nodes = new HashSet<HostAndPort>();
        nodes.add(new HostAndPort("localhost", 30001));
        nodes.add(new HostAndPort("localhost", 30002));
        nodes.add(new HostAndPort("localhost", 30003));
        nodes.add(new HostAndPort("localhost", 30004));
        nodes.add(new HostAndPort("localhost", 30005));
        nodes.add(new HostAndPort("localhost", 30006));
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxWaitMillis(1000); // You can tune this to your preference.
        final JedisCluster cluster = new JedisCluster(nodes, config);

I'll keep investigating the commons-pool problem and let their maintainers know about the issue. It might be the same thing @HeartSaVioR mentioned before.

@Anisotrop (Contributor)

@marcosnils In that case, could a release with some of these fixes be expected in the near future?

@marcosnils (Contributor)

@Anisotrop I don't think this PR is actually necessary. commons-pool should handle these situations natively. I'd like to investigate a bit further to see if there's something we can do at the commons-pool level to avoid unnecessary changes in Jedis.

@Anisotrop (Contributor)

@marcosnils Did you also get the chance to test with the following settings?

        JedisPoolConfig config = new JedisPoolConfig();
        config.setBlockWhenExhausted(false);

When a thread cannot obtain a connection, an exception with "Could not get a resource from the pool" is thrown. While the cluster is rebalancing, an exception with the message "Too many Cluster redirections?" seems to be thrown, but once the cluster comes back the connections appear to work again.

The list of nodes is not updated, however (or so it seems), as the killed node still does not appear as closed.

Note, please, that in the code sample I provided, the line:

 for (int i = 0; i < 200; i++) {

should be

 for (int i = 0; i < 100; i++) {

@marcosnils marcosnils modified the milestones: 2.8.3, 2.8.2 Jul 22, 2016
@ljrmorgan

@marcosnils This issue bit us pretty badly: we're using Redis 3.2.5 in cluster mode, with Jedis 2.9.0 and Spring Data Redis 1.7.5. We killed one of our master nodes, and calls to Spring Data Redis's RedisCache::get() started blocking indefinitely.

I've been able to reproduce this and capture a stack trace:

java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006879ddf18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:583)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:442)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
	at redis.clients.util.Pool.getResource(Pool.java:49)
	at redis.clients.jedis.JedisPool.getResource(JedisPool.java:226)
	at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnectionFromSlot(JedisSlotBasedConnectionHandler.java:66)
	at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:116)
	at redis.clients.jedis.JedisClusterCommand.runBinary(JedisClusterCommand.java:60)
	at redis.clients.jedis.BinaryJedisCluster.exists(BinaryJedisCluster.java:108)
	at org.springframework.data.redis.connection.jedis.JedisClusterConnection.exists(JedisClusterConnection.java:3339)
	at sun.reflect.GeneratedMethodAccessor146.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke(CloseSuppressingInvocationHandler.java:57)
	at com.sun.proxy.$Proxy132.exists(Unknown Source)
	at org.springframework.data.redis.cache.RedisCache$AbstractRedisCacheCallback.waitForLock(RedisCache.java:505)
	at org.springframework.data.redis.cache.RedisCache$AbstractRedisCacheCallback.doInRedis(RedisCache.java:468)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:204)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:166)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:154)
	at org.springframework.data.redis.cache.RedisCache.get(RedisCache.java:138)
	at org.springframework.data.redis.cache.RedisCache.get(RedisCache.java:97)
	<snip>

We seem to have 20 threads stuck in that state.

I'll try setting setMaxWaitMillis and see if I can still reproduce this. Is there anything else you'd like me to try, since I can now reproduce this fairly easily?

This feels like a pretty major issue to me: as far as I can tell, the default behaviour is to hang forever if a master goes down.

@marcosnils (Contributor)

@ljrmorgan This is not a Jedis issue but a commons-pool one. I'll try to open an issue with them to see if they can assist with this scenario.

sazzad16 added a commit to sazzad16/jedis that referenced this pull request Nov 29, 2017
sazzad16 added a commit that referenced this pull request Dec 12, 2017
sazzad16 added a commit that referenced this pull request Dec 12, 2017
sazzad16 added a commit that referenced this pull request Dec 12, 2017
joyang1 pushed a commit to joyang1/jedis that referenced this pull request Dec 27, 2018