
Conversation

@liujg commented Nov 22, 2015

There is a bug under the following conditions:
The Redis cluster contains three masters and three slaves, and the Jedis client uses JedisCluster with the default GenericObjectPoolConfig. This means the JedisPool blocks waiting for idle objects once the default of eight concurrent borrowers is exceeded. If a master then goes down and its slave is promoted in its place, the blocked threads keep waiting for a notification tied to the dead master's pool, which never arrives, so they stay blocked even after the new master takes over.
What this PR does: when discoverClusterSlots runs, it closes the Jedis pool of the failed node, so the blocked threads are released.
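
For illustration, here is a minimal sketch of that idea. SlotCacheSketch and its members are hypothetical, simplified names; the actual change in this PR targets JedisClusterInfoCache and may differ in detail. Closing a pool destroys it, and commons-pool then fails threads parked in its borrowObject() with an exception, so callers can retry against the refreshed topology.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Hedged sketch only: close the pools of nodes that vanished from the
// cluster topology during slot rediscovery, so that threads parked in
// borrowObject() fail fast instead of waiting forever.
class SlotCacheSketch {
    private final Map<String, JedisPool> nodes = new HashMap<String, JedisPool>();

    void discoverClusterSlots(Jedis jedis) {
        Set<String> liveNodes = new HashSet<String>();
        List<Object> slots = jedis.clusterSlots();
        // ... parse `slots`, rebuild the slot-to-pool mapping, and record
        // every reachable node as a "host:port" key in liveNodes ...

        for (Map.Entry<String, JedisPool> entry : nodes.entrySet()) {
            if (!liveNodes.contains(entry.getKey())) {
                entry.getValue().close(); // releases blocked borrowers
            }
        }
    }
}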

@liujg (Author) commented Nov 22, 2015

@marcosnils Why have all the checks failed? Is there an error on Oracle JDK 7?
I've tried many times.

@marcosnils (Contributor)

@liujg The PR seems better now. Tests sometimes fail because of timeout issues. I've re-run them, so everything looks green now. I'll look into it tomorrow and hopefully merge.

@liujg (Author) commented Nov 23, 2015

@marcosnils Great, please check this blocking bug I found. Thank you.

@liujg (Author) commented Nov 26, 2015

@marcosnils, have you checked this blocking problem?

@marcosnils (Contributor)

@liujg Replicating this scenario to test your patch takes some time. I'll try to set aside some bandwidth over the weekend for this.

@marcosnils marcosnils added this to the 2.8.1 milestone Dec 4, 2015
@liujg (Author) commented Dec 24, 2015

@marcosnils Has the 2.8.1 milestone test passed? It's been a month...

@marcosnils (Contributor)

@liujg Sorry, I couldn't merge this yet. I haven't had the time to set up the scenario to reproduce the bug and check your fix.

Can you please enumerate the necessary steps to reproduce the bug? That would allow me to move a lot faster in validating this issue.

@liujg (Author) commented Dec 31, 2015

@marcosnils
1. Create a cluster with N masters and N slaves, e.g. N = 3.
2. Create a client: initialize Jedis with new JedisCluster(nodes), passing a Set<HostAndPort> of nodes and using the default pool config.
3. Create more than 8 threads (the default pool size), e.g. 50 threads, each just doing an operation such as get(key).
4. Now, if you shut down a master, you will see some threads block forever, even after a slave replaces the master.

What this PR does: when discoverClusterSlots runs, it closes the Jedis pool of the failed node, so the blocked threads are released.

@HeartSaVioR (Contributor)

@liujg
I believe this may be related to the Jedis instance leak, which should be fixed by #1178.
I recommend you apply that patch, try to reproduce, and let us know the result.

Unfortunately, @antirez said the output of 'cluster nodes' will change frequently.
We already moved to 'cluster slots' via #1204 and don't use 'cluster nodes' anymore.

Since we can't rely on 'cluster nodes', this approach is unfortunately invalid.

If the issue still persists after #1178, we should try to find an alternative way.

@HeartSaVioR (Contributor)

@liujg @marcosnils
It's always reproducible, and not related to #1178.

Skimming the source code of GenericObjectPool, it seems that GenericObjectPool doesn't behave properly when GenericObjectPool.create(), which calls PooledObjectFactory.makeObject(), throws an exception.

Please refer to GenericObjectPool here:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.commons/commons-pool2/2.4.2/org/apache/commons/pool2/impl/GenericObjectPool.java

If we set blockWhenExhausted to true and borrowMaxWaitMillis to less than 0, the borrowing thread calls LinkedBlockingDeque.takeFirst(), which waits for another thread to signal the availability of a new idle object.
(The thread stays blocked unless it is notified or receives an InterruptedException.)

However, when we return an object to GenericObjectPool and that object can't go back into the idle object list, GenericObjectPool destroys the object and calls ensureIdle(1, false) to ensure a new object is created and added to the idle objects when at least one thread is waiting.

The problem is that when GenericObjectPool.create() throws an exception, ensureIdle() cannot add a new idle object to the idle object list, which means the blocked thread never wakes up.

In other words, PooledObjectFactory.makeObject() should guarantee that a new object is created without an exception under any circumstances, but JedisFactory.makeObject() does not.
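
To make this failure path concrete, here is a small, self-contained simulation against the commons-pool2 2.4.x API (no Jedis involved); the failing factory below is a stand-in written to mimic a JedisFactory pointed at a dead master, not the real one:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;

public class PoolStarvationSketch {
    static volatile boolean nodeDown = false;

    public static void main(String[] args) throws Exception {
        GenericObjectPoolConfig config = new GenericObjectPoolConfig();
        config.setMaxTotal(1);
        config.setBlockWhenExhausted(true);
        config.setMaxWaitMillis(-1); // wait forever, like the default JedisPoolConfig

        final GenericObjectPool<Object> pool = new GenericObjectPool<Object>(
                new BasePooledObjectFactory<Object>() {
                    @Override
                    public Object create() throws Exception {
                        // Stand-in for JedisFactory.makeObject() connecting to a dead master.
                        if (nodeDown) throw new Exception("connection refused");
                        return new Object();
                    }

                    @Override
                    public PooledObject<Object> wrap(Object obj) {
                        return new DefaultPooledObject<Object>(obj);
                    }

                    @Override
                    public boolean validateObject(PooledObject<Object> p) {
                        // Returned objects become invalid once the node is down,
                        // so the pool destroys them on return.
                        return !nodeDown;
                    }
                }, config);
        pool.setTestOnReturn(true);

        Object borrowed = pool.borrowObject(); // pool is now exhausted (maxTotal = 1)

        Thread waiter = new Thread(new Runnable() {
            public void run() {
                try {
                    pool.borrowObject(); // parks in LinkedBlockingDeque.takeFirst()
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        waiter.start();
        Thread.sleep(500); // let the waiter park

        nodeDown = true;
        // The return fails validation -> destroy -> ensureIdle(1, false) -> create()
        // throws (and is swallowed) -> no idle object is added -> the waiter never wakes.
        pool.returnObject(borrowed);

        waiter.join(5000);
        System.out.println("waiter state after 5s: " + waiter.getState()); // likely WAITING
    }
}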

@HeartSaVioR (Contributor)

@marcosnils
How about this approach?

  • JedisFactory.makeObject() just creates the Jedis instance, which normally does not throw an exception.
  • When JedisFactory.activateObject() is called, let the pooled Jedis instance connect to Redis, check auth, select the db, and so on.
    • Note that activateObject() will be called multiple times, so it must be idempotent.

As a result, creating an idle object never fails, but activation may fail in the borrowing thread, which is what we want. A sketch of this split follows below.
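
Here is a minimal sketch of what that split could look like, written against the commons-pool2 PooledObjectFactory interface. LazyConnectJedisFactory and its fields are hypothetical names, not the real JedisFactory:

import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.PooledObjectFactory;
import org.apache.commons.pool2.impl.DefaultPooledObject;

import redis.clients.jedis.Jedis;

// Hedged sketch of the proposed split: makeObject() only instantiates,
// activateObject() does the connect/auth work in the borrowing thread.
class LazyConnectJedisFactory implements PooledObjectFactory<Jedis> {
    private final String host;
    private final int port;

    LazyConnectJedisFactory(String host, int port) {
        this.host = host;
        this.port = port;
    }

    @Override
    public PooledObject<Jedis> makeObject() {
        // No network I/O here, so the pool's create()/ensureIdle() path cannot fail.
        return new DefaultPooledObject<Jedis>(new Jedis(host, port));
    }

    @Override
    public void activateObject(PooledObject<Jedis> p) {
        Jedis jedis = p.getObject();
        // Idempotent: only connect if not already connected.
        if (!jedis.isConnected()) {
            jedis.connect(); // may throw JedisConnectionException in the borrowing thread
            // auth / select db would go here as well
        }
    }

    @Override
    public void passivateObject(PooledObject<Jedis> p) { /* no-op */ }

    @Override
    public boolean validateObject(PooledObject<Jedis> p) {
        return p.getObject().isConnected();
    }

    @Override
    public void destroyObject(PooledObject<Jedis> p) {
        p.getObject().close();
    }
}

With this split, a connection failure surfaces as an activation failure in the borrowing thread, which borrowObject() turns into an exception instead of an indefinite park.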

@HeartSaVioR (Contributor)

A related issue has been found: https://issues.apache.org/jira/browse/POOL-303

@HeartSaVioR (Contributor)

@marcosnils
It seems the new approach cannot resolve the blocked threads. I need to dig more.

@marcosnils (Contributor)

@HeartSaVioR Seems like this is a non-blocking issue that doesn't impact the release. Can we move it to 2.8.2 so we can release 2.8.1?

@HeartSaVioR (Contributor)

@marcosnils This issue seems major to me, so it would be better to resolve it ASAP, but I also agree that it is not a blocking issue. We can move it to 2.8.2.

@marcosnils (Contributor)

@HeartSaVioR I agree that the issue is major but non-blocking. I'll move it to 2.8.2 so we can release 2.8.1 and 2.9.0.

@liujg (Author) commented Mar 13, 2016

@HeartSaVioR
I agree with your explanation of GenericObjectPool. Until the GenericObjectPool bug is fixed, can Jedis do something to avoid this? The code I submitted is meant to avoid exactly that.
@marcosnils Do you mean 2.8.2 could fix this problem?

@marcosnils (Contributor)

@liujg @HeartSaVioR Seems like https://issues.apache.org/jira/browse/POOL-303 has already been fixed, which should fix this problem. Is there any chance you can try with the master version of commons-pool to validate it?

@Anisotrop (Contributor)

@marcosnils I have tried to look at this problem and created the following unit test:

package redis.clients.jedis.tests.cluster;

import org.junit.Test;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ClusterThreadsTest {
    @Test
    public void testMultiThreaded() throws InterruptedException {
        Set<HostAndPort> nodes = new HashSet<HostAndPort>();
        nodes.add(new HostAndPort("localhost", 30001));
        nodes.add(new HostAndPort("localhost", 30002));
        nodes.add(new HostAndPort("localhost", 30003));
        nodes.add(new HostAndPort("localhost", 30004));
        nodes.add(new HostAndPort("localhost", 30005));
        nodes.add(new HostAndPort("localhost", 30006));
        final JedisCluster cluster = new JedisCluster(nodes);
        cluster.set("a", "b");

        ExecutorService executorService = Executors.newFixedThreadPool(100);
        List<Callable<Object>> tasks = new ArrayList<Callable<Object>>(100);

        for (int i = 0; i < 200; i++) {
            Callable<Object> readTask = new Callable<Object>() {
                @Override
                public Object call() throws Exception {
                    System.out.println(cluster.get("a-" + Thread.currentThread().getName()) + "-" + Thread.currentThread().getName());
                    Thread.sleep(100);
                    cluster.set("a-" + Thread.currentThread().getName(), "b-" + Thread.currentThread().getName());
                    return null;
                }
            };
            tasks.add(readTask);
        }

        // Run the task batch repeatedly; if a master is killed meanwhile,
        // some worker threads end up blocked in the pool forever.
        CountDownLatch stop = new CountDownLatch(1000);
        while (stop.getCount() > 0L) {
            executorService.invokeAll(tasks);
            stop.countDown();
        }
        executorService.shutdown();
        cluster.close();
    }
}

As a prerequisite, a cluster should be started. The method I used was to execute:

./create-cluster create
./create-cluster start

from redis-3.2.0/utils/create-cluster with default values.

Then, in order to reproduce the problem, kill processes (with kill -9) that are part of the cluster:

21321  0.2  0.1 141028  7764 ?        Ssl  13:35   0:00 ../../src/redis-server *:30001 [cluster]
21325  0.2  0.1 141028  7776 ?        Ssl  13:35   0:00 ../../src/redis-server *:30002 [cluster]
21327  0.2  0.1 141028  7780 ?        Ssl  13:35   0:00 ../../src/redis-server *:30003 [cluster]
21331  0.5  0.2 141028  9804 ?        Ssl  13:35   0:00 ../../src/redis-server *:30004 [cluster]
21337  0.7  0.2 141028  9812 ?        Ssl  13:35   0:00 ../../src/redis-server *:30005 [cluster]
21341  0.8  0.2 141028  9812 ?        Ssl  13:35   0:00 ../../src/redis-server *:30006 [cluster]

Generally, killing the first and the fourth processes should cause the threads to block:

Name: pool-1-thread-58
State: WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@288ff99f
Total blocked: 5  Total waited: 1,394

Stack trace: 
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

I have also checked out the latest commons-pool2 and built it (mvn clean install -DskipTests=true). When referencing it from the Jedis project, the issue still seems to reproduce.

@marcosnils (Contributor) commented Jun 1, 2016

@Anisotrop You're right. I just tested with the latest commons-pool version, and it seems the problem persists even though POOL-303 has been included; threads are still parked at borrowObject in commons-pool.

In the meantime, a workaround is to set setMaxWaitMillis in JedisPoolConfig so that commons-pool times out and retrieves a new instance. This allows Jedis to keep working while the cluster is recovering.

Here's a sample snippet to configure JedisCluster:

        Set<HostAndPort> nodes = new HashSet<HostAndPort>();
        nodes.add(new HostAndPort("localhost", 30001));
        nodes.add(new HostAndPort("localhost", 30002));
        nodes.add(new HostAndPort("localhost", 30003));
        nodes.add(new HostAndPort("localhost", 30004));
        nodes.add(new HostAndPort("localhost", 30005));
        nodes.add(new HostAndPort("localhost", 30006));
        JedisPoolConfig config = new JedisPoolConfig();
        config.setMaxWaitMillis(1000); // You can tune this to your preference.
        final JedisCluster cluster = new JedisCluster(nodes, config);

I'll keep investigating the commons-pool problem and let their maintainers know about the issue. It might be the same thing @HeartSaVioR mentioned before.

@Anisotrop (Contributor)

@marcosnils In that case, could a release with some of these fixes be expected in the near future?

@marcosnils (Contributor)

@Anisotrop I don't think this PR is actually necessary. commons-pool should handle these situations natively. I'd like to investigate a bit further to see if there's something we can do at the commons-pool level to avoid unnecessary changes in Jedis.

@Anisotrop (Contributor)

@marcosnils Did you also get the chance to test with the following settings?

        JedisPoolConfig config = new JedisPoolConfig();
        config.setBlockWhenExhausted(false);

When a thread cannot obtain a connection, an exception with "Could not get a resource from the pool" is thrown. While the cluster is rebalancing, an exception with the message "Too many Cluster redirections?" seems to be thrown, but once the cluster comes back the connections appear to work again.

The list of nodes is not updated, however (or so it seems), as the killed node still does not appear as closed.

Note, please, that in the code sample I provided, the line:

 for (int i = 0; i < 200; i++) {

should be

 for (int i = 0; i < 100; i++) {

@marcosnils marcosnils modified the milestones: 2.8.3, 2.8.2 Jul 22, 2016
@ljrmorgan

@marcosnils This issue bit us pretty badly: we're using Redis 3.2.5 in cluster mode, with Jedis 2.9.0 and Spring Data Redis 1.7.5. We killed one of our master nodes, and calls to Spring Data Redis's RedisCache::get() started blocking indefinitely.

I've been able to reproduce this and capture a stack trace:

java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006879ddf18> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
	at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst(LinkedBlockingDeque.java:583)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:442)
	at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
	at redis.clients.util.Pool.getResource(Pool.java:49)
	at redis.clients.jedis.JedisPool.getResource(JedisPool.java:226)
	at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnectionFromSlot(JedisSlotBasedConnectionHandler.java:66)
	at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:116)
	at redis.clients.jedis.JedisClusterCommand.runBinary(JedisClusterCommand.java:60)
	at redis.clients.jedis.BinaryJedisCluster.exists(BinaryJedisCluster.java:108)
	at org.springframework.data.redis.connection.jedis.JedisClusterConnection.exists(JedisClusterConnection.java:3339)
	at sun.reflect.GeneratedMethodAccessor146.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke(CloseSuppressingInvocationHandler.java:57)
	at com.sun.proxy.$Proxy132.exists(Unknown Source)
	at org.springframework.data.redis.cache.RedisCache$AbstractRedisCacheCallback.waitForLock(RedisCache.java:505)
	at org.springframework.data.redis.cache.RedisCache$AbstractRedisCacheCallback.doInRedis(RedisCache.java:468)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:204)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:166)
	at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:154)
	at org.springframework.data.redis.cache.RedisCache.get(RedisCache.java:138)
	at org.springframework.data.redis.cache.RedisCache.get(RedisCache.java:97)
	<snip>

We seem to have 20 threads stuck in that state.

I'll try setting setMaxWaitMillis and see if I can still reproduce this. Is there anything else you'd like me to try, since I can now reproduce this fairly easily?

This feels like a pretty major issue to me: as far as I can tell, the default behaviour is to hang forever if a master goes down.

@marcosnils (Contributor)

@ljrmorgan This is not a Jedis issue but a commons-pool one. I'll try to open an issue with them to see if they can assist with this scenario.

sazzad16 added a commit to sazzad16/jedis that referenced this pull request Nov 29, 2017
sazzad16 added a commit that referenced this pull request Dec 12, 2017
sazzad16 added a commit that referenced this pull request Dec 12, 2017
sazzad16 added a commit that referenced this pull request Dec 12, 2017
joyang1 pushed a commit to joyang1/jedis that referenced this pull request Dec 27, 2018