
Jedis returning wrong values after long command execution #909

Closed
mpaoletta opened this issue Mar 6, 2015 · 38 comments
@mpaoletta

We're having a severe app/driver bug on our heavily concurrent app.

This is a single Redis installation with over 4M keys. To reproduce the problem, we execute a long-running command from redis-cli, such as KEYS * or DEBUG SLEEP (purely to reproduce the problem). Our Jedis-based client code then starts to throw errors: first socket read timeouts, which are expected, and then typical Jedis concurrency errors, such as failing to cast a string to a binary value. Worst of all, it sometimes appears to return wrong keys. These errors keep occurring even after the long-running command finishes.

We were using a Jedis instance in each akka actor instance, so it shouldn't be a concurrency issue in our code (we were fairly sure the Redis connection wasn't shared between threads), but we switched to JedisPool nevertheless. The problem remained (we are using Jedis 2.6.2).

The usual pattern is (this is Scala):

        val redis = MicrositeActor.jedisPool.getResource()
        try {
          redis.getSet(token + ":sent", "sent") == null
        } finally {
          MicrositeActor.jedisPool.returnResource(redis)
        }

We use the following commands:

get, exists, hgetAll, getSet, expire, ttl, multi/exec, set, del.

Are you aware of any concurrency issue on any of these commands?

We are currently evaluating other drivers, but we would really appreciate any tips regarding this.

TIA,
Martin Paoletta

@mpaoletta (Author)

I closed the original redis issue because it seems it is not related to redis but to Jedis (about to upload a simple case to reproduce it)

@nykolaslima (Contributor)

It would be nice if you could provide us a small test case for this problem.

I know that Jedis isn't thread-safe, but you said that you are using one per akka actor.

I'll try to take a look at it. If you can provide a test case, it will be easier 😄

@mpaoletta (Author)

We reproduced the issue with just GET commands on a single Redis instance. About to upload sample test code that reproduces it.

@nykolaslima (Contributor)

Ok, thank you. I'll wait for this.

@mpaoletta (Author)

This is simple groovy code that reproduces the issue:

    package main.groovy

    import redis.clients.jedis.*

    import java.util.concurrent.*

    static main(args) {
        def THREADS = 5

        def jedisPool = new JedisPool(new JedisPoolConfig(maxTotal: 10), 'localhost')

        // This populates the DB
        def j = jedisPool.resource
        100000.times { j.set("foo:$it", "bar$it") }
        jedisPool.returnResource(j)

        // Now we reproduce the issue
        def pool = Executors.newFixedThreadPool(THREADS)
        def defer = { c -> pool.submit(c as Callable) }
        def fib = { n ->
            defer {
                def jedis
                try {
                    jedis = jedisPool.resource
                    Thread.sleep(500)
                    def v = jedis.get("foo:$n")
                    def equals = { a, b -> a == b }
                    println "$n -> $v -- equals ${equals('bar' + n, v)}"
                } catch (e) {
                    e.printStackTrace()
                } finally {
                    jedisPool.returnResource(jedis)
                }
            }
        }

        (1..10000).each { n -> fib(n) }
        pool.shutdown()
    }

After populating the DB, it spawns threads that GET foo:&lt;number&gt;. In another window, we run DEBUG SLEEP 10 from redis-cli. The code throws SocketTimeoutException, and as soon as the sleep ends the GET commands return wrong values.

This happens even with just one thread.

@mpaoletta (Author)

Sorry about the Groovy code, but it was the fastest thing a coworker could whip up to reproduce the problem. Also, I don't know if it helps, but the Jedis instances that fail have a broken instance variable set to true.
Is there a pattern for handling errors?

@nykolaslima (Contributor)

When you get an error with the Jedis instance, you need to use returnBrokenResource instead of returnResource. This will invalidate the problematic Jedis instance, and I believe it will fix your problem.

Could you please try again with this fix?
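A minimal sketch of that pattern, using stand-in classes so it stays self-contained (SimplePool, FakeClient, and ConnectionError are hypothetical; with real Jedis, substitute JedisPool, Jedis, and JedisConnectionException):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for a connection error (JedisConnectionException in real Jedis).
class ConnectionError extends RuntimeException {}

// Stand-in client: returns "barN" for key "foo:N", or throws if broken.
class FakeClient {
    boolean broken = false;
    String get(String key) {
        if (broken) throw new ConnectionError();
        return "bar" + key.substring(key.indexOf(':') + 1);
    }
}

// Stand-in pool: healthy resources are reused, broken ones discarded.
class SimplePool {
    private final Deque<FakeClient> idle = new ArrayDeque<>();
    FakeClient getResource() {
        return idle.isEmpty() ? new FakeClient() : idle.pop();
    }
    void returnResource(FakeClient c) { idle.push(c); }
    void returnBrokenResource(FakeClient c) { /* discard: never pooled again */ }
    int idleCount() { return idle.size(); }
}

class PoolDemo {
    public static void main(String[] args) {
        SimplePool pool = new SimplePool();
        FakeClient client = pool.getResource();
        try {
            System.out.println(client.get("foo:1")); // prints bar1
            pool.returnResource(client);             // healthy: reuse it
        } catch (ConnectionError e) {
            pool.returnBrokenResource(client);       // broken: discard it
        }
        System.out.println(pool.idleCount()); // 1
    }
}
```

The key point is the branch in the catch block: a resource that threw a connection error must never go back into the pool via the normal path, or the next borrower inherits its desynchronized stream.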

@mpaoletta (Author)

Thanks a lot! that was it! Where is this in the docs?

@nykolaslima (Contributor)

I believe there is nothing documented about JedisPool. Did you see any documentation about it?

But anyway, I believe we should throw an exception when the Jedis instance is broken. This way we avoid awkward situations like that.

What do you think about throwing this exception @HeartSaVioR @marcosnils @xetorthio?

@HeartSaVioR (Contributor)

About the exception: handling JedisConnectionException would be enough, since Jedis transforms IOException into JedisConnectionException.

About documentation, sure, here it is:

https://github.com/xetorthio/jedis/wiki/Getting-started#using-jedis-in-a-multithreaded-environment

Since Jedis implements Closeable, we're fine calling Jedis.close() from a finally block, or just using try-with-resources.
If the instance was borrowed from a pool, close() will return it to the pool via the proper method, since it already knows whether a JedisConnectionException occurred.
If it wasn't borrowed from a pool, it will be closed.

Hope this helps.

ps. Maybe it's better to add a more detailed description to the wiki. ;)
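The try-with-resources shape looks like this, sketched with a stand-in class (FakeJedis is hypothetical, since real Jedis needs a live server; with real Jedis, `pool.getResource()` or `new Jedis("localhost")` would take its place, and close() does the pool-aware cleanup):

```java
// Stand-in for a Closeable client. With real Jedis, close() returns the
// instance to its pool (broken or not) or disconnects a standalone one.
class FakeJedis implements AutoCloseable {
    boolean closed = false;
    String get(String key) { return "value-of-" + key; }
    @Override public void close() { closed = true; }
}

class TryWithResourcesDemo {
    public static void main(String[] args) {
        String v;
        try (FakeJedis jedis = new FakeJedis()) {
            v = jedis.get("foo:1");
        } // close() runs here, even if get() had thrown
        System.out.println(v); // value-of-foo:1
    }
}
```

Because close() runs on every exit path, the caller no longer has to choose between returnResource and returnBrokenResource by hand.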

@nykolaslima (Contributor)

@HeartSaVioR I see your point. But because Jedis is an API consumed by many developers, some defensive programming would be nice here. We know users can close it the wrong way (not using Closeable, or using the wrong method), so it would be nice to prevent that, or at least warn about it.

Maybe instead of throwing an exception when the Jedis instance is invalid, we could remove returnBrokenResource and have returnResource decide what to do (we already do this in Jedis's close method). What do you think?

@mpaoletta (Author)

I ran into the problem using just a Jedis instance (no pool). Since my code is single-threaded (using actors), I thought I didn't need a pool. Even now, I can reproduce the issue without a pool (just try the sample code without one), so it doesn't seem right to require solving this at the pool level. Since Jedis already handles reconnection well, why doesn't it reconnect whenever it enters the broken state? What's the point of allowing calls on a connection you have already flagged as broken?

Just my 2 cents.

@marcosnils (Contributor)

We should definitely start by improving the wiki, as it doesn't say anywhere that upon an exception users need to use the returnBrokenResource method.

Regarding @mpaoletta's comments: Jedis shouldn't return wrong values even in the broken state; that's a very serious problem. I'll run some quick tests, and hopefully we can come up with the optimal solution together.

@HeartSaVioR (Contributor)

Maybe I don't have enough experience with socket programming.
But we're handling a "Stream": how can we assume the stream is OK (with no partial response left over) if the connection is broken or a timeout occurred while reading the response?

Could anyone explain how it can be safe to keep using a timed-out socket's InputStream?

Btw, the reconnect strategy @mpaoletta suggests seems good.
We can reconnect as soon as an IOException occurs, or on the next operation.
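The desynchronization described here can be sketched without sockets at all: if a read times out but the reply later lands in the buffer unread, the next command consumes the stale reply. This toy model (ToyConnection is hypothetical; no real Jedis involved) shows the off-by-one:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a request/response connection whose receive buffer
// survives a timeout. This is not the Jedis API, just an illustration.
class ToyConnection {
    private final Deque<String> replies = new ArrayDeque<>();

    // "Send" a command; the server's reply lands in the buffer,
    // possibly after the client has already given up waiting on it.
    void send(String key) { replies.addLast("value-of-" + key); }

    String readReply() { return replies.removeFirst(); }
}

class DesyncDemo {
    public static void main(String[] args) {
        ToyConnection conn = new ToyConnection();

        conn.send("foo:1");
        // Suppose the read times out here: the caller sees an exception,
        // but "value-of-foo:1" is still sitting unread in the buffer.

        conn.send("foo:2");
        // The next read returns the PREVIOUS command's reply:
        System.out.println(conn.readReply()); // value-of-foo:1, not foo:2
    }
}
```

Every subsequent reply stays shifted by one, which is exactly why a timed-out connection must be discarded (or its buffer fully drained) before reuse.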

@HeartSaVioR (Contributor)

I'd like to add some opinions about your issue, together with the issue you posted to the Redis repo.

2. Please increase the read timeout on Jedis/JedisPool to be long enough when you occasionally run long commands. Actually, it is the best way to resolve this issue.

ps. I removed 1. since you already understand Redis's behavior; I was misreading the issue from your Redis repo. Sorry about that. ;)

@HeartSaVioR (Contributor)

@nykolaslima
Yes, that could be one way to resolve the issue, and I like defensive programming too. :)

Btw, I'm afraid Jedis users are already familiar with returnResource / returnBrokenResource. Removing returnBrokenResource would break backward compatibility.

And I'm not sure it's better to have JedisPool ask Jedis for its state.
The current Jedis.close() may be better, because Jedis knows its own state, so it doesn't need to ask.

@HeartSaVioR (Contributor)

Updated wiki page: https://github.com/xetorthio/jedis/wiki/Getting-started

Btw, we need to rewrite the wiki pages; we don't need to explain replication in Getting-started.
We should explain Transaction, Pipeline, and JedisCluster in Getting-started instead, since I don't think those are advanced usages. Just my 2 cents.
The AdvancedUsage page is outdated, too.

@nykolaslima (Contributor)

@HeartSaVioR Actually I believe that having both returnResource and returnBrokenResource is unnecessary. It's too easy to make a mistake and not use returnBrokenResource when needed.

IMO, users should not have to know that something different must be done when a resource is broken. It should be the framework's responsibility to decide when to call returnBrokenResource.
As I said, I'd prefer a single returnResource that decides when to apply the broken-resource logic.

About breaking compatibility, I don't think it's an issue. We are planning 3.0, and in that version we can break things.
I believe this solution will make pool management easier for our users.

About automatic reconnection, I think that would be great too!

@marcosnils (Contributor)

I think like @nykolaslima too. We already support Closeable, which simplifies error handling, since it automatically returns the resource accordingly. Maybe for 3.0 a good strategy would be to remove returnResource and returnBrokenResource from JedisPool altogether. We can also start by deprecating these two methods in 2.x.

Regarding reconnections, Jedis already supports a very basic reconnection scheme, but there are lots of cases that haven't been tested correctly.

    Maybe I don't have enough experience with socket programming.
    But we're handling a "Stream": how can we assume the stream is OK (with no partial response left over) if the connection is broken or a timeout occurred while reading the response?

@HeartSaVioR I don't have a lot of experience in socket programming either. But if we make sure that whenever there's a connection error we clear any leftover buffers upon reconnection, we shouldn't see the errors @mpaoletta is seeing, like getting wrong values for a request. I'm not sure about this though; it's just an assumption. That's why I need to run further tests to see how this can be improved.

Suggestions and ideas are more than welcome 😄

@mpaoletta (Author)

Just a few comments:

We don't usually use the DEBUG SLEEP and KEYS commands in production; they were our last resort for reproducing the issue, after stressing the app for 3 hours without it breaking a sweat. The fact remains that stop-the-world events on Redis, whether from long transactions or anything else, can make Jedis return wrong values. I understand that returnBrokenResource is already a solution, but I believe this could be handled more gracefully on the API side.

What I truly believe is that if you already know the connection is in a broken state, you should at least throw some kind of exception on any operation on that Jedis instance, because returning wrong values just because someone isn't following the approved usage pattern is simply wrong. I'm OK with Jedis throwing exceptions all over the place, or performing slowly because of bad configuration; I can work with that, and it's something that will get my attention. But someone getting someone else's credit card statement is really hard to justify.

Also, I wasn't using try-with-resources because we use Scala, where it isn't available, and my first attempt to reproduce the problem wasn't very polished (since I was fixing a critical production issue, it had to be done very quickly). And I wasn't using a pool in the first place because I create a connection per actor instance, to avoid contention when getting connections from a pool. So fixing this at the pool level is IMHO not correct, because it leaves the Jedis instance unusable on its own.

Hope this helps.

@nykolaslima (Contributor)

We cannot assume that everybody will use try-with-resources.
People who use CDI will use @Produces and @Disposes, for example.

In my opinion we should do two things:

  1. Simplify the way users close the resource. We don't need two methods; we
     can have just one returnResource with the broken-resource logic in it.
  2. We should not allow operations on a broken Jedis instance. If a Jedis
     instance is broken, any attempt to execute a command should throw an
     exception.


@marcosnils (Contributor)

@nykolaslima @mpaoletta when I mentioned that Jedis supports Closeable, I was trying to clarify that we already have a close method on the Jedis instance which does the proper cleanup in every case (with or without a pool). Knowing this, my opinion is that we can just get rid of returnXXXResource in JedisPool. Of course the change breaks backward compatibility, but we can start by deprecating the two methods in 2.x and then remove them in a major version (3.x.x for example).

  1. Simplify the way users close the resource. We don't need two methods; we
     can have just one returnResource with the broken-resource logic in it.

@nykolaslima is there a reason we should keep returnResource in JedisPool? I think it's best to have only one way of handling proper cleanup, and as I see it we can just use the close method of the Jedis class.

  2. We should not allow operations on a broken Jedis instance. If a Jedis
     instance is broken, any attempt to execute a command should throw an
     exception.

I totally agree. I think we can work on two fronts for this issue:

  1. Improve the Jedis reconnection strategy.
  2. Avoid getting responses when Jedis is in the broken state, as you suggested.

@HeartSaVioR (Contributor)

@marcosnils
JedisPool has to provide at least one method for returning an object to the pool, because Jedis.close() uses it. We could try hiding these methods as package-private.

Btw, maybe we could get rid of the broken state instead of keeping it.
When Jedis detects an IOException internally, it would just disconnect quietly (swallowing the exception) ASAP and throw JedisConnectionException to let users know there's an issue. But that wouldn't mean the Jedis instance is invalid.
Thanks to the basic reconnection logic in Jedis, on the next operation Jedis would reconnect and nothing bad would happen.
Furthermore, no broken resources would exist anymore, so we wouldn't need returnBrokenResource in JedisPool / ShardedJedisPool / JedisSentinelPool.

Please correct me if I'm wrong. :)

@nykolaslima (Contributor)

@HeartSaVioR I don't think I understand you. We are already throwing an exception; the problem is that the Jedis instance remains in an invalid state.

I believe we all agree about deprecating returnBrokenResource. But we still need a solution for invalid Jedis instances.

@HeartSaVioR (Contributor)

I mean: why leave the Jedis instance in an invalid state when we can restore it ASAP by disconnecting (or maybe reconnecting)?
Jedis has basic reconnection logic (connect first if not connected), and we can use that here.

@xetorthio @marcosnils @nykolaslima
I'll make the change if we agree that my idea makes sense.

@nykolaslima (Contributor)

I agree with that, Heart! +1

@HeartSaVioR (Contributor)

Found another hurdle. ;(

The Connection layer would be easy, as I stated earlier.
The next issue hides in the Jedis layer: we also need to reset state (watch, multi) before reuse.
It could be resolved easily by catching JedisConnectionException and calling resetState(), but we would have to add that handling code to most of the methods!

Maybe another chance to apply the command pattern, though it's not good for memory consumption and performance?

@HeartSaVioR (Contributor)

Actually, we need to add a precondition to most of the methods once we decide to throw an exception when Jedis is broken.
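That precondition could look roughly like this. It is a sketch with stand-in classes, not the actual Jedis internals (the real change would guard each command method; the exception here is a stand-in sharing the name of Jedis's real JedisConnectionException):

```java
// Stand-in for Jedis's connection exception.
class JedisConnectionException extends RuntimeException {
    JedisConnectionException(String message) { super(message); }
}

// Illustrative guard: refuse to run commands on a broken client instead
// of silently reading from a possibly desynchronized stream.
class GuardedClient {
    private boolean broken = false;

    void markBroken() { broken = true; }

    private void checkNotBroken() {
        if (broken) {
            throw new JedisConnectionException(
                "connection is broken; its stream may be desynchronized");
        }
    }

    String get(String key) {
        checkNotBroken();          // precondition at the top of every command
        return "value-of-" + key;  // a real client would hit the socket here
    }
}
```

Failing fast like this turns "wrong value returned" into a loud, debuggable exception, which is exactly the defensive behavior discussed above.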

@marcosnils (Contributor)

@HeartSaVioR you're correct. By removing the returnXXXResource methods I meant making them protected, so users can't access them at all while we can still call them from Jedis's close method.

@nykolaslima what do you think about this? I tried to explain it in my previous comments, but maybe I haven't gotten the idea across.

Regarding reconnections, what about overriding the sendCommand method in BinaryClient? That way we have the isInMulti and isInWatch flags available to make the proper call to resetState.

The method in BinaryJedis would be something like this:

    @Override
    protected Connection sendCommand(Command cmd) {
        try {
            return super.sendCommand(cmd);
        } catch (JedisConnectionException jce) {
            if (isInMulti || isInWatch) {
                resetState();
            }
            throw jce;
        }
    }

@nykolaslima (Contributor)

@marcosnils I still don't understand why we would want to keep returnBrokenResource.
We can find out whether a Jedis instance is broken using the broken attribute. Because of this, IMO, we can have just returnResource handle both cases:

    public void returnResource(Jedis jedis) {
      if (jedis.isBroken()) {
        // returnBrokenResource logic here
      } else {
        // normal return logic here
      }
    }

I believe we could keep returnBrokenResource as a private method, just to keep the code cleaner. But I think we should not expose it to any other class, because it's not their responsibility to decide between the normal and the broken return paths.

(I don't know if I'm expressing my idea well here; sorry for my bad English.)

What is the reason for keeping returnBrokenResource exposed as public/protected?

@HeartSaVioR (Contributor)

@marcosnils
Unfortunately Jedis doesn't extend Connection; it has a Client, which extends Connection (a has-a relation).
It's on a different layer. ;(

@nykolaslima
I think it's better for JedisPool not to know Jedis deeply, especially its state.
And Jedis.close() can be used widely, whether the instance came from a pool or not.
So I think Jedis.close() is better than JedisPool.returnResource().

If we agree on that, hiding JedisPool's returnXXXResource() from users is the only remaining issue.
Please note that they still need to be visible to Jedis, so we can try package-private.

@marcosnils (Contributor)

@nykolaslima I'm getting dizzy here. I never said we should keep the returnBrokenResource method. Please re-read my last comment: I said we should make those two methods package-protected (as Heart said) and then only use jedis.close() for resource cleanup.

@HeartSaVioR I see what you're saying. Another idea: we could pass a reference to the BinaryJedis into the BinaryClient. That way, whenever the BinaryClient raises a connection exception, we can call resetState from the BinaryClient.

Thoughts?

@nykolaslima (Contributor)

@marcosnils I'm sorry, I misunderstood. Now I see what you're saying. So we all agree that we should make the returnXxxResource methods package-protected.

I'm not sure we need to reset the state when something goes wrong with a Jedis instance. IMO it would be better to just discard that instance, as returnBrokenResource does.

@marcosnils (Contributor)

@nykolaslima what if you're not using JedisPool and you have a simple program that talks to just one Redis instance? In that case, if you get an exception, you need to re-instantiate Jedis to continue working.

Nowadays we just flag the connection as broken and only use that flag for resource cleanup. As @HeartSaVioR said, I like the idea of removing the broken state entirely and allowing a Jedis instance to recover itself by reconnecting. Whenever a user tries to execute a command on a disconnected socket, we can throw an exception until the client reconnects.

What do you think?
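The recover-by-reconnecting idea can be sketched like this (stand-in class again, not real Jedis internals): each command first ensures a live connection and a clean buffer, so a failed instance heals on the next call instead of staying poisoned:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of "no broken state": on an IO failure we disconnect and the
// next command reconnects with a clean stream. SelfHealingClient is a
// hypothetical stand-in, not the Jedis API.
class SelfHealingClient {
    private boolean connected = false;
    private final Deque<String> buffer = new ArrayDeque<>(); // fake socket buffer

    private void ensureConnected() {
        if (!connected) {
            buffer.clear();   // drop any stale, half-read replies
            connected = true; // (re)connect
        }
    }

    void failConnection() { connected = false; } // simulates an IOException

    String get(String key) {
        ensureConnected();
        buffer.addLast("value-of-" + key); // fake request/response round trip
        return buffer.removeFirst();
    }
}
```

The crucial detail is clearing the buffer on reconnect: reconnecting without discarding leftover replies would reintroduce the off-by-one desynchronization discussed earlier in this thread.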

@HeartSaVioR (Contributor)

Hmm, command pattern or cross-reference; neither is really good. ;(
Actually I was also thinking about removing Jedis state (watch, multi, pipeline), but that is not easy either.

This issue is no longer simple, so let's resolve it iteratively:
A. Make JedisPool.returnXXXResource() package-private so that only Jedis.close() can see them.
B. Add a precondition to all operation methods that checks the broken state and throws JedisConnectionException if it's broken.
C. Find the best way to remove the broken state entirely (i.e., recover automatically).

Since we're discussing on a closed issue, how about creating a new issue and migrating our conversation there?

@nykolaslima (Contributor)

I agree with you Heart.

Let's do it by parts and open a new issue to discuss it!

marcosnils added a commit that referenced this issue Mar 9, 2015
To improve usability we're changing the visibility of `returnResource`
and `returnBrokenResource` in favor of `jedis.close()`, which
automatically returns the resource to the pool accordingly.
As `Pool` and `Jedis` are currently in different packages, I've created
a JedisPoolAbstract class to provide a bridge between the two
implementations.
marcosnils added a commit that referenced this issue Mar 9, 2015
@marcosnils (Contributor)

I've opened two PRs to fix the returnXXXResource usage. I'm closing this issue and will open another one to decide what to do about the broken connection state in Jedis.

@nykolaslima (Contributor)

Great @marcosnils 🎉

marcosnils added a commit that referenced this issue Mar 10, 2015
Conflicts:
	src/main/java/redis/clients/jedis/BinaryJedis.java
	src/main/java/redis/clients/jedis/BinaryShardedJedis.java
marcosnils added a commit that referenced this issue Mar 10, 2015
4 participants