
Jedis returning wrong values after long command execution #909

Closed
mpaoletta opened this issue Mar 6, 2015 · 38 comments
@mpaoletta

We're having a severe app/driver bug on our heavily concurrent app.

This is a single Redis installation with over 4M keys. To reproduce the problem, we execute a long-running command from redis-cli, such as KEYS * or DEBUG SLEEP (purely to reproduce the problem). Our Jedis-based client code then starts to throw errors: first socket read timeouts, which are expected, and then typical Jedis concurrency errors, such as failing to cast a string to a binary value. Worst of all, it sometimes appears to return wrong keys. These errors keep occurring even after the long-running command finishes.

We were using a Jedis instance in each akka actor instance, so it shouldn't be a concurrency issue in our code (we were fairly sure the Redis connection wasn't shared between threads), but we switched to JedisPool nevertheless. The problem remained (we are using Jedis 2.6.2).

The usual pattern is (this is Scala):

        val redis = MicrositeActor.jedisPool.getResource()
        try {
          redis.getSet(token + ":sent", "sent") == null
        } finally {
          MicrositeActor.jedisPool.returnResource(redis)
        }

We use the following commands:

get, exists, hgetAll, getSet, expire, ttl, multi/exec, set, del.

Are you aware of any concurrency issue on any of these commands?

We are currently evaluating other drivers, but we would really appreciate any tips regarding this.

TIA,
Martin Paoletta

@mpaoletta (Author)

I closed the original redis issue because it seems it is not related to redis but to Jedis (about to upload a simple case to reproduce it)

@nykolaslima (Contributor)

It would be nice if you could provide us a small test case for this problem.

I know that Jedis isn't thread-safe, but you said that you are using one per akka actor.

I'll try to take a look at it. If you can provide a test case, it will be easier 😄

@mpaoletta (Author)

We reproduced the issue with just GET commands on a single Redis instance. About to upload sample test code that reproduces it.

@nykolaslima (Contributor)

Ok, thank you. I'll wait for this.

@mpaoletta (Author)

This is simple groovy code that reproduces the issue:

    package main.groovy

    import redis.clients.jedis.*

    import java.util.concurrent.*

    static main(args) {
        def THREADS = 5

        def jedisPool = new JedisPool(new JedisPoolConfig(maxTotal: 10), 'localhost')

        // This populates the DB
        def j = jedisPool.resource
        100000.times { j.set("foo:$it", "bar$it") }
        jedisPool.returnResource(j)

        // Now we reproduce the issue
        def pool = Executors.newFixedThreadPool(THREADS)
        def defer = { c -> pool.submit(c as Callable) }
        def fib = { n ->
            defer {
                def jedis
                try {
                    jedis = jedisPool.resource
                    Thread.sleep(500)
                    def v = jedis.get("foo:$n")
                    def equals = { a, b -> a == b }
                    println "$n -> $v -- equals ${equals('bar' + n, v)}"
                } catch (e) {
                    e.printStackTrace()
                } finally {
                    jedisPool.returnResource(jedis)
                }
            }
        }

        (1..10000).each { n -> fib(n) }
        pool.shutdown()
    }

After populating the DB, it spawns threads that GET foo:&lt;number&gt;. In another window, we run DEBUG SLEEP 10 from redis-cli. The code throws SocketTimeoutException, and as soon as the sleep ends the GET commands return wrong values.

This happens even with just one thread.

@mpaoletta (Author)

Sorry about the Groovy code, but it was the fastest thing a coworker could whip up to reproduce the problem. Also, I don't know if it helps, but the Jedis instances that fail have a broken instance variable set to true.
Is there a pattern for handling errors?

@nykolaslima (Contributor)

When you get an error with the Jedis instance, you need to use returnBrokenResource instead of returnResource. This will invalidate the problematic Jedis instance, and I believe it will fix your problem.

Could you please try again with this fix?
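A minimal sketch of that pattern, using stand-in classes so it stays self-contained (SimplePool, FakeClient, and ConnectionError are hypothetical; with real Jedis, substitute JedisPool, Jedis, and JedisConnectionException):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for a connection error (JedisConnectionException in real Jedis).
class ConnectionError extends RuntimeException {}

// Stand-in client: returns "barN" for key "foo:N", or throws if broken.
class FakeClient {
    boolean broken = false;
    String get(String key) {
        if (broken) throw new ConnectionError();
        return "bar" + key.substring(key.indexOf(':') + 1);
    }
}

// Stand-in pool: healthy resources are reused, broken ones discarded.
class SimplePool {
    private final Deque<FakeClient> idle = new ArrayDeque<>();
    FakeClient getResource() {
        return idle.isEmpty() ? new FakeClient() : idle.pop();
    }
    void returnResource(FakeClient c) { idle.push(c); }
    void returnBrokenResource(FakeClient c) { /* discard: never pooled again */ }
    int idleCount() { return idle.size(); }
}

class PoolDemo {
    public static void main(String[] args) {
        SimplePool pool = new SimplePool();
        FakeClient client = pool.getResource();
        try {
            System.out.println(client.get("foo:1")); // prints bar1
            pool.returnResource(client);             // healthy: reuse it
        } catch (ConnectionError e) {
            pool.returnBrokenResource(client);       // broken: discard it
        }
        System.out.println(pool.idleCount()); // 1
    }
}
```

The key point is the branch in the catch block: a resource that threw a connection error must never go back into the pool via the normal path, or the next borrower inherits its desynchronized stream.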

@mpaoletta (Author)

Thanks a lot! that was it! Where is this in the docs?

@nykolaslima (Contributor)

I believe there is nothing documented about JedisPool. Did you see any documentation about it?

But anyway, I believe we should throw an exception when the Jedis instance is broken. This way we avoid awkward situations like that.

What do you think about throwing this exception @HeartSaVioR @marcosnils @xetorthio?

@HeartSaVioR (Contributor)

About the exception: handling JedisConnectionException would be enough, since Jedis transforms IOException into JedisConnectionException.

About documentation, sure, here it is:

https://github.com/xetorthio/jedis/wiki/Getting-started#using-jedis-in-a-multithreaded-environment

Since Jedis implements Closeable, we're fine calling Jedis.close() from a finally block, or just using try-with-resources.
If the instance was borrowed from a pool, close() will return it to the pool via the proper method, since it already knows whether a JedisConnectionException occurred.
If it wasn't borrowed from a pool, it will be closed.

Hope this helps.

ps. Maybe it's better to add a more detailed description to the wiki. ;)
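The try-with-resources shape looks like this, sketched with a stand-in class (FakeJedis is hypothetical, since real Jedis needs a live server; with real Jedis, `pool.getResource()` or `new Jedis("localhost")` would take its place, and close() does the pool-aware cleanup):

```java
// Stand-in for a Closeable client. With real Jedis, close() returns the
// instance to its pool (broken or not) or disconnects a standalone one.
class FakeJedis implements AutoCloseable {
    boolean closed = false;
    String get(String key) { return "value-of-" + key; }
    @Override public void close() { closed = true; }
}

class TryWithResourcesDemo {
    public static void main(String[] args) {
        String v;
        try (FakeJedis jedis = new FakeJedis()) {
            v = jedis.get("foo:1");
        } // close() runs here, even if get() had thrown
        System.out.println(v); // value-of-foo:1
    }
}
```

Because close() runs on every exit path, the caller no longer has to choose between returnResource and returnBrokenResource by hand.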

@nykolaslima (Contributor)

@HeartSaVioR I see your point. But because Jedis is an API consumed by many developers, some defensive programming would be nice here. We know users can close it the wrong way (not using Closeable, or using the wrong method), so it would be nice to prevent that, or at least warn about it.

Maybe instead of throwing an exception when the Jedis instance is invalid, we could remove returnBrokenResource and have returnResource decide what to do (we already do this in Jedis's close method). What do you think?

@mpaoletta (Author)

I ran into the problem using just a Jedis instance (no pool). Since my code is single-threaded (using actors), I thought I didn't need a pool. Even now, I can reproduce the issue without a pool (just try the sample code without one), so it doesn't seem right to require solving this at the pool level. Since Jedis already handles reconnection well, why doesn't it reconnect whenever it enters the broken state? What's the point of allowing calls on a connection you have already flagged as broken?

Just my 2 cents.

@marcosnils (Contributor)

We should definitely start by improving the wiki, as it doesn't say anywhere that upon an exception users need to use the returnBrokenResource method.

Regarding @mpaoletta's comments: Jedis shouldn't return wrong values even in the broken state; that's a very serious problem. I'll run some quick tests, and hopefully we can come up with the optimal solution together.

@HeartSaVioR (Contributor)

Maybe I don't have enough experience with socket programming.
But we're handling a "Stream": how can we assume the stream is OK (with no partial response left over) if the connection is broken or a timeout occurred while reading the response?

Could anyone explain how it can be safe to keep using a timed-out socket's InputStream?

Btw, the reconnect strategy @mpaoletta suggests seems good.
We can reconnect as soon as an IOException occurs, or on the next operation.
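The desynchronization described here can be sketched without sockets at all: if a read times out but the reply later lands in the buffer unread, the next command consumes the stale reply. This toy model (ToyConnection is hypothetical; no real Jedis involved) shows the off-by-one:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a request/response connection whose receive buffer
// survives a timeout. This is not the Jedis API, just an illustration.
class ToyConnection {
    private final Deque<String> replies = new ArrayDeque<>();

    // "Send" a command; the server's reply lands in the buffer,
    // possibly after the client has already given up waiting on it.
    void send(String key) { replies.addLast("value-of-" + key); }

    String readReply() { return replies.removeFirst(); }
}

class DesyncDemo {
    public static void main(String[] args) {
        ToyConnection conn = new ToyConnection();

        conn.send("foo:1");
        // Suppose the read times out here: the caller sees an exception,
        // but "value-of-foo:1" is still sitting unread in the buffer.

        conn.send("foo:2");
        // The next read returns the PREVIOUS command's reply:
        System.out.println(conn.readReply()); // value-of-foo:1, not foo:2
    }
}
```

Every subsequent reply stays shifted by one, which is exactly why a timed-out connection must be discarded (or its buffer fully drained) before reuse.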

@HeartSaVioR (Contributor)

I'd like to add some opinions about your issue, together with the issue you posted to the Redis repo.

2. Please increase the read timeout on Jedis/JedisPool to be long enough when you occasionally run long commands. Actually, it is the best way to resolve this issue.

ps. I removed 1. since you already understand Redis's behavior; I was misreading the issue from your Redis repo. Sorry about that. ;)

@HeartSaVioR (Contributor)

@nykolaslima
Yes, that could be one way to resolve the issue, and I like defensive programming too. :)

Btw, I'm afraid Jedis users are already familiar with returnResource / returnBrokenResource. Removing returnBrokenResource would break backward compatibility.

And I'm not sure it's better to have JedisPool ask Jedis for its state.
The current Jedis.close() may be better, because Jedis knows its own state, so it doesn't need to ask.

@HeartSaVioR (Contributor)

Updated wiki page: https://github.com/xetorthio/jedis/wiki/Getting-started

Btw, we need to rewrite the wiki pages; we don't need to explain replication in Getting-started.
We should explain Transaction, Pipeline, and JedisCluster in Getting-started instead, since I don't think those are advanced usages. Just my 2 cents.
The AdvancedUsage page is outdated, too.

@nykolaslima (Contributor)

@HeartSaVioR Actually I believe that having both returnResource and returnBrokenResource is unnecessary. It's too easy to make a mistake and not use returnBrokenResource when needed.

IMO, users should not have to know that something different must be done when a resource is broken. It should be the framework's responsibility to decide when to call returnBrokenResource.
As I said, I'd prefer a single returnResource that decides when to apply the broken-resource logic.

About breaking compatibility, I don't think it's an issue. We are planning 3.0, and in that version we can break things.
I believe this solution will make pool management easier for our users.

About automatic reconnection, I think that would be great too!

@marcosnils (Contributor)

I think like @nykolaslima too. We already support Closeable, which simplifies error handling, since it automatically returns the resource accordingly. Maybe for 3.0 a good strategy would be to remove returnResource and returnBrokenResource from JedisPool altogether. We can also start by deprecating these two methods in 2.x.

Regarding reconnections, Jedis already supports a very basic reconnection scheme, but there are lots of cases that haven't been tested correctly.

    Maybe I don't have enough experience with socket programming.
    But we're handling a "Stream": how can we assume the stream is OK (with no partial response left over) if the connection is broken or a timeout occurred while reading the response?

@HeartSaVioR I don't have a lot of experience in socket programming either. But if we make sure that whenever there's a connection error we clear any leftover buffers upon reconnection, we shouldn't see the errors @mpaoletta is seeing, like getting wrong values for a request. I'm not sure about this though; it's just an assumption. That's why I need to run further tests to see how this can be improved.

Suggestions and ideas are more than welcome 😄

@mpaoletta (Author)

Just a few comments:

We don't usually use the DEBUG SLEEP and KEYS commands in production; they were our last resort for reproducing the issue, after stressing the app for 3 hours without it breaking a sweat. The fact remains that stop-the-world events on Redis, whether from long transactions or anything else, can make Jedis return wrong values. I understand that returnBrokenResource is already a solution, but I believe this could be handled more gracefully on the API side.

What I truly believe is that if you already know the connection is in a broken state, you should at least throw some kind of exception on any operation on that Jedis instance, because returning wrong values just because someone isn't following the approved usage pattern is simply wrong. I'm OK with Jedis throwing exceptions all over the place, or performing slowly because of bad configuration; I can work with that, and it's something that will get my attention. But someone getting someone else's credit card statement is really hard to justify.

Also, I wasn't using try-with-resources because we use Scala, where it isn't available, and my first attempt to reproduce the problem wasn't very polished (since I was fixing a critical production issue, it had to be done very quickly). And I wasn't using a pool in the first place because I create a connection per actor instance, to avoid contention when getting connections from a pool. So fixing this at the pool level is IMHO not correct, because it leaves the Jedis instance unusable on its own.

Hope this helps.

@nykolaslima (Contributor)

We cannot assume that everybody will use try-with-resources.
People who use CDI will use @Produces and @Disposes, for example.

In my opinion we should do two things:

  1. Simplify the way users close the resource. We don't need two methods; we
     can have just one returnResource with the broken-resource logic in it.
  2. We should not allow operations on a broken Jedis instance. If a Jedis
     instance is broken, any attempt to execute a command should throw an
     exception.


@marcosnils (Contributor)

@nykolaslima @mpaoletta when I mentioned that Jedis supports Closeable, I was trying to clarify that we already have a close method on the Jedis instance which does the proper cleanup in every case (with or without a pool). Knowing this, my opinion is that we can just get rid of returnXXXResource in JedisPool. Of course the change breaks backward compatibility, but we can start by deprecating the two methods in 2.x and then remove them in a major version (3.x.x for example).

  1. Simplify the way users close the resource. We don't need two methods; we
     can have just one returnResource with the broken-resource logic in it.

@nykolaslima is there a reason we should keep returnResource in JedisPool? I think it's best to have only one way of handling proper cleanup, and as I see it we can just use the close method of the Jedis class.

  2. We should not allow operations on a broken Jedis instance. If a Jedis
     instance is broken, any attempt to execute a command should throw an
     exception.

I totally agree. I think we can work on two fronts for this issue:

  1. Improve the Jedis reconnection strategy.
  2. Avoid getting responses when Jedis is in the broken state, as you suggested.

@HeartSaVioR (Contributor)

@marcosnils
JedisPool has to provide at least one method for returning an object to the pool, because Jedis.close() uses it. We could try hiding these methods as package-private.

Btw, maybe we could get rid of the broken state instead of keeping it.
When Jedis detects an IOException internally, it would just disconnect quietly (swallowing the exception) ASAP and throw JedisConnectionException to let users know there's an issue. But that wouldn't mean the Jedis instance is invalid.
Thanks to the basic reconnection logic in Jedis, on the next operation Jedis would reconnect and nothing bad would happen.
Furthermore, no broken resources would exist anymore, so we wouldn't need returnBrokenResource in JedisPool / ShardedJedisPool / JedisSentinelPool.

Please correct me if I'm wrong. :)

@nykolaslima (Contributor)

@HeartSaVioR I don't think I understand you. We are already throwing an exception; the problem is that the Jedis instance remains in an invalid state.

I believe we all agree about deprecating returnBrokenResource. But we still need a solution for invalid Jedis instances.

@HeartSaVioR (Contributor)

I mean: why leave the Jedis instance in an invalid state when we can restore it ASAP by disconnecting (or maybe reconnecting)?
Jedis has basic reconnection logic (connect first if not connected), and we can use that here.

@xetorthio @marcosnils @nykolaslima
I'll make the change if we agree that my idea makes sense.

@nykolaslima (Contributor)

I agree with that, Heart! +1

@HeartSaVioR (Contributor)

Found another hurdle. ;(

The Connection layer would be easy, as I stated earlier.
The next issue hides in the Jedis layer: we also need to reset state (watch, multi) before reuse.
It could be resolved easily by catching JedisConnectionException and calling resetState(), but we would have to add that handling code to most of the methods!

Maybe another chance to apply the command pattern, though it's not good for memory consumption and performance?

@HeartSaVioR (Contributor)

Actually, we need to add a precondition to most of the methods once we decide to throw an exception when Jedis is broken.
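That precondition could look roughly like this. It is a sketch with stand-in classes, not the actual Jedis internals (the real change would guard each command method; the exception here is a stand-in sharing the name of Jedis's real JedisConnectionException):

```java
// Stand-in for Jedis's connection exception.
class JedisConnectionException extends RuntimeException {
    JedisConnectionException(String message) { super(message); }
}

// Illustrative guard: refuse to run commands on a broken client instead
// of silently reading from a possibly desynchronized stream.
class GuardedClient {
    private boolean broken = false;

    void markBroken() { broken = true; }

    private void checkNotBroken() {
        if (broken) {
            throw new JedisConnectionException(
                "connection is broken; its stream may be desynchronized");
        }
    }

    String get(String key) {
        checkNotBroken();          // precondition at the top of every command
        return "value-of-" + key;  // a real client would hit the socket here
    }
}
```

Failing fast like this turns "wrong value returned" into a loud, debuggable exception, which is exactly the defensive behavior discussed above.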

@marcosnils (Contributor)

@HeartSaVioR you're correct. By removing the returnXXXResource methods I meant making them protected, so users can't access them at all while we can still call them from Jedis's close method.

@nykolaslima what do you think about this? I tried to explain it in my previous comments, but maybe I haven't gotten the idea across.

Regarding reconnections, what about overriding the sendCommand method in BinaryClient? That way we have the isInMulti and isInWatch flags available to make the proper call to resetState.

The method in BinaryJedis would be something like this:

    @Override
    protected Connection sendCommand(Command cmd) {
        try {
            return super.sendCommand(cmd);
        } catch (JedisConnectionException jce) {
            if (isInMulti || isInWatch) {
                resetState();
            }
            throw jce;
        }
    }

@nykolaslima (Contributor)

@marcosnils I still don't understand why we would want to keep returnBrokenResource.
We can find out whether a Jedis instance is broken using the broken attribute. Because of this, IMO, we can have just returnResource handle both cases:

    public void returnResource(Jedis jedis) {
      if (jedis.isBroken()) {
        // returnBrokenResource logic here
      } else {
        // normal return logic here
      }
    }

I believe we could keep returnBrokenResource as a private method, just to keep the code cleaner. But I think we should not expose it to any other class, because it's not their responsibility to decide between the normal and the broken return paths.

(I don't know if I'm expressing my idea well here; sorry for my bad English.)

What is the reason for keeping returnBrokenResource exposed as public/protected?

@HeartSaVioR (Contributor)

@marcosnils
Unfortunately Jedis doesn't extend Connection; it has a Client, which extends Connection (a has-a relation).
It's on a different layer. ;(

@nykolaslima
I think it's better for JedisPool not to know Jedis deeply, especially its state.
And Jedis.close() can be used widely, whether the instance came from a pool or not.
So I think Jedis.close() is better than JedisPool.returnResource().

If we agree on that, hiding JedisPool's returnXXXResource() from users is the only remaining issue.
Please note that they still need to be visible to Jedis, so we can try package-private.

@marcosnils (Contributor)

@nykolaslima I'm getting dizzy here. I never said we should keep the returnBrokenResource method. Please re-read my last comment: I said we should make those two methods package-protected (as Heart said) and then only use jedis.close() for resource cleanup.

@HeartSaVioR I see what you're saying. Another idea: we could pass a reference to the BinaryJedis into the BinaryClient. That way, whenever the BinaryClient raises a connection exception, we can call resetState from the BinaryClient.

Thoughts?

@nykolaslima (Contributor)

@marcosnils I'm sorry, I misunderstood. Now I see what you're saying. So we all agree that we should make the returnXxxResource methods package-protected.

I'm not sure we need to reset the state when something goes wrong with a Jedis instance. IMO it would be better to just discard that instance, as returnBrokenResource does.

@marcosnils (Contributor)

@nykolaslima what if you're not using JedisPool and you have a simple program that talks to just one Redis instance? In that case, if you get an exception, you need to re-instantiate Jedis to continue working.

Nowadays we just flag the connection as broken and only use that flag for resource cleanup. As @HeartSaVioR said, I like the idea of removing the broken state entirely and allowing a Jedis instance to recover itself by reconnecting. Whenever a user tries to execute a command on a disconnected socket, we can throw an exception until the client reconnects.

What do you think?
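The recover-by-reconnecting idea can be sketched like this (stand-in class again, not real Jedis internals): each command first ensures a live connection and a clean buffer, so a failed instance heals on the next call instead of staying poisoned:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of "no broken state": on an IO failure we disconnect and the
// next command reconnects with a clean stream. SelfHealingClient is a
// hypothetical stand-in, not the Jedis API.
class SelfHealingClient {
    private boolean connected = false;
    private final Deque<String> buffer = new ArrayDeque<>(); // fake socket buffer

    private void ensureConnected() {
        if (!connected) {
            buffer.clear();   // drop any stale, half-read replies
            connected = true; // (re)connect
        }
    }

    void failConnection() { connected = false; } // simulates an IOException

    String get(String key) {
        ensureConnected();
        buffer.addLast("value-of-" + key); // fake request/response round trip
        return buffer.removeFirst();
    }
}
```

The crucial detail is clearing the buffer on reconnect: reconnecting without discarding leftover replies would reintroduce the off-by-one desynchronization discussed earlier in this thread.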

@HeartSaVioR (Contributor)

Hmm, command pattern or cross-reference; neither is really good. ;(
Actually I was also thinking about removing Jedis state (watch, multi, pipeline), but that is not easy either.

This issue is no longer simple, so let's resolve it iteratively:
A. Make JedisPool.returnXXXResource() package-private so that only Jedis.close() can see them.
B. Add a precondition to all operation methods that checks the broken state and throws JedisConnectionException if it's broken.
C. Find the best way to remove the broken state entirely (i.e., recover automatically).

Since we're discussing on a closed issue, how about creating a new issue and migrating our conversation there?

@nykolaslima (Contributor)

I agree with you Heart.

Let's do it by parts and open a new issue to discuss it!

marcosnils added a commit that referenced this issue Mar 9, 2015
To improve usability we're changing the visibility of `returnResource`
and `returnBrokenResource` in favor of `jedis.close()`, which
automatically returns the resource to the pool accordingly.
As `Pool` and `Jedis` are currently in different packages, I've created
a JedisPoolAbstract class to provide a bridge between the two
implementations.
marcosnils added a commit that referenced this issue Mar 9, 2015
@marcosnils (Contributor)

I've opened two PRs to fix the returnXXXResource usage. I'm closing this issue and will open another one to decide what to do about the broken connection state in Jedis.

@nykolaslima (Contributor)

Great @marcosnils 🎉

marcosnils added a commit that referenced this issue Mar 10, 2015
Conflicts:
	src/main/java/redis/clients/jedis/BinaryJedis.java
	src/main/java/redis/clients/jedis/BinaryShardedJedis.java
marcosnils added a commit that referenced this issue Mar 10, 2015
4 participants