
If ZK has an old master, redis_failover will use it even though the redis instance does not exist #55

Open
petelacey opened this issue May 16, 2013 · 2 comments

Comments

@petelacey

This issue may be particular to cloud environments where IP addresses are not very stable.

Somehow we managed to get into a state where ZK retained a no-longer-in-service IP for the Redis master. That is, we shut down a Redis node and brought up a whole new one, and ZK never noticed, because that node also runs the redis_node_manager.

At startup, when the node manager gets this old IP from ZK, it asks that Redis instance whether it is still the master. In our case, this request times out. When that happens, redis_failover simply logs a warning and continues to use the old IP as master, which causes all subsequent client usage to fail. I believe this is the code in question

It would probably be better to:

  1. Check to see if IPs returned from ZK match the list of nodes provided at startup of the node manager, and expire from ZK the ones that don't. And/or....
  2. Return nil instead of master if the find_existing_master method rescues a NodeUnavailableError.
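Option 2 could be sketched as below. This is a minimal illustration, not the actual redis_failover source: `Node`, `NodeUnavailableError`, and `find_existing_master` here are simplified stand-ins for the gem's classes and method of the same names.

```ruby
# Sketch: treat an unreachable recorded master as "no existing master"
# instead of trusting the stale ZooKeeper entry.

class NodeUnavailableError < StandardError; end

# Simplified stand-in for redis_failover's Node.
class Node
  attr_reader :host, :port

  def initialize(host, port, reachable: true)
    @host, @port, @reachable = host, port, reachable
  end

  # Stand-in for asking the node whether it is still the master;
  # an unreachable node raises, like a timed-out Redis connection.
  def master?
    raise NodeUnavailableError, "#{host}:#{port}" unless @reachable
    true
  end
end

# Stand-in for NodeManager#find_existing_master: read the master recorded
# in ZK and verify it is actually reachable before reusing it.
def find_existing_master(recorded_master)
  return nil unless recorded_master
  recorded_master.master? ? recorded_master : nil
rescue NodeUnavailableError
  # The recorded master can't be reached; return nil so the manager
  # performs a fresh election instead of reusing the stale address.
  nil
end

stale = Node.new('10.0.0.1', 6379, reachable: false)
live  = Node.new('10.0.0.2', 6379)

find_existing_master(stale) # => nil, would trigger a re-election
find_existing_master(live)  # => the live node
```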
@ryanlecompte
Owner

Thanks for reporting this, @petelacey. This is the first time I've seen this issue, so my guess is that most folks are using redis_failover in an environment where the ip addresses are more stable than yours. I don't have the time to provide a fix for it now, unfortunately. If you'd like, feel free to provide a fix and test it in your environment, then submit a pull request. Thanks!

@arohter
Contributor

arohter commented Jan 9, 2014

We looked into this (and ran into it ourselves), and determined that the current behavior is the safe and sane approach. We run on AWS, so we are familiar with cloudy environments.

  1. You ran into this issue because you are running only a single node manager. You need more than one manager to maintain healthy monitoring and quorum in a clustered env. We run a manager on each redis node.

  2. Node state stored in zk is considered the word of god; it's the only trusted canonical source of truth, especially when running multiple node managers. From the point of view of a single manager, there's no way to automatically determine whether a failed master connection attempt is due to bad/old config, or simply a network partition. We could incur data loss if we reconfigure the cluster based on just one snapshot, since slaveof commands drop all existing data.

  3. You can resolve this cluster "deadlock" by simply triggering a manual failover. We lose some automation by making humans pick the proper master in these edge cases, but downtime is preferable to wholesale data loss.
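The quorum reasoning in point 1 can be sketched as follows. This is an illustration of majority voting, not redis_failover's actual implementation (the gem tallies node-manager reports via ZooKeeper); `majority_unavailable?` and the manager names are hypothetical.

```ruby
# Sketch: a node is only declared failed when more than half of the
# node managers report it unavailable. A single manager's timeout could
# just be a network partition, so it must not trigger failover alone.
def majority_unavailable?(reports)
  down = reports.values.count(:unavailable)
  down > reports.size / 2
end

# One manager seeing a timeout is not enough evidence:
majority_unavailable?('nm1' => :unavailable,
                      'nm2' => :available,
                      'nm3' => :available)   # => false, no failover

# Most managers agree the master is gone:
majority_unavailable?('nm1' => :unavailable,
                      'nm2' => :unavailable,
                      'nm3' => :available)   # => true, safe to fail over
```

For the manual failover mentioned in point 3, the redis_failover README documents a `manual_failover` method on `RedisFailover::Client`, which lets a human pick the new master explicitly in these edge cases.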
