retry Redis connection in case of network errors #2512
Conversation
services/gundeck/src/Gundeck/Env.hs
Outdated
Log.msg (Log.val $ "starting connection to " <> identifier <> "...")
  . Log.field "connectionMode" (show $ endpoint ^. rConnectionMode)
  . Log.field "connInfo" (show redisConnInfo)
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)
To the reviewer: I wonder whether what is currently `exponentialBackoff 50000` should be more flexible. I can imagine two ways to add flexibility (a sketch combining them follows below):

- `50000` is a magic number and should perhaps come from the environment variables; however, the choice of constant may not matter much, since, more importantly, the back-off would always be exponential.
- `exponentialBackoff` is just one of many retry policies in the retry package, and perhaps retries should be limited in their number or by the time they require, and/or this could be combined with environment configuration.
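For illustration, policies from the retry package compose via their Semigroup instance, so both concerns can be addressed at once. A minimal sketch, assuming the base delay and attempt limit would come from configuration (the function and parameter names below are made up):

```haskell
import Control.Retry (RetryPolicyM, capDelay, exponentialBackoff, limitRetries)

-- Exponential back-off starting at a configurable base delay (in microseconds),
-- waiting at most 1 s between attempts and giving up after a configurable
-- number of attempts. (<>) picks the longer delay of the two policies and
-- stops as soon as either of them gives up.
redisRetryPolicy :: Int -> Int -> RetryPolicyM IO
redisRetryPolicy baseDelayMicros maxAttempts =
  capDelay 1000000 (exponentialBackoff baseDelayMicros)
    <> limitRetries maxAttempts
```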
import qualified Database.Redis as Redis
import qualified Gundeck.Aws as Aws
import Gundeck.Options as Opt
import qualified Gundeck.Redis as Redis
To the reviewer: this line effectively pollutes the Redis package namespace in this module; however, I deem this justifiable, since the namespace is artificial and the functions of both modules logically belong to Redis anyway.
{ _rrConnection :: Connection, -- established (and potentially breaking) connection to Redis
  _rrReconnect :: IO () -- action which can be called to reconnect to Redis
To the reviewer: do we have a policy or is there some good rule of thumb on when to use bang patterns in record syntax?
Unfortunately, we don't have any of those.
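For reference, making the fields strict only takes a bang on the field's type and needs no language extension; whether that is warranted here remains a judgment call. A small sketch (the enclosing type name is illustrative; the fields are the ones from this diff):

```haskell
import Database.Redis (Connection)

data RobustConnection = RobustConnection
  { -- strict: the connection is forced as soon as the record is constructed
    _rrConnection :: !Connection,
    -- lazy: an IO action to re-establish the connection, nothing to force eagerly
    _rrReconnect :: IO ()
  }
```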
services/gundeck/src/Gundeck/Run.hs
Outdated
forM_ wtbs Async.cancel
Redis.disconnect (e ^. rstate)
Redis.disconnect . (^. Redis.rrConnection) =<< takeMVar (e ^. rstate)
maybe (pure ()) ((=<<) (Redis.disconnect . (^. Redis.rrConnection)) . takeMVar) (e ^. rstateAdditionalWrite)
To the reviewer: the second Redis connection was never disconnected upon service shutdown in the past. I added disconnecting it now; if this is undesirable for some reason, please let me know.
battermann left a comment
Looks pretty good to me. Because I am not so familiar with Redis, this is just a comment review.

Are there any plans to fix this in Hedis in general, instead of working around it downstream in every application using the library?
akshaymankar left a comment
Looks good so far, I will test it in some cluster before approving.
{ _rrConnection :: Connection, -- established (and potentially breaking) connection to Redis
  _rrReconnect :: IO () -- action which can be called to reconnect to Redis
Nit: Make these comments haddocks.
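I.e., roughly this (using leading `-- |` haddock comments; the enclosing type name here is illustrative):

```haskell
import Database.Redis (Connection)

data RobustConnection = RobustConnection
  { -- | Established (and potentially breaking) connection to Redis.
    _rrConnection :: Connection,
    -- | Action which can be called to reconnect to Redis.
    _rrReconnect :: IO ()
  }
```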
services/gundeck/src/Gundeck/Env.hs
Outdated
Log.msg (Log.val $ "starting connection to " <> identifier <> "...")
  . Log.field "connectionMode" (show $ endpoint ^. rConnectionMode)
  . Log.field "connInfo" (show redisConnInfo)
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)
Suggested change:
-  let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)
+  let connectWithRetry = Redis.connectRobust l (capDelay 1000000 (exponentialBackoff 50000))
Waiting more than a second to retry is not very nice and retrying every second shouldn't cause any problems either.
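Concretely, that policy waits 50 ms, 100 ms, 200 ms, 400 ms, 800 ms, and then a flat 1 s before every further attempt. A quick way to double-check this (a sketch using the retry package's policy simulator, assuming I remember its interface correctly):

```haskell
import Control.Retry (capDelay, exponentialBackoff, simulatePolicy)

-- Print the delay (in microseconds) chosen before each of the first 8 retries.
main :: IO ()
main =
  mapM_ print
    =<< simulatePolicy 8 (capDelay 1000000 (exponentialBackoff 50000))
```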
-- TODO With ping, we only verify that a single node is running as opposed
-- to verifying that all nodes of the cluster are up and running. It
-- remains unclear how cluster health can be verified in hedis.
Suggested change:
-  -- TODO With ping, we only verify that a single node is running as opposed
+  -- FUTUREWORK: With ping, we only verify that a single node is running as opposed
   -- to verifying that all nodes of the cluster are up and running. It
   -- remains unclear how cluster health can be verified in hedis.
I think it is ok if we don't do this now.
Not sure whether I would propose merging this code upstream. It's not really a bugfix: the library just assumes that the Redis cluster won't die altogether, and it should work fine if the Redis cluster is migrated node by node. This might be connected to the way we deploy Redis, though; I don't have access to Kubernetes and thus only have dangerous half-knowledge.
We migrate the cluster node by node: we replace the IPs the DNS name points to one by one, and once that's done, none of the old IPs points to a Redis node anymore. IMHO, this is a bug in the library. Users of the library specify a DNS name pointing to a cluster, and instead of honoring DNS and TTLs, keeping the hostname and re-resolving it (at least in the case of connection errors), the library resolves once and then only deals with IPs, losing the connection to the cluster once a migration to new IPs happens.
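The workaround here therefore hangs on to the original ConnectInfo (which still carries the hostname) and rebuilds the connection from it whenever the established one dies, so the name is re-resolved on every reconnect. A much simplified sketch of that idea, not the exact code in this PR (which also publishes the fresh connection through an MVar and supports cluster mode):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception (IOException)
import Control.Monad.Catch (Handler (..))
import Control.Retry (RetryPolicyM, recovering)
import Database.Redis (ConnectInfo, Connection, ConnectionLostException, checkedConnect)

-- Re-resolve the hostname in ConnectInfo and reconnect whenever establishing
-- the connection fails, following the given retry policy.
connectWithRetry :: RetryPolicyM IO -> ConnectInfo -> IO Connection
connectWithRetry policy connInfo =
  recovering
    policy
    [ const (Handler (\(_ :: ConnectionLostException) -> pure True)),
      const (Handler (\(_ :: IOException) -> pure True))
    ]
    (const (checkedConnect connInfo)) -- master-mode connect; connectCluster would be the cluster-mode analogue
```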
akshaymankar left a comment
It works 🎉
There is a small window where messages just don't go through. But I guess this is what was happening even with the old library when connecting to Redis in master mode.
[ Handler (\(_ :: ConnectionLostException) -> reconnectRetry robustConnection), -- Redis connection lost during request
  Handler (\(_ :: IOException) -> reconnectRetry robustConnection) -- Redis unreachable
Can you also add logs here? Without them it is a little confusing why the lazy connection starts restarting. Also, it would be nice to know which types of errors are causing a connection restart.
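Something along these lines might do (a sketch only: the logger `l` and the exact messages are placeholders; `reconnectRetry` and `robustConnection` are the names from the diff above):

```haskell
[ Handler $ \(e :: ConnectionLostException) -> do
    -- Redis connection lost during request
    Log.err l $
      Log.msg (Log.val "Redis connection lost, trying to reconnect")
        . Log.field "error" (show e)
    reconnectRetry robustConnection,
  Handler $ \(e :: IOException) -> do
    -- Redis unreachable
    Log.err l $
      Log.msg (Log.val "Redis unreachable, trying to reconnect")
        . Log.field "error" (show e)
    reconnectRetry robustConnection
]
```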
The plain `hedis` client discards the initial connection data and only retains a list of Redis cluster node IPs. When none of these IPs is valid anymore, for instance due to Redis cluster or (Kubernetes) node updates, `hedis` immediately loses all retained connections to Redis without any option to reconnect. In this patch, we wrap the `hedis` client and retry connecting with the initially provided connection data in case of network errors. Also, the wrapper makes sure that

Checklist

changelog.d.