
retry Redis connection in case of network errors #2512

Merged
stefanwire merged 7 commits into develop from sb/BM-100/robustredis on Jun 27, 2022
Conversation

@stefanwire (Contributor) commented Jun 23, 2022

The plain hedis client discards the initial connection data and only retains a list of Redis cluster node IPs. When none of these IPs is valid anymore, for instance due to Redis cluster or (Kubernetes) node updates, hedis immediately loses all retained connections to Redis without any option to reconnect. In this patch, we wrap the hedis client and retry connecting with the initially provided connection data in case of network errors. The wrapper also makes sure that

  • if reconnecting fails, it is retried with exponential back-off, and
  • only one thread reconnects while all other threads are blocked and are immediately unblocked as soon as the new connection is established (see the sketch below).
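
A minimal sketch of that blocking behaviour (hypothetical types and names, not the code in this PR), assuming the robust connection is just an MVar around the hedis Connection:

```haskell
import Control.Concurrent.MVar
import qualified Database.Redis as Redis

-- Hypothetical sketch: one thread wins the MVar and reconnects; every other
-- thread blocks on readMVar until the fresh connection is available.
type RobustConnection = MVar Redis.Connection

reconnect :: Redis.ConnectInfo -> RobustConnection -> IO ()
reconnect connInfo ref = do
  stale <- tryTakeMVar ref
  case stale of
    Nothing -> pure () -- another thread is already reconnecting
    Just _ -> do
      conn <- Redis.connectCluster connInfo -- retry/back-off omitted here
      putMVar ref conn -- unblocks all waiting threads

withRobustConnection :: RobustConnection -> (Redis.Connection -> IO a) -> IO a
withRobustConnection ref k = readMVar ref >>= k -- blocks while a reconnect is in flight
```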

Checklist

  • The PR Title explains the impact of the change.
  • The PR description provides context as to why the change should occur and what the code contributes to that effect. This could also be a link to a JIRA ticket or a Github issue, if there is one.
  • changelog.d contains the following bits of information (details):
    • A file with the changelog entry in one or more suitable sub-sections. The sub-sections are marked by directories inside changelog.d.

Log.msg (Log.val $ "starting connection to " <> identifier <> "...")
. Log.field "connectionMode" (show $ endpoint ^. rConnectionMode)
. Log.field "connInfo" (show redisConnInfo)
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)

Contributor Author

To the reviewer: I wonder whether what is currently exponentialBackoff 50000 should be more flexible. I can imagine two ways to add flexibility (see the sketch below):

  • 50000 is a magic number and should perhaps come from environment variables; however, the exact constant may not matter much, since the back-off would always be exponential anyway;
  • exponentialBackoff is just one of many retry policies in the retry package, and perhaps retries should be limited in number or in total time, and/or
    • combined with environment configuration.
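
For illustration (not part of this PR; names and numbers are made up), this is roughly how policies from the retry package compose:

```haskell
import Control.Retry (RetryPolicyM, capDelay, exponentialBackoff, limitRetries)

-- Policies form a monoid: the combination uses the larger delay of the two
-- and gives up as soon as either constituent policy gives up.
redisRetryPolicy :: Int -> Int -> Int -> RetryPolicyM IO
redisRetryPolicy baseMicros capMicros maxAttempts =
  capDelay capMicros (exponentialBackoff baseMicros) <> limitRetries maxAttempts

-- e.g. redisRetryPolicy 50000 1000000 20, where all three numbers could come
-- from environment configuration instead of being hard-coded.
```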

import qualified Database.Redis as Redis
import qualified Gundeck.Aws as Aws
import Gundeck.Options as Opt
import qualified Gundeck.Redis as Redis

Contributor Author

To the reviewer: this line effectively pollutes the Redis namespace in this module; however, I deemed this justifiable, since the namespace is artificial and the functions of both modules logically belong to Redis anyway.

Comment on lines 47 to 48
{ _rrConnection :: Connection, -- established (and potentially breaking) connection to Redis
_rrReconnect :: IO () -- action which can be called to reconnect to Redis

Contributor Author

To the reviewer: do we have a policy or is there some good rule of thumb on when to use bang patterns in record syntax?
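
For reference (a hypothetical example, not a suggestion), a strictness annotation on a record field forces that field when the record is constructed:

```haskell
import Database.Redis (Connection)

data RobustConnection = RobustConnection
  { _rrConnection :: !Connection, -- strict: evaluated when the record is built
    _rrReconnect :: IO () -- left lazy
  }
```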


Member

Unfortunately, we don't have any of those.

forM_ wtbs Async.cancel
Redis.disconnect (e ^. rstate)
Redis.disconnect . (^. Redis.rrConnection) =<< takeMVar (e ^. rstate)
maybe (pure ()) ((=<<) (Redis.disconnect . (^. Redis.rrConnection)) . takeMVar) (e ^. rstateAdditionalWrite)

Contributor Author

To the reviewer: the second Redis connection was never disconnected on service shutdown in the past. I added the disconnect now, so if this is undesirable for some reason, please let me know.

@battermann (Contributor) left a comment

Looks pretty good to me. Since I am not very familiar with Redis, this is just a comment review.

@flokli (Contributor) commented Jun 27, 2022

Are there any plans on how to fix this in Hedis in general, instead of working around this downstream in every application using the library?

@akshaymankar (Member) left a comment

Looks good so far; I will test it in some cluster before approving.


Comment on lines 47 to 48
{ _rrConnection :: Connection, -- established (and potentially breaking) connection to Redis
_rrReconnect :: IO () -- action which can be called to reconnect to Redis

Member

Nit: Make these comments haddocks.
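
For example (field names taken from the diff; the surrounding declaration is assumed), the haddock version could look like:

```haskell
data RobustConnection = RobustConnection
  { -- | Established (and potentially breaking) connection to Redis.
    _rrConnection :: Connection,
    -- | Action which can be called to reconnect to Redis.
    _rrReconnect :: IO ()
  }
```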

Log.msg (Log.val $ "starting connection to " <> identifier <> "...")
. Log.field "connectionMode" (show $ endpoint ^. rConnectionMode)
. Log.field "connInfo" (show redisConnInfo)
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)

Member

Suggested change
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)
let connectWithRetry = Redis.connectRobust l (capDelay 1000000 (exponentialBackoff 50000))

Waiting more than a second before retrying is not very nice, and retrying every second shouldn't cause any problems either.

Comment on lines 77 to 79
-- TODO With ping, we only verify that a single node is running as opposed
-- to verifying that all nodes of the cluster are up and running. It
-- remains unclear how cluster health can be verified in hedis.

Member

Suggested change
-- TODO With ping, we only verify that a single node is running as opposed
-- to verifying that all nodes of the cluster are up and running. It
-- remains unclear how cluster health can be verified in hedis.
-- FUTUREWORK: With ping, we only verify that a single node is running as opposed
-- to verifying that all nodes of the cluster are up and running. It
-- remains unclear how cluster health can be verified in hedis.

I think it is ok if we don't do this now.

@stefanwire (Contributor, Author) commented Jun 27, 2022

> Are there any plans on how to fix this in Hedis in general, instead of working around this downstream in every application using the library?

Not sure whether I would propose merging this code upstream. It's not really a bugfix: the library just assumes that the Redis cluster won't die altogether, and it should work fine if the Redis cluster is migrated node by node. This might be related to how we deploy Redis, though; I don't have access to Kubernetes, so I only have dangerous half-knowledge here.

@flokli (Contributor) commented Jun 27, 2022

> > Are there any plans on how to fix this in Hedis in general, instead of working around this downstream in every application using the library?

> Not sure whether I would propose merging this code upstream. It's not really a bugfix: the library just assumes that the Redis cluster won't die altogether, and it should work fine if the Redis cluster is migrated node by node. This might be related to how we deploy Redis, though; I don't have access to Kubernetes, so I only have dangerous half-knowledge here.

We migrate the cluster node by node: we replace the IPs the DNS name points to one by one. Once that's done, none of the old IPs points to a Redis node anymore.

IMHO, this is a bug in the library: users of the library specify a DNS name pointing to a cluster, and instead of honoring DNS and TTLs, keeping the hostname and re-resolving it (at least on connection errors), the library resolves once and from then on only deals with IPs, losing the connection to the cluster once a migration to new IPs happens.

@akshaymankar (Member) left a comment

It works 🎉
There is a small window where messages just don't go through. But I guess this is what was happening even with the old library when connecting to Redis in master mode.

Comment on lines +114 to +115
[ Handler (\(_ :: ConnectionLostException) -> reconnectRetry robustConnection), -- Redis connection lost during request
Handler (\(_ :: IOException) -> reconnectRetry robustConnection) -- Redis unreachable

Member

Can you also add logs here? Without them it is a little confusing why the lazy connection starts reconnecting. Also, it would be nice to know which types of errors are causing a connection restart.
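
A possible sketch, assuming the module's tinylog Logger (here called l) is in scope; the messages are only illustrative:

```haskell
[ Handler $ \(e :: ConnectionLostException) -> do
    Log.err l $
      Log.msg (Log.val "Redis connection lost during request")
        . Log.field "error" (show e)
    reconnectRetry robustConnection,
  Handler $ \(e :: IOException) -> do
    Log.err l $
      Log.msg (Log.val "Redis unreachable")
        . Log.field "error" (show e)
    reconnectRetry robustConnection
]
```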

