
retry Redis connection in case of network errors #2512

Merged
stefanwire merged 7 commits into develop from sb/BM-100/robustredis on Jun 27, 2022
Conversation

@stefanwire (Contributor) commented Jun 23, 2022

The plain hedis client discards the initial connection data and only retains a list of Redis cluster node IPs. When none of these IPs is valid anymore, for instance due to Redis cluster or (Kubernetes) node updates, hedis immediately loses all retained connections to Redis without any option to reconnect. In this patch, we wrap the hedis client and retry connecting with the initially provided connection data in case of network errors. The wrapper also makes sure that

  • if reconnecting fails, it is retried with exponential back-off, and
  • only one thread reconnects while all other threads are blocked and are immediately unblocked as soon as the new connection is established (see the sketch below).
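
A minimal sketch of that blocking behaviour (hypothetical types and names, not the code in this PR), assuming the robust connection is just an MVar around the hedis Connection:

```haskell
import Control.Concurrent.MVar
import qualified Database.Redis as Redis

-- Hypothetical sketch: one thread wins the MVar and reconnects; every other
-- thread blocks on readMVar until the fresh connection is available.
type RobustConnection = MVar Redis.Connection

reconnect :: Redis.ConnectInfo -> RobustConnection -> IO ()
reconnect connInfo ref = do
  stale <- tryTakeMVar ref
  case stale of
    Nothing -> pure () -- another thread is already reconnecting
    Just _ -> do
      conn <- Redis.connectCluster connInfo -- retry/back-off omitted here
      putMVar ref conn -- unblocks all waiting threads

withRobustConnection :: RobustConnection -> (Redis.Connection -> IO a) -> IO a
withRobustConnection ref k = readMVar ref >>= k -- blocks while a reconnect is in flight
```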

Checklist

  • The PR Title explains the impact of the change.
  • The PR description provides context as to why the change should occur and what the code contributes to that effect. This could also be a link to a JIRA ticket or a Github issue, if there is one.
  • changelog.d contains the following bits of information (details):
    • A file with the changelog entry in one or more suitable sub-sections. The sub-sections are marked by directories inside changelog.d.

Log.msg (Log.val $ "starting connection to " <> identifier <> "...")
. Log.field "connectionMode" (show $ endpoint ^. rConnectionMode)
. Log.field "connInfo" (show redisConnInfo)
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)

Contributor Author

To the reviewer: I wonder whether what is currently exponentialBackoff 50000 should be more flexible. I can imagine two ways to add flexibility (see the sketch below):

  • 50000 is a magic number and should perhaps come from environment variables; however, the exact constant may not matter much, since the back-off would always be exponential anyway;
  • exponentialBackoff is just one of many retry policies in the retry package, and perhaps retries should be limited in number or in total time, and/or
    • combined with environment configuration.
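
For illustration (not part of this PR; names and numbers are made up), this is roughly how policies from the retry package compose:

```haskell
import Control.Retry (RetryPolicyM, capDelay, exponentialBackoff, limitRetries)

-- Policies form a monoid: the combination uses the larger delay of the two
-- and gives up as soon as either constituent policy gives up.
redisRetryPolicy :: Int -> Int -> Int -> RetryPolicyM IO
redisRetryPolicy baseMicros capMicros maxAttempts =
  capDelay capMicros (exponentialBackoff baseMicros) <> limitRetries maxAttempts

-- e.g. redisRetryPolicy 50000 1000000 20, where all three numbers could come
-- from environment configuration instead of being hard-coded.
```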

import qualified Database.Redis as Redis
import qualified Gundeck.Aws as Aws
import Gundeck.Options as Opt
import qualified Gundeck.Redis as Redis

Contributor Author

To the reviewer: this line effectively pollutes the Redis namespace in this module; however, I deemed this justifiable, since the namespace is artificial and the functions of both modules logically belong to Redis anyway.

Comment on lines 47 to 48
{ _rrConnection :: Connection, -- established (and potentially breaking) connection to Redis
_rrReconnect :: IO () -- action which can be called to reconnect to Redis

Contributor Author

To the reviewer: do we have a policy or is there some good rule of thumb on when to use bang patterns in record syntax?
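
For reference (a hypothetical example, not a suggestion), a strictness annotation on a record field forces that field when the record is constructed:

```haskell
import Database.Redis (Connection)

data RobustConnection = RobustConnection
  { _rrConnection :: !Connection, -- strict: evaluated when the record is built
    _rrReconnect :: IO () -- left lazy
  }
```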


Member

Unfortunately, we don't have any of those.

forM_ wtbs Async.cancel
Redis.disconnect (e ^. rstate)
Redis.disconnect . (^. Redis.rrConnection) =<< takeMVar (e ^. rstate)
maybe (pure ()) ((=<<) (Redis.disconnect . (^. Redis.rrConnection)) . takeMVar) (e ^. rstateAdditionalWrite)

Contributor Author

To the reviewer: the second Redis connection was never disconnected on service shutdown in the past. I added the disconnect now, so if this is undesirable for some reason, please let me know.

@battermann (Contributor) left a comment

Looks pretty good to me. Since I am not very familiar with Redis, this is just a comment review.

@flokli (Contributor) commented Jun 27, 2022

Are there any plans on how to fix this in Hedis in general, instead of working around this downstream in every application using the library?

@akshaymankar (Member) left a comment

Looks good so far; I will test it in some cluster before approving.


Comment on lines 47 to 48
{ _rrConnection :: Connection, -- established (and potentially breaking) connection to Redis
_rrReconnect :: IO () -- action which can be called to reconnect to Redis

Member

Nit: Make these comments haddocks.
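
For example (field names taken from the diff; the surrounding declaration is assumed), the haddock version could look like:

```haskell
data RobustConnection = RobustConnection
  { -- | Established (and potentially breaking) connection to Redis.
    _rrConnection :: Connection,
    -- | Action which can be called to reconnect to Redis.
    _rrReconnect :: IO ()
  }
```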

Log.msg (Log.val $ "starting connection to " <> identifier <> "...")
. Log.field "connectionMode" (show $ endpoint ^. rConnectionMode)
. Log.field "connInfo" (show redisConnInfo)
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)

Member

Suggested change
let connectWithRetry = Redis.connectRobust l (exponentialBackoff 50000)
let connectWithRetry = Redis.connectRobust l (capDelay 1000000 (exponentialBackoff 50000))

Waiting more than a second before retrying is not very nice, and retrying every second shouldn't cause any problems either.

Comment on lines 77 to 79
-- TODO With ping, we only verify that a single node is running as opposed
-- to verifying that all nodes of the cluster are up and running. It
-- remains unclear how cluster health can be verified in hedis.

Member

Suggested change
-- TODO With ping, we only verify that a single node is running as opposed
-- to verifying that all nodes of the cluster are up and running. It
-- remains unclear how cluster health can be verified in hedis.
-- FUTUREWORK: With ping, we only verify that a single node is running as opposed
-- to verifying that all nodes of the cluster are up and running. It
-- remains unclear how cluster health can be verified in hedis.

I think it is ok if we don't do this now.

@stefanwire (Contributor, Author) commented Jun 27, 2022

> Are there any plans on how to fix this in Hedis in general, instead of working around this downstream in every application using the library?

Not sure whether I would propose merging this code upstream. It's not really a bugfix: the library just assumes that the Redis cluster won't die altogether, and it should work fine if the Redis cluster is migrated node by node. This might be related to how we deploy Redis, though; I don't have access to Kubernetes, so I only have dangerous half-knowledge here.

@flokli (Contributor) commented Jun 27, 2022

> > Are there any plans on how to fix this in Hedis in general, instead of working around this downstream in every application using the library?

> Not sure whether I would propose merging this code upstream. It's not really a bugfix: the library just assumes that the Redis cluster won't die altogether, and it should work fine if the Redis cluster is migrated node by node. This might be related to how we deploy Redis, though; I don't have access to Kubernetes, so I only have dangerous half-knowledge here.

We migrate the cluster node by node: we replace the IPs the DNS name points to one by one. Once that's done, none of the old IPs points to a Redis node anymore.

IMHO, this is a bug in the library: users of the library specify a DNS name pointing to a cluster, and instead of honoring DNS and TTLs, keeping the hostname and re-resolving it (at least on connection errors), the library resolves once and from then on only deals with IPs, losing the connection to the cluster once a migration to new IPs happens.

@akshaymankar (Member) left a comment

It works 🎉
There is a small window where messages just don't go through. But I guess this is what was happening even with the old library when connecting to Redis in master mode.

Comment on lines +114 to +115
[ Handler (\(_ :: ConnectionLostException) -> reconnectRetry robustConnection), -- Redis connection lost during request
Handler (\(_ :: IOException) -> reconnectRetry robustConnection) -- Redis unreachable

Member

Can you also add logs here? Without them it is a little confusing why the lazy connection starts reconnecting. Also, it would be nice to know which types of errors are causing a connection restart.
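
A possible sketch, assuming the module's tinylog Logger (here called l) is in scope; the messages are only illustrative:

```haskell
[ Handler $ \(e :: ConnectionLostException) -> do
    Log.err l $
      Log.msg (Log.val "Redis connection lost during request")
        . Log.field "error" (show e)
    reconnectRetry robustConnection,
  Handler $ \(e :: IOException) -> do
    Log.err l $
      Log.msg (Log.val "Redis unreachable")
        . Log.field "error" (show e)
    reconnectRetry robustConnection
]
```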

