Skip to content

Fix gundeck leak#3136

Merged
jschaul merged 11 commits intowireapp:developfrom
isovector:fix-gundeck-leak
Mar 13, 2023
Merged

Fix gundeck leak#3136
jschaul merged 11 commits intowireapp:developfrom
isovector:fix-gundeck-leak

Conversation

@isovector
Copy link
Contributor

@isovector isovector commented Mar 7, 2023

This PR fixes a memory leak in gundeck caused by its reconnection logic when Redis is down. The heap before this PR:

image

and after:

image

Each spike here is a series of requests sent to gundeck while redis is down; the resulting taper-off in both charts is when redis comes back online.

It's unclear exactly what is causing this behavior in HEAD, but the original logic is highly suspicious. An MVar is created with the redis connection and is always assumed to be full. Subsequent requests to Redis then catch any exceptions they might see, and trigger a reconnection event, where they spin, attempting to reconnect. Only one thread may spin on this reconnection event, but the others are all blocked on it, and all will be in contention to update the MVar --- ironically, they will all be attempting to update it to the same value. My assumption is the old, bad behavior is caused by new threads attempting to talk to Redis, even though the application knows the connection has been lost, and then jumping in to do pointless work. Due to the MVar contention, all threads will keep the expensive Connection object on the stack attempting to update the value.

Furthermore, the problem is exacerbated by nothing pining redis, so by the time we discover it's down, we already have work we'd like to accomplish.

This PR instead creates a new thread who is responsible for maintaining the connection to redis. It pings redis every second, and if that ever fails, it immediately empties the MVar Connection, so that other requests to Redis will block until it has been refilled. This new thread will then spin, attempting to reconnect to redis, and will fill the MVar as soon as it's back online, which will unblock the other threads. This makes it extremely clear who is responsible for what, and prevents the weird recursive retry loops and MVar update races.


Additionally, you'll notice a ~140MB allocation spike at the beginning of each new reconnection in the new logic. This is a caused by insane stupidity in hedis' implementation that I found while diagnosing this problem. It wouldn't be very much work to fix if we're at all concerned about it.


My debugging trials and tribulations are documented here if anyone is interested in all the things that didn't help in diagnosing this problem :)

Checklist

  • Add a new entry in an appropriate subdirectory of changelog.d
  • Read and follow the PR guidelines

@mdimjasevic mdimjasevic added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Mar 8, 2023
@mdimjasevic
Copy link
Contributor

@isovector , I've pushed an empty commit to trigger the CI pipelines. I hope you don't mind.

Marko Dimjašević and others added 2 commits March 8, 2023 14:35
@isovector
Copy link
Contributor Author

@mdimjasevic not at all, thanks for being on top of it

@isovector
Copy link
Contributor Author

Looks like the redis thread isn't gracefully shutting down.

conn <- connectLowLevel
Log.info l $ Log.msg (Log.val "successfully connected to Redis")

reconnectOnce <- once . retry $ reconnectRedis robustConnection -- avoid concurrent attempts to reconnect
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially you say, once was not working? If once was working, there should never have been a race for updating the MVar. Or rather, the MVar would be swapped several times which could have been optimised by putting swapMVar below once. Or is there more to it which I haven't grasped in this case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's bugging me is that I don't see the memory leak. After all pending Redis actions went through the (admittedly redundant) swapMVar, no memory should be retained. Though, I do see that it seems fixed by this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't see it, which is why I spent 16 hours digging through heap profiles before I tried changing the code 😊 . My best guess is that once isn't safe in a concurrent environment. It's implemented (indirectly) via modifyMVar, which comes with this warning:

This function is only atomic if there are no other producers for this MVar. In other words, it cannot guarantee that, by the time modifyMVar_ gets the chance to write to the MVar, the value of the MVar has not been altered by a write operation from another thread.

which makes me think once is indeed the culprit. But I don't know!

The other difference between the two logics is that HEAD will still call runRedis in runRobust on a known-bad connection, and it could be some of the janky pipelining in hedis is responsible for leaking the stacks here.

@isovector
Copy link
Contributor Author

@jschaul asked me:

could you run this commented-out test: https://github.com/wireapp/wire-server/blob/develop/services/gundeck/test/integration/API.hs#L142-L149 (which loops forever); kill redis, and re-start redis, and ensure that user websocket registration works again (i.e. that the connection to redis is re-established correctly?)

The results are here:

Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
[gundeck@A] E, request=N/A, Redis connection failed, error=ConnectionLost
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 10, clusterInfoResponseStatsMessagesReceived = 10, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 5, clusterInfoResponseStatsMessagesPingReceived = Just 5, clusterInfoResponseStatsMessagesPongSent = Just 5, clusterInfoResponseStatsMessagesPongReceived = Just 5, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 12, clusterInfoResponseStatsMessagesReceived = 12, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 6, clusterInfoResponseStatsMessagesPingReceived = Just 6, clusterInfoResponseStatsMessagesPongSent = Just 6, clusterInfoResponseStatsMessagesPongReceived = Just 6, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
[gundeck@A] E, request=N/A, Redis connection failed, error=ConnectionLost
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 7, clusterInfoResponseStatsMessagesReceived = 7, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 4, clusterInfoResponseStatsMessagesPingReceived = Just 3, clusterInfoResponseStatsMessagesPongSent = Just 3, clusterInfoResponseStatsMessagesPongReceived = Just 4, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 11, clusterInfoResponseStatsMessagesReceived = 11, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 6, clusterInfoResponseStatsMessagesPingReceived = Just 5, clusterInfoResponseStatsMessagesPongSent = Just 5, clusterInfoResponseStatsMessagesPongReceived = Just 6, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
^CFailure
Ran Test
[cannon@A] I, request=N/A, draining all websockets, numberOfConns=0, computedBatchSize=0, minBatchSize=100, batchSize=100, maxNumberOfBatches=200
[cannon@A] I, request=N/A, Draining complete
[cannon@A] cannon: SignalledToExit
Running Test

which responds with Failure only when I ^C to end the test.

@isovector
Copy link
Contributor Author

@jschaul @stefanwire this is ready to merge, but I don't have the necessary bits.

@jschaul
Copy link
Member

jschaul commented Mar 13, 2023

@jschaul asked me:

could you run this commented-out test: https://github.com/wireapp/wire-server/blob/develop/services/gundeck/test/integration/API.hs#L142-L149 (which loops forever); kill redis, and re-start redis, and ensure that user websocket registration works again (i.e. that the connection to redis is re-established correctly?)

The results are here:

Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
[gundeck@A] E, request=N/A, Redis connection failed, error=ConnectionLost
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 10, clusterInfoResponseStatsMessagesReceived = 10, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 5, clusterInfoResponseStatsMessagesPingReceived = Just 5, clusterInfoResponseStatsMessagesPongSent = Just 5, clusterInfoResponseStatsMessagesPongReceived = Just 5, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 12, clusterInfoResponseStatsMessagesReceived = 12, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 6, clusterInfoResponseStatsMessagesPingReceived = Just 6, clusterInfoResponseStatsMessagesPongSent = Just 6, clusterInfoResponseStatsMessagesPongReceived = Just 6, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
[gundeck@A] E, request=N/A, Redis connection failed, error=ConnectionLost
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 7, clusterInfoResponseStatsMessagesReceived = 7, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 4, clusterInfoResponseStatsMessagesPingReceived = Just 3, clusterInfoResponseStatsMessagesPongSent = Just 3, clusterInfoResponseStatsMessagesPongReceived = Just 4, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 11, clusterInfoResponseStatsMessagesReceived = 11, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 6, clusterInfoResponseStatsMessagesPingReceived = Just 5, clusterInfoResponseStatsMessagesPongSent = Just 5, clusterInfoResponseStatsMessagesPongReceived = Just 6, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
^CFailure
Ran Test
[cannon@A] I, request=N/A, draining all websockets, numberOfConns=0, computedBatchSize=0, minBatchSize=100, batchSize=100, maxNumberOfBatches=200
[cannon@A] I, request=N/A, Draining complete
[cannon@A] cannon: SignalledToExit
Running Test

which responds with Failure only when I ^C to end the test.

Thank you, looks fine to me, as expected.

@jschaul
Copy link
Member

jschaul commented Mar 13, 2023

I ran one more manual test on our staging environment.

  • keep two browsers open with a websocket connection on web
  • restart the redis stateful set
  • keep sending messages from one account to the other for 2-3 minutes, the time it takes for redis to restart all pods
  • compare messages in the two browsers. Did end-user message loss occurr?

Current behaviour (on staging):

message-loss-before

And after deploying this patch:

message-loss

Message loss for end users, on web (error 207) occurs right now when redis is restarted; and it also occurs after this patch.

Possibly messages are not delivered at all; or they are delivered out-of-order, in both cases they end up not being decryptable/visible to the end user.

So we still have an operational problem here with redis & hedis & gundeck & cannon.

As this patch does not seem to significantly make this problem worse (in the end, things recover; even if intermediate messages are lost), and it may fix a memory leak, let's still get this improvement in.

The retry logic here; or the general architecture about websockets needs some improvement though at some point.

@jschaul jschaul merged commit c97b712 into wireapp:develop Mar 13, 2023
battermann pushed a commit that referenced this pull request Mar 15, 2023
lepsa pushed a commit to lepsa/wire-server that referenced this pull request Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Comments