Fix gundeck leak by isovector · Pull Request #3136 · wireapp/wire-server

isovector · 2023-03-07T21:04:24Z

This PR fixes a memory leak in gundeck caused by its reconnection logic when Redis is down. The heap before this PR:

and after:

Each spike here is a series of requests sent to gundeck while redis is down; the resulting taper-off in both charts is when redis comes back online.

It's unclear exactly what is causing this behavior in HEAD, but the original logic is highly suspicious. An MVar is created with the redis connection and is always assumed to be full. Subsequent requests to Redis then catch any exceptions they might see, and trigger a reconnection event, where they spin, attempting to reconnect. Only one thread may spin on this reconnection event, but the others are all blocked on it, and all will be in contention to update the MVar --- ironically, they will all be attempting to update it to the same value. My assumption is the old, bad behavior is caused by new threads attempting to talk to Redis, even though the application knows the connection has been lost, and then jumping in to do pointless work. Due to the MVar contention, all threads will keep the expensive Connection object on the stack attempting to update the value.

Furthermore, the problem is exacerbated by nothing pinging Redis, so by the time we discover it's down, we already have work we'd like to accomplish.

This PR instead creates a new thread that is responsible for maintaining the connection to Redis. It pings Redis every second, and if that ever fails, it immediately empties the MVar Connection, so that other requests to Redis block until it has been refilled. This new thread then spins, attempting to reconnect to Redis, and fills the MVar as soon as Redis is back online, which unblocks the other threads. This makes it extremely clear who is responsible for what, and prevents the weird recursive retry loops and MVar update races.
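The shape of that design can be sketched with nothing but `base`. This is a minimal illustration, not the code in this PR: `connectWithRetry`, `superviseConnection`, `startSupervisor`, `withConnection` and `demo` are hypothetical names, the "connection" is a plain counter rather than a hedis `Connection`, failures are modelled as `IOException` only, and the ping interval is a parameter (the PR uses one second).

```haskell
import Control.Concurrent (ThreadId, forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, isEmptyMVar, newMVar, putMVar, readMVar, takeMVar)
import Control.Exception (IOException, throwIO, try)
import Control.Monad (forever, unless)
import Data.IORef (atomicModifyIORef', newIORef, readIORef, writeIORef)

-- | Run an action, turning an IOException into a Left.
tryIO :: IO a -> IO (Either IOException a)
tryIO = try

-- | Keep calling the (hypothetical) connect action until it succeeds.
connectWithRetry :: Int -> IO conn -> IO conn
connectWithRetry delayMicros connectAction = do
  result <- tryIO connectAction
  case result of
    Right conn -> pure conn
    Left _ -> threadDelay delayMicros >> connectWithRetry delayMicros connectAction

-- | The only thread that ever empties or refills the MVar.
superviseConnection :: Int -> IO conn -> (conn -> IO Bool) -> MVar conn -> IO ()
superviseConnection intervalMicros connectAction pingAction var = forever $ do
  threadDelay intervalMicros
  conn <- readMVar var
  alive <- either (const False) id <$> tryIO (pingAction conn)
  unless alive $ do
    _ <- takeMVar var -- from here on, readers block instead of using a dead connection
    fresh <- connectWithRetry intervalMicros connectAction
    putMVar var fresh -- wakes up every blocked reader

-- | Connect once, then fork the supervisor. The ThreadId is returned so the
-- caller can stop the thread on shutdown.
startSupervisor :: Int -> IO conn -> (conn -> IO Bool) -> IO (MVar conn, ThreadId)
startSupervisor intervalMicros connectAction pingAction = do
  var <- newMVar =<< connectWithRetry intervalMicros connectAction
  tid <- forkIO (superviseConnection intervalMicros connectAction pingAction var)
  pure (var, tid)

-- | Request threads only ever read; they never reconnect themselves.
withConnection :: MVar conn -> (conn -> IO a) -> IO a
withConnection var action = readMVar var >>= action

-- | Simulated outage: connections are numbered, and "up" is a flag we flip.
-- Returns the connection seen before the outage, whether the MVar was empty
-- during the outage, and the connection seen after recovery.
demo :: IO (Int, Bool, Int)
demo = do
  up <- newIORef True
  counter <- newIORef (0 :: Int)
  let connectAction = do
        isUp <- readIORef up
        unless isUp (throwIO (userError "connection refused"))
        atomicModifyIORef' counter (\n -> (n + 1, n + 1))
      pingAction _ = readIORef up
  (var, _) <- startSupervisor 10000 connectAction pingAction
  before <- withConnection var pure
  writeIORef up False
  threadDelay 100000
  blocked <- isEmptyMVar var
  writeIORef up True
  threadDelay 100000
  after <- withConnection var pure
  pure (before, blocked, after)
```

Because only the supervisor calls `takeMVar` and `putMVar`, there is a single producer for the `MVar`, and request threads using `readMVar` simply wait while it is empty.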

Additionally, you'll notice a ~140MB allocation spike at the beginning of each new reconnection in the new logic. This is caused by insane stupidity in hedis' implementation that I found while diagnosing this problem. It wouldn't be very much work to fix if we're at all concerned about it.

My debugging trials and tribulations are documented here if anyone is interested in all the things that didn't help in diagnosing this problem :)

Checklist

Add a new entry in an appropriate subdirectory of changelog.d
Read and follow the PR guidelines

mdimjasevic · 2023-03-08T12:43:37Z

@isovector, I've pushed an empty commit to trigger the CI pipelines. I hope you don't mind.

isovector · 2023-03-08T17:02:23Z

@mdimjasevic not at all, thanks for being on top of it

isovector · 2023-03-08T23:24:21Z

Looks like the redis thread isn't gracefully shutting down.
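For context, one common way to make sure a background thread is stopped when its owner exits is to tie its lifetime to a `bracket`. The sketch below is an assumption about the general technique, not a description of what the later "cleanup redis threads" commit does; `withBackgroundThread` and `demoCleanup` are hypothetical names.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Exception (bracket)
import Control.Monad (forever)
import Data.IORef (modifyIORef', newIORef, readIORef)

-- | Run 'body' while 'worker' runs in the background; the worker is killed
-- when 'body' returns or throws.
withBackgroundThread :: IO () -> IO a -> IO a
withBackgroundThread worker body =
  bracket (forkIO worker) killThread (const body)

-- | Checks that the worker ticked while the body ran and stopped afterwards.
demoCleanup :: IO Bool
demoCleanup = do
  ticks <- newIORef (0 :: Int)
  let worker = forever (modifyIORef' ticks (+ 1) >> threadDelay 1000)
  withBackgroundThread worker (threadDelay 20000)
  atExit <- readIORef ticks
  threadDelay 20000
  later <- readIORef ticks
  pure (atExit > 0 && later == atExit)
```

`killThread` only returns once the exception has been raised in the target thread, so no further ticks are recorded after the bracket exits.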

stefanwire · 2023-03-09T14:38:19Z

services/gundeck/src/Gundeck/Redis.hs

-      conn <- connectLowLevel
-      Log.info l $ Log.msg (Log.val "successfully connected to Redis")
-
-      reconnectOnce <- once . retry $ reconnectRedis robustConnection -- avoid concurrent attempts to reconnect


Essentially, you're saying that once was not working? If once was working, there should never have been a race for updating the MVar. Or rather, the MVar would be swapped several times, which could have been optimised by putting swapMVar below once. Or is there more to it that I haven't grasped in this case?

What's bugging me is that I don't see the memory leak. After all pending Redis actions went through the (admittedly redundant) swapMVar, no memory should be retained. Though, I do see that it seems fixed by this PR.

isovector · Mar 9, 2023 (reply)

I also don't see it, which is why I spent 16 hours digging through heap profiles before I tried changing the code 😊 . My best guess is that once isn't safe in a concurrent environment. It's implemented (indirectly) via modifyMVar, which comes with this warning:

This function is only atomic if there are no other producers for this MVar. In other words, it cannot guarantee that, by the time modifyMVar_ gets the chance to write to the MVar, the value of the MVar has not been altered by a write operation from another thread.

which makes me think once is indeed the culprit. But I don't know!

The other difference between the two versions of the logic is that HEAD will still call runRedis in runRobust on a known-bad connection, and it could be that some of the janky pipelining in hedis is responsible for leaking the stacks here.
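To make the quoted modifyMVar_ warning concrete, here is a small deterministic sketch using only `base`. It does not show that once is the culprit; it only illustrates what "another producer" can do to a modifyMVar_ in flight. `rogueProducerDemo` and the helper MVars are hypothetical names used purely to force the interleaving.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (modifyMVar_, newEmptyMVar, newMVar, putMVar, takeMVar)

-- | Returns the two values observed in the shared MVar, in order.
rogueProducerDemo :: IO (Int, Int)
rogueProducerDemo = do
  shared <- newMVar (0 :: Int)
  insideModify <- newEmptyMVar
  mayFinish <- newEmptyMVar
  finished <- newEmptyMVar
  _ <- forkIO $ do
    modifyMVar_ shared $ \old -> do
      putMVar insideModify () -- 'shared' is empty while this callback runs
      takeMVar mayFinish
      pure (old + 1)
    putMVar finished ()
  takeMVar insideModify
  putMVar shared 100 -- a second producer: succeeds because 'shared' is empty
  putMVar mayFinish ()
  first <- takeMVar shared -- the rogue write is visible first
  takeMVar finished -- modifyMVar_ could only write back after the take above
  second <- takeMVar shared -- computed from the stale 0, not from 100
  pure (first, second)
```

The function returns `(100, 1)`: the write of 100 slips in while modifyMVar_ holds the old value, and the update that follows is computed from the stale 0 rather than from 100.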

isovector · 2023-03-09T16:18:06Z

@jschaul asked me:

could you run this commented-out test: https://github.com/wireapp/wire-server/blob/develop/services/gundeck/test/integration/API.hs#L142-L149 (which loops forever); kill redis, and re-start redis, and ensure that user websocket registration works again (i.e. that the connection to redis is re-established correctly?)

The results are here:

Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
[gundeck@A] E, request=N/A, Redis connection failed, error=ConnectionLost
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 10, clusterInfoResponseStatsMessagesReceived = 10, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 5, clusterInfoResponseStatsMessagesPingReceived = Just 5, clusterInfoResponseStatsMessagesPongSent = Just 5, clusterInfoResponseStatsMessagesPongReceived = Just 5, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 12, clusterInfoResponseStatsMessagesReceived = 12, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 6, clusterInfoResponseStatsMessagesPingReceived = Just 6, clusterInfoResponseStatsMessagesPongSent = Just 6, clusterInfoResponseStatsMessagesPongReceived = Just 6, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
[gundeck@A] E, request=N/A, Redis connection failed, error=ConnectionLost
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, network error when connecting to Redis, error=Network.Socket.connect: <socket: 11>: does not exist (Connection refused)
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 7, clusterInfoResponseStatsMessagesReceived = 7, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 4, clusterInfoResponseStatsMessagesPingReceived = Just 3, clusterInfoResponseStatsMessagesPongSent = Just 3, clusterInfoResponseStatsMessagesPongReceived = Just 4, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
[gundeck@A] E, Redis cluster down, error=ClusterDownError (ClusterInfoResponse {clusterInfoResponseState = Down, clusterInfoResponseSlotsAssigned = 16384, clusterInfoResponseSlotsOK = 16384, clusterInfoResponseSlotsPfail = 0, clusterInfoResponseSlotsFail = 0, clusterInfoResponseKnownNodes = 6, clusterInfoResponseSize = 1, clusterInfoResponseCurrentEpoch = 6, clusterInfoResponseMyEpoch = 1, clusterInfoResponseStatsMessagesSent = 11, clusterInfoResponseStatsMessagesReceived = 11, clusterInfoResponseTotalLinksBufferLimitExceeded = 0, clusterInfoResponseStatsMessagesPingSent = Just 6, clusterInfoResponseStatsMessagesPingReceived = Just 5, clusterInfoResponseStatsMessagesPongSent = Just 5, clusterInfoResponseStatsMessagesPongReceived = Just 6, clusterInfoResponseStatsMessagesMeetSent = Nothing, clusterInfoResponseStatsMessagesMeetReceived = Nothing, clusterInfoResponseStatsMessagesFailSent = Nothing, clusterInfoResponseStatsMessagesFailReceived = Nothing, clusterInfoResponseStatsMessagesPublishSent = Nothing, clusterInfoResponseStatsMessagesPublishReceived = Nothing, clusterInfoResponseStatsMessagesAuthReqSent = Nothing, clusterInfoResponseStatsMessagesAuthReqReceived = Nothing, clusterInfoResponseStatsMessagesAuthAckSent = Nothing, clusterInfoResponseStatsMessagesAuthAckReceived = Nothing, clusterInfoResponseStatsMessagesUpdateSent = Nothing, clusterInfoResponseStatsMessagesUpdateReceived = Nothing, clusterInfoResponseStatsMessagesMfstartSent = Nothing, clusterInfoResponseStatsMessagesMfstartReceived = Nothing, clusterInfoResponseStatsMessagesModuleSent = Nothing, clusterInfoResponseStatsMessagesModuleReceived = Nothing, clusterInfoResponseStatsMessagesPublishshardSent = Nothing, clusterInfoResponseStatsMessagesPublishshardReceived = Nothing})
Ran Test
Running Test
Ran Test
Running Test
Ran Test
Running Test
^CFailure
Ran Test
[cannon@A] I, request=N/A, draining all websockets, numberOfConns=0, computedBatchSize=0, minBatchSize=100, batchSize=100, maxNumberOfBatches=200
[cannon@A] I, request=N/A, Draining complete
[cannon@A] cannon: SignalledToExit
Running Test

which responds with Failure only when I ^C to end the test.

isovector · 2023-03-09T16:18:51Z

@jschaul @stefanwire this is ready to merge, but I don't have the necessary bits.

jschaul · 2023-03-13T12:05:29Z

Thank you, looks fine to me, as expected.

jschaul · 2023-03-13T12:14:00Z

I ran one more manual test on our staging environment.

keep two browsers open with a websocket connection on web
restart the redis stateful set
keep sending messages from one account to the other for 2-3 minutes, the time it takes for redis to restart all pods
compare messages in the two browsers. Did end-user message loss occur?

Current behaviour (on staging):

And after deploying this patch:

Message loss for end users, on web (error 207) occurs right now when redis is restarted; and it also occurs after this patch.

Possibly messages are not delivered at all, or they are delivered out of order; in both cases they end up not being decryptable/visible to the end user.

So we still have an operational problem here with redis & hedis & gundeck & cannon.

As this patch does not seem to significantly make this problem worse (in the end, things recover; even if intermediate messages are lost), and it may fix a memory leak, let's still get this improvement in.

The retry logic here, or the general architecture around websockets, needs some improvement at some point, though.

isovector added 4 commits March 7, 2023 12:45

fix: memory leak in Gundeck

7b20bc7

chore: format

4c2d060

fix: keep old retry logic in runRobust

5baea96

doc: changelog

ae93c93

mdimjasevic added the ok-to-test Approved for running tests in CI, overrides not-ok-to-test if both labels exist label Mar 8, 2023

Hi CI

941e139

Marko Dimjašević and others added 2 commits March 8, 2023 14:35

chore: format

e37ecca

chore: hlint

4a8c4ef

chore: format again

fcc9b8e

isovector added 2 commits March 8, 2023 15:32

fix: cleanup redis threads

fa9fce1

empty push

e6143ea

jschaul requested review from akshaymankar and stefanwire March 9, 2023 12:18

Hi CI

16c3c46

stefanwire approved these changes Mar 9, 2023

jschaul merged commit c97b712 into wireapp:develop Mar 13, 2023

battermann pushed a commit that referenced this pull request Mar 15, 2023

Fix gundeck leak (#3136)

55ccf1c

zebot mentioned this pull request Apr 17, 2023

Release 2023-04-17 - (expected chart version 4.35.0) #3230

Merged

lepsa pushed a commit to lepsa/wire-server that referenced this pull request Nov 28, 2023

Fix gundeck leak (wireapp#3136)

a3aa41c

jschaul merged 11 commits into wireapp:develop from isovector:fix-gundeck-leak
