This is a post-mortem post of an outage over the weekend. This issue has been mitigated for now; there are a number of outstanding performance issues to address before this can be closed.
Summary of the outage
All times GMT.
Nov 19 / 17:40 : Prometheus graphs show a consistent spike in requests to and from IRC being delayed (>10s to be passed through the bridge).
Nov 19 / 18:40 : Prometheus graphs show a spike in request failures to and from IRC.
Nov 19 / 19:48 : The IRC bridge is restarted. It begins to spin on restarts.
Nov 19 / 20:15 : The IRC bridge is restarted with membership list syncing disabled.
Nov 21 / 10:20 : The IRC bridge is restarted with membership list syncing enabled again.
Initial mitigations
The restart at 19:48 was done by @leonerd, who noticed that it then started to spin:
<LeoNerd> It's getting stuck
<LeoNerd> 'main' terminated and then it restarted. it seems to be spinning
<LeoNerd> 2016-11-19 20:07:54 ERROR:main FATAL EXCEPTION
2016-11-19 20:07:54 ERROR:main Error: toString failed
at Buffer.toString (buffer.js:448:11)
at BufferList.toString (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/node_modules/bl/bl.js:178:33)
at Request.<anonymous> (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/request.js:1023:32)
at emitOne (events.js:77:13)
at Request.emit (events.js:169:7)
at IncomingMessage.<anonymous> (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/request.js:969:12)
at emitNone (events.js:72:20)
at IncomingMessage.emit (events.js:166:7)
at endReadableNT (_stream_readable.js:921:12)
at nextTickCallbackWith2Args (node.js:442:9)
at process._tickCallback (node.js:356:17)
2016-11-19 20:07:54 ERROR:main Terminating (exitcode=1)
<LeoNerd> It's done that 4 times in a row now and just keeps doing it
I stepped in to put a plaster on it by blindly applying a try/catch block to the bl library.
<kegan> have blindly wrapped it in a try catch
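For reference, the plaster amounted to something like this (a minimal sketch, not the exact patch; it assumes that swallowing the oversized-buffer error and handing callers an empty body is acceptable as a stop-gap):

```js
// Hypothetical shape of the stop-gap: wrap bl's toString so an oversized body
// no longer brings the whole process down.
const BufferList = require('bl');

const originalToString = BufferList.prototype.toString;
BufferList.prototype.toString = function (encoding, start, end) {
  try {
    return originalToString.call(this, encoding, start, end);
  } catch (err) {
    // Buffers over V8's string limit land here instead of killing the process.
    return '';
  }
};
```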
That caused another crash in matrix-appservice-bridge's ClientFactory: the body was missing on an HTTP 200 response, and the code assumed the body would never be missing. I monkey-patched this by checking for a 200 with a missing body and failing the request (sketched after the log below).
<LeoNerd> 2016-11-19 20:22:35 ERROR:main TypeError: Cannot read property 'substring' of undefined
<LeoNerd> died again
<kegan> got that now as well
<kegan> and logging the request uri
<kegan> I think it is /initialSync
<kegan> in which case fuck.
<kegan> because we can't sync membership lists then.
<kegan> yup its initial sync
<kegan> so the request lib is dying trying to parse the response
<kegan> but the response should just be a timeout apache error initially
<kegan> so idk why it is struggling with that.
<kegan> it may eventually work
<kegan> but at least now it will stop spinning
<kegan> it's getting a 200 response from something
<kegan> usually it gets gateway timeout 50x
<kegan> right. It isn't spinning and restarting constantly now.
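The guard amounted to something like this (a sketch of the intent only; names are illustrative and the real change lived inside matrix-appservice-bridge's ClientFactory):

```js
// Fail the request outright rather than let later code call
// body.substring() on undefined.
function checkClientResponse(response, body) {
  if (response.statusCode === 200 && body === undefined) {
    // A 200 with no usable body: treat it as a failed request, not a success.
    throw new Error('200 OK but response body is missing; failing the request');
  }
  return body;
}
```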
Investigation
I started with the original stack trace and investigated why it was failing. The error message itself, toString failed, was very unhelpful.
<kegan> the original fail is on this line in the depths of node: https://github.com/nodejs/node/blob/v4.6.1/lib/buffer.js#L448
<kegan> that node function is being called with an encoding arg
<kegan> which means https://github.com/nodejs/node/blob/v4.6.1/lib/buffer.js#L395 is executed
<kegan> logs I added when I patched this logged the encoding, which was `undefined`
<kegan> which means it defaults to utf8
<kegan> I'm hoping that the tcpdump will show me what bytes caused this
I also checked the load balancer logs to see if the request really was being 200 OKed, and it was. That's when I figured out what the probable cause was:
<kegan> It DID return a 200!
<kegan> > 0/0/0/4637/39415 200 276467133
<kegan> !calc 276467133 bytes in MB
Wolfram Alpha: 276.5 MB (megabytes) (≈ 0.41 × standard compact disc capacity ( 650 to 703 MB ))
<kegan> given the size of this response, I can imagine we're making the `request` lib fall over when it tries to parse it
<kegan> this would explain why we never saw it before, it was under the magic number (which 276MB looks suspiciously close to 256MB...)
<kegan> lo and behold: https://github.com/request/request/issues/2067
<kegan> https://github.com/nodejs/node/issues/3175
<kegan> > 256 MB is the limit for strings in V8
It turns out that we were hitting the size limit for strings in V8: request dumps the HTTP response into a Buffer and then stringifies it to pass it to JSON.parse, which is why toString was failing.
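The failure can be reproduced in isolation with a buffer just over the limit (a sketch; the threshold depends on the Node/V8 version, so the size may need increasing on newer releases):

```js
// ~256MB was the limit on the Node 4.x we were running; newer Node allows
// larger strings, so 277MB may not trigger it there.
const big = Buffer.alloc(277 * 1024 * 1024, 0x61); // 277MB, just over the Node 4.x limit

try {
  // request stringifies the buffered body before JSON.parse; over the limit,
  // toString throws "Error: toString failed" and the response is lost.
  JSON.parse(big.toString('utf8'));
} catch (err) {
  console.error(err.message);
}
```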
Fixes
The IRC bridge uses matrix-appservice-bridge, and specifically AppServiceBot.getMemberLists(), which does a v1/initialSync to populate the member lists on startup. That response was now too large. I swapped over to r0/sync instead and applied a filter to reduce the response size.
Nov 21 10:49
<kegan> right, freenode bridge seems to be alive now
<kegan> using /sync and an aggressive filter I got the response down to 190MB
<kegan> it's a huge ticking time bomb though
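The filter itself isn't shown above, but it was along these lines (a sketch; the field choices and homeserver URL are illustrative, keeping only membership state and dropping timelines, presence and ephemeral events):

```js
// An aggressive /sync filter: we only need m.room.member state to build
// membership lists, so strip everything else from the response.
const filter = {
  presence: { not_types: ['*'] },
  room: {
    timeline: { limit: 0 },
    ephemeral: { not_types: ['*'] },
    account_data: { not_types: ['*'] },
    state: { types: ['m.room.member'] }
  }
};

const url = 'https://synapse.example.org/_matrix/client/r0/sync' +
  '?filter=' + encodeURIComponent(JSON.stringify(filter));
```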
However, using /sync caused another problem: the request was taking so long that intermediate servers timed out before the entire response could be sent to the bridge. This was mitigated by hitting Synapse directly, bypassing those intermediate servers.
Outstanding issues
This outage has highlighted several issues with /sync which need to be resolved. They are as follows:
Takes too long to populate the cache on the first request. This causes requests to /sync to Gateway Timeout until the caches are populated. (Tracked as: Use the new membership list APIs instead of /sync, #308)
Takes too long to post-process from cache -> response bytes. This causes requests to /sync to Gateway Timeout even if the caches are populated. This was not the case with /initialSync, which did very little post-processing. (Tracked as: Return early on /sync code paths if a '*' filter is used, synapse#1640)
Does not honour all filter values (notably event_fields), so the response is much larger than it needs to be. (Tracked as: Implement "event_fields" in filters, synapse#1638)
I wonder whether a different API shape would be better for use cases like bridges?
Rather than try to get all rooms' state at once, you'd perform a lightweight "just tell me what rooms I'm in" query (e.g. matrix-org/matrix-spec-proposals#734), followed up by a per-room state sync. Overall we'd still hit the same amount of event storage and quantity of data, but split across many queries rather than just one, so it should be a lot lighter in terms of server overheads, string parsing, and so on.
/sync is slow (#288) and unreliable (#303) when the response is large. Use dedicated membership list APIs instead. Currently, this PR is just a literal translation from /sync to the new APIs. To make full use of them, we need to change when we begin processing the results, as we can now do so incrementally.
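To make the incremental idea concrete, the per-room flow might look something like this (a sketch assuming endpoints of the shape proposed in matrix-org/matrix-spec-proposals#734, i.e. a room-list query followed by per-room member queries; the URL, token handling and helper names are illustrative, not the bridge's actual code):

```js
const request = require('request');

const HOMESERVER = 'https://synapse.example.org'; // assumed: hitting Synapse directly
const ACCESS_TOKEN = process.env.AS_TOKEN;

function get(path) {
  return new Promise((resolve, reject) => {
    request({
      url: HOMESERVER + '/_matrix/client/r0' + path,
      qs: { access_token: ACCESS_TOKEN },
      json: true
    }, (err, res, body) => (err ? reject(err) : resolve(body)));
  });
}

// Process each room's members as soon as its (small) response arrives, instead
// of waiting for one multi-hundred-MB payload to be buffered and stringified.
async function syncMemberLists(onRoomMembers) {
  const { joined_rooms } = await get('/joined_rooms');
  for (const roomId of joined_rooms) {
    const { joined } = await get('/rooms/' + encodeURIComponent(roomId) + '/joined_members');
    onRoomMembers(roomId, Object.keys(joined));
  }
}
```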