Freenode bridge outage (19-21 Nov) #288

Closed

kegsay opened this issue Nov 21, 2016 · 2 comments
kegsay commented Nov 21, 2016

This is a post-mortem of an outage over the weekend. The issue has been mitigated for now; there are a number of outstanding performance issues to address before it can be closed.


Summary of the outage

All times GMT.

Nov 19 / 17:40 : Prometheus graphs show a consistent spike in requests to and from IRC being delayed (>10s to be passed through the bridge).
Nov 19 / 18:40 : Prometheus graphs show a spike in request failures to and from IRC.
Nov 19 / 19:48 : The IRC bridge is restarted. It begins to spin on restarts.
Nov 19 / 20:15 : The IRC bridge is restarted with membership list syncing disabled.
Nov 21 / 10:20 : The IRC bridge is restarted with membership list syncing enabled again.

Initial mitigations

The restart at 19:48 was done by @leonerd, who noticed that the bridge then started to spin on restart:

<LeoNerd> It's getting stuck
<LeoNerd> 'main' terminated and then it restarted. it seems to be spinning
<LeoNerd> 2016-11-19 20:07:54 ERROR:main FATAL EXCEPTION
2016-11-19 20:07:54 ERROR:main Error: toString failed
    at Buffer.toString (buffer.js:448:11)
    at BufferList.toString (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/node_modules/bl/bl.js:178:33)
    at Request.<anonymous> (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/request.js:1023:32)
    at emitOne (events.js:77:13)
    at Request.emit (events.js:169:7)
    at IncomingMessage.<anonymous> (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/request.js:969:12)
    at emitNone (events.js:72:20)
    at IncomingMessage.emit (events.js:166:7)
    at endReadableNT (_stream_readable.js:921:12)
    at nextTickCallbackWith2Args (node.js:442:9)
    at process._tickCallback (node.js:356:17)
2016-11-19 20:07:54 ERROR:main Terminating (exitcode=1)
<LeoNerd> It's done that 4 times in a row now and just keeps doing it

I stepped in to put a plaster on it by blindly wrapping the failing call in the bl library in a try/catch block (roughly as sketched below).

<kegan> have blindly wrapped it in a try catch
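
For reference, the stop-gap was roughly the following: a sketch of the shape of the patch to bl's toString(), not the exact diff that was applied.

```js
// node_modules/request/node_modules/bl/bl.js -- sketch of the plaster.
// bl's toString() slices its internal buffer list and stringifies it;
// wrapping that call means an oversized body yields null instead of
// taking down the whole bridge process.
BufferList.prototype.toString = function toString (encoding, start, end) {
  try {
    return this.slice(start, end).toString(encoding)
  } catch (err) {
    // e.g. "toString failed" when the concatenated buffer is too large
    // to fit in a single V8 string
    return null
  }
}
```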

That caused another crash in matrix-appservice-bridge's ClientFactory: the body was missing on an HTTP 200, and the code assumed the body would never be missing. I monkey-patched this by checking for a 200 with a missing body and failing the request (see the sketch after the log below).

<LeoNerd> 2016-11-19 20:22:35 ERROR:main TypeError: Cannot read property 'substring' of undefined
<LeoNerd> died again
<kegan> got that now as well
<kegan> and logging the request uri
<kegan> I think it is /initialSync
<kegan> in which case fuck.
<kegan> because we can't sync membership lists then.
<kegan> yup its initial sync
<kegan> so the request lib is dying trying to parse the response
<kegan> but the response should just be a timeout apache error initially
<kegan> so idk why it is struggling with that.
<kegan> it may eventually work
<kegan> but at least now it will stop spinning
<kegan> it's getting a 200 response from something
<kegan> usually it gets gateway timeout 50x
<kegan> right. It isn't spinning and restarting constantly now.
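
The monkey-patch amounted to something like the sketch below. guardResponse() is a hypothetical helper shown for illustration; the real change lived in matrix-appservice-bridge's response handling.

```js
// Sketch of the second stop-gap: treat a 200 with no parsed body as a
// failed request rather than letting later code call body.substring()
// on undefined.
function guardResponse(response, body) {
  if (response.statusCode === 200 && body === undefined) {
    // The body is missing here because the patched toString() above
    // returned null for the oversized response.
    throw new Error("200 OK received but the response body is missing");
  }
  return body;
}
```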

Investigation

I started with the original stack trace and investigated why it was failing. The error message itself, toString failed, was very unhelpful.

<kegan> the original fail is on this line in the depths of node: https://github.com/nodejs/node/blob/v4.6.1/lib/buffer.js#L448
<kegan> that node function is being called with an encoding arg
<kegan> which means https://github.com/nodejs/node/blob/v4.6.1/lib/buffer.js#L395 is executed
<kegan> logs I added when I patched this logged the encoding, which was `undefined`
<kegan> which means it defaults to utf8
<kegan> I'm hoping that the tcpdump will show me what bytes caused this

I also checked the load balancer logs to see if the request really was being 200 OKed, and it was. That's when I figured out what the probable cause was:

<kegan> It DID return a 200!
<kegan> > 0/0/0/4637/39415 200 276467133
<kegan> !calc 276467133 bytes in MB
Wolfram Alpha: 276.5 MB  (megabytes) (≈ 0.41 × standard compact disc capacity ( 650 to 703 MB ))
<kegan> given the size of this response, I can imagine we're making the `request` lib fall over when it tries to parse it
<kegan> this would explain why we never saw it before, it was under the magic number (which 276MB looks suspiciously close to 256MB...)
<kegan> lo and behold: https://github.com/request/request/issues/2067
<kegan> https://github.com/nodejs/node/issues/3175
<kegan> > 256 MB is the limit for strings in V8

It turns out that we were hitting the size limit for strings in V8. request dumps the HTTP response into a Buffer and then stringifies it to pass to JSON.parse, which is why toString was failing.
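
The failure can be reproduced in isolation with something like the sketch below. This assumes Node 4.x, whose V8 capped strings at roughly 256MB; newer V8 versions raise the limit, so the same call may succeed there.

```js
// Allocate a buffer larger than the string cap, then stringify it --
// which is exactly what `request` does to the response body before
// handing it to JSON.parse.
const big = Buffer.alloc(300 * 1024 * 1024, 0x61); // ~300MB of 'a'
try {
  // encoding was logged as `undefined` in the bridge, so it defaults to utf8
  const str = big.toString('utf8');
  console.log('stringified OK, length:', str.length);
} catch (err) {
  console.log('stringify failed:', err.message); // "toString failed" on Node 4
}
```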

Fixes

The IRC bridge uses matrix-appservice-bridge and specifically AppServiceBot.getMemberLists(), which does a v1/initialSync to populate the member lists on startup. This response was now too large. I swapped over to using r0/sync instead and applied a filter to reduce the response size (the general shape of the filter is sketched after the log below).

Nov 21 10:49
<kegan> right, freenode bridge seems to be alive now
<kegan> using /sync and an aggressive filter I got the response down to 190MB
<kegan> it's a huge ticking time bomb though
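
For illustration, the filter was along these lines; this is the general shape (keep membership state, drop timelines, presence and ephemeral events), not the exact filter the bridge shipped with.

```js
// Illustrative /sync filter -- assumed shape, not the bridge's actual one.
const filter = {
  room: {
    timeline: { limit: 0 },               // no timeline events at all
    state: { types: ["m.room.member"] },  // membership events only
    ephemeral: { types: [] },             // no typing/read receipts
  },
  presence: { types: [] },                // no presence events
  account_data: { types: [] },            // no account data
};
// GET /_matrix/client/r0/sync?filter=<URL-encoded JSON above>
```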

However, using /sync caused another problem: the request was taking too long, meaning we were getting timeouts from intermediate servers before the entire response could be sent to the bridge. This was mitigated by hitting Synapse directly, bypassing these intermediate servers.

Outstanding issues

This outage has highlighted several issues with /sync which need to be resolved. They are as follows:

kegsay added the bug and matrix.org-support labels Nov 21, 2016
leonerd commented Nov 21, 2016

I wonder whether a different API shape would be better for use cases like bridges?

Rather than try to get all rooms' state at once, you'd perform a lightweight "just tell me what rooms I'm in" query (e.g. matrix-org/matrix-spec-proposals#734), followed up by a per-room state sync. Overall we'd still hit the same amount of event storage and quantity of data, but split across many queries rather than just one, so it should be a lot lighter in terms of server overheads, string parsing, and so on.
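
Sketched out, that shape would look something like the following. The endpoint paths and the client.get helper are illustrative only, not an existing API here.

```js
// Hypothetical shape of the proposed approach: one cheap "which rooms
// am I in" query, then one small membership request per room, instead
// of a single enormous /sync response.
async function syncMembershipLists(client) {
  const { joined_rooms } = await client.get("/joined_rooms");
  for (const roomId of joined_rooms) {
    const { joined } = await client.get(
      `/rooms/${encodeURIComponent(roomId)}/joined_members`
    );
    processRoomMembers(roomId, joined); // bridge-side handling, illustrative
  }
}
```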

kegsay commented Dec 12, 2016

matrix-org/synapse#1680 is what we're going to do.

When this is implemented, this issue can be closed.

kegsay added a commit that referenced this issue Dec 12, 2016
/sync is slow (#288) and unreliable (#303) when the response is large. Use
dedicated membership list APIs instead.

Currently, this PR is just a literal translation of the new APIs from /sync.
To make full use of this API, we need to change when we begin processing the
results, as we can now do so incrementally.
kegsay closed this as completed Dec 20, 2016