This is a post-mortem post of an outage over the weekend. This issue has been mitigated for now; there are a number of outstanding performance issues to address before this can be closed.
Summary of the outage
All times GMT.
Nov 19 / 17:40 : Prometheus graphs show a consistent spike in requests to and from IRC being delayed (>10s to be passed through the bridge).
Nov 19 / 18:40 : Prometheus graphs show a spike in request failures to and from IRC.
Nov 19 / 19:48 : The IRC bridge is restarted. It begins to spin on restarts.
Nov 19 / 20:15 : The IRC bridge is restarted with membership list syncing disabled.
Nov 21 / 10:20 : The IRC bridge is restarted with membership list syncing enabled again.
Initial mitigations
The restart at 19:48 was done by @leonerd, who noticed that it then started to spin:
<LeoNerd> It's getting stuck
<LeoNerd> 'main' terminated and then it restarted. it seems to be spinning
<LeoNerd> 2016-11-19 20:07:54 ERROR:main FATAL EXCEPTION
2016-11-19 20:07:54 ERROR:main Error: toString failed
at Buffer.toString (buffer.js:448:11)
at BufferList.toString (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/node_modules/bl/bl.js:178:33)
at Request.<anonymous> (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/request.js:1023:32)
at emitOne (events.js:77:13)
at Request.emit (events.js:169:7)
at IncomingMessage.<anonymous> (/mnt/disk/home/ircas/matrix-org/freenode/matrix-appservice-irc/node_modules/request/request.js:969:12)
at emitNone (events.js:72:20)
at IncomingMessage.emit (events.js:166:7)
at endReadableNT (_stream_readable.js:921:12)
at nextTickCallbackWith2Args (node.js:442:9)
at process._tickCallback (node.js:356:17)
2016-11-19 20:07:54 ERROR:main Terminating (exitcode=1)
<LeoNerd> It's done that 4 times in a row now and just keeps doing it
I stepped in to put a plaster on it by blindly applying a try/catch block to the bl library.
<kegan> have blindly wrapped it in a try catch
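For reference, the plaster amounted to something like this (a minimal sketch, not the exact patch; it assumes that swallowing the oversized-buffer error and handing callers an empty body is acceptable as a stop-gap):

```js
// Hypothetical shape of the stop-gap: wrap bl's toString so an oversized body
// no longer brings the whole process down.
const BufferList = require('bl');

const originalToString = BufferList.prototype.toString;
BufferList.prototype.toString = function (encoding, start, end) {
  try {
    return originalToString.call(this, encoding, start, end);
  } catch (err) {
    // Buffers over V8's string limit land here instead of killing the process.
    return '';
  }
};
```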
That caused another crash in matrix-appservice-bridge's ClientFactory: the body was missing on an HTTP 200 response, and the code assumed the body would never be missing. I monkey-patched this by checking for a 200 with a missing body and failing the request (sketched after the log below).
<LeoNerd> 2016-11-19 20:22:35 ERROR:main TypeError: Cannot read property 'substring' of undefined
<LeoNerd> died again
<kegan> got that now as well
<kegan> and logging the request uri
<kegan> I think it is /initialSync
<kegan> in which case fuck.
<kegan> because we can't sync membership lists then.
<kegan> yup its initial sync
<kegan> so the request lib is dying trying to parse the response
<kegan> but the response should just be a timeout apache error initially
<kegan> so idk why it is struggling with that.
<kegan> it may eventually work
<kegan> but at least now it will stop spinning
<kegan> it's getting a 200 response from something
<kegan> usually it gets gateway timeout 50x
<kegan> right. It isn't spinning and restarting constantly now.
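The guard amounted to something like this (a sketch of the intent only; names are illustrative and the real change lived inside matrix-appservice-bridge's ClientFactory):

```js
// Fail the request outright rather than let later code call
// body.substring() on undefined.
function checkClientResponse(response, body) {
  if (response.statusCode === 200 && body === undefined) {
    // A 200 with no usable body: treat it as a failed request, not a success.
    throw new Error('200 OK but response body is missing; failing the request');
  }
  return body;
}
```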
Investigation
I started with the original stack trace and investigated why it was failing. The error message itself, toString failed, was very unhelpful.
<kegan> the original fail is on this line in the depths of node: https://github.com/nodejs/node/blob/v4.6.1/lib/buffer.js#L448
<kegan> that node function is being called with an encoding arg
<kegan> which means https://github.com/nodejs/node/blob/v4.6.1/lib/buffer.js#L395 is executed
<kegan> logs I added when I patched this logged the encoding, which was `undefined`
<kegan> which means it defaults to utf8
<kegan> I'm hoping that the tcpdump will show me what bytes caused this
I also checked the load balancer logs to see if the request really was being 200 OKed, and it was. That's when I figured out what the probable cause was:
<kegan> It DID return a 200!
<kegan> > 0/0/0/4637/39415 200 276467133
<kegan> !calc 276467133 bytes in MB
Wolfram Alpha: 276.5 MB (megabytes) (≈ 0.41 × standard compact disc capacity ( 650 to 703 MB ))
<kegan> given the size of this response, I can imagine we're making the `request` lib fall over when it tries to parse it
<kegan> this would explain why we never saw it before, it was under the magic number (which 276MB looks suspiciously close to 256MB...)
<kegan> lo and behold: https://github.com/request/request/issues/2067
<kegan> https://github.com/nodejs/node/issues/3175
<kegan> > 256 MB is the limit for strings in V8
It turns out that we were hitting the size limit for strings in V8: request dumps the HTTP response into a Buffer and then stringifies it to pass it to JSON.parse, which is why toString was failing.
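The failure can be reproduced in isolation with a buffer just over the limit (a sketch; the threshold depends on the Node/V8 version, so the size may need increasing on newer releases):

```js
// ~256MB was the limit on the Node 4.x we were running; newer Node allows
// larger strings, so 277MB may not trigger it there.
const big = Buffer.alloc(277 * 1024 * 1024, 0x61); // 277MB, just over the Node 4.x limit

try {
  // request stringifies the buffered body before JSON.parse; over the limit,
  // toString throws "Error: toString failed" and the response is lost.
  JSON.parse(big.toString('utf8'));
} catch (err) {
  console.error(err.message);
}
```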
Fixes
The IRC bridge uses matrix-appservice-bridge, and specifically AppServiceBot.getMemberLists(), which does a v1/initialSync to populate the member lists on startup. That response was now too large. I swapped over to r0/sync instead and applied a filter to reduce the response size.
Nov 21 10:49
<kegan> right, freenode bridge seems to be alive now
<kegan> using /sync and an aggressive filter I got the response down to 190MB
<kegan> it's a huge ticking time bomb though
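The filter itself isn't shown above, but it was along these lines (a sketch; the field choices and homeserver URL are illustrative, keeping only membership state and dropping timelines, presence and ephemeral events):

```js
// An aggressive /sync filter: we only need m.room.member state to build
// membership lists, so strip everything else from the response.
const filter = {
  presence: { not_types: ['*'] },
  room: {
    timeline: { limit: 0 },
    ephemeral: { not_types: ['*'] },
    account_data: { not_types: ['*'] },
    state: { types: ['m.room.member'] }
  }
};

const url = 'https://synapse.example.org/_matrix/client/r0/sync' +
  '?filter=' + encodeURIComponent(JSON.stringify(filter));
```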
However, using /sync caused another problem: the request was taking so long that intermediate servers timed out before the entire response could be sent to the bridge. This was mitigated by hitting Synapse directly, bypassing those intermediate servers.
Outstanding issues
This outage has highlighted several issues with /sync which need to be resolved. They are as follows:
Takes too long to populate the cache on the first request. This causes requests to /sync to Gateway Timeout until the caches are populated. (Tracked as: Use the new membership list APIs instead of /sync, #308)
Takes too long to post-process from cache -> response bytes. This causes requests to /sync to Gateway Timeout even if the caches are populated. This was not the case with /initialSync, which did very little post-processing. (Tracked as: Return early on /sync code paths if a '*' filter is used, synapse#1640)
Does not honour all filter values (notably event_fields), so the response is much larger than it needs to be. (Tracked as: Implement "event_fields" in filters, synapse#1638)
I wonder whether a different API shape would be better for use cases like bridges?
Rather than try to get all rooms' state at once, you'd perform a lightweight "just tell me what rooms I'm in" query (e.g. matrix-org/matrix-spec-proposals#734), followed up by a per-room state sync. Overall we'd still hit the same amount of event storage and quantity of data, but split across many queries rather than just one, so it should be a lot lighter in terms of server overheads, string parsing, and so on.
/sync is slow (#288) and unreliable (#303) when the response is large. Use dedicated membership list APIs instead. Currently, this PR is just a literal translation from /sync to the new APIs. To make full use of them, we need to change when we begin processing the results, as we can now do so incrementally.
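To make the incremental idea concrete, the per-room flow might look something like this (a sketch assuming endpoints of the shape proposed in matrix-org/matrix-spec-proposals#734, i.e. a room-list query followed by per-room member queries; the URL, token handling and helper names are illustrative, not the bridge's actual code):

```js
const request = require('request');

const HOMESERVER = 'https://synapse.example.org'; // assumed: hitting Synapse directly
const ACCESS_TOKEN = process.env.AS_TOKEN;

function get(path) {
  return new Promise((resolve, reject) => {
    request({
      url: HOMESERVER + '/_matrix/client/r0' + path,
      qs: { access_token: ACCESS_TOKEN },
      json: true
    }, (err, res, body) => (err ? reject(err) : resolve(body)));
  });
}

// Process each room's members as soon as its (small) response arrives, instead
// of waiting for one multi-hundred-MB payload to be buffered and stringified.
async function syncMemberLists(onRoomMembers) {
  const { joined_rooms } = await get('/joined_rooms');
  for (const roomId of joined_rooms) {
    const { joined } = await get('/rooms/' + encodeURIComponent(roomId) + '/joined_members');
    onRoomMembers(roomId, Object.keys(joined));
  }
}
```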