Some Synapse instances have been hammering their database after v1.66.0 -> v1.68.0 update #13942
Comments
That query was touched recently by #13575.
On the affected hosts we saw that Postgres used a heap scan rather than an index scan for the aggressive query. Forcing an index-only scan (`SET enable_bitmapscan TO off`) made the query quicker (500ms -> 300ms) for one trial set of parameters. Unclear if those are representative. Also unclear if the index-only scan used more or less file IO. But this might be barking up the wrong tree. Are we perhaps calling …?
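For anyone who wants to repeat that comparison, here's a minimal diagnostic sketch (assuming psycopg2 and a placeholder query string, since the offending SQL isn't reproduced in this issue) of toggling `enable_bitmapscan` for a single session and diffing the `EXPLAIN (ANALYZE, BUFFERS)` output:

```python
# Diagnostic sketch only, not Synapse code. Substitute the real DSN and the
# offending query text; enable_bitmapscan is a per-session planner knob, so
# nothing here changes global database configuration.
import psycopg2

SUSPECT_QUERY = "SELECT 1"  # placeholder for the query seen in pg_stat_activity


def show_plan(cur, disable_bitmapscan: bool) -> None:
    # Disabling bitmap scans nudges the planner towards an index(-only) scan,
    # which is the experiment described above.
    cur.execute("SET enable_bitmapscan TO " + ("off" if disable_bitmapscan else "on"))
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + SUSPECT_QUERY)
    for (line,) in cur.fetchall():
        print(line)


with psycopg2.connect("dbname=synapse") as conn:
    with conn.cursor() as cur:
        show_plan(cur, disable_bitmapscan=False)  # planner's default choice
        show_plan(cur, disable_bitmapscan=True)   # forced away from the bitmap/heap scan
```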
Note that on matrix.org the cache factor specifically for …
By trying different versions we've confirmed that #13575 contributed to the performance regression, and we haven't seen any evidence of other queries being suddenly more expensive. We did this by trying 51d732d and its parent 4f6de33 on a test server. (I tried to revert 51d732d on top of 1.68.0 for testing, but there were merge conflicts and I wasn't sure how to handle them; I could use advice from @MadLittleMods on that one, and on how to proceed here more generally.) Anyway, here's the graph of database IO (bigger = more IO credits available) from the test server. Seems convincing to me.
From the PRs you linked: #13575 does make the … query more expensive. You can see the effect in this Grafana graph: https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1661891889156&to=1664483889157&viewPanel=11 (note the giant spike in …).

Also as a note, we will always use … (`synapse/storage/databases/main/roommember.py`, lines 960 to 963 in ebd9e2d).
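To make the shape of the regression concrete, here is a rough sketch of the two query forms being discussed. This is illustrative only, not the actual Synapse SQL; table and column names are approximations of the schema referenced in the linked code.

```python
# Illustrative only -- not the actual Synapse SQL. The point is the shape:
# the ordered variant drags the (huge) events table into every lookup.

# Membership lookup against current state alone: cheap and index-friendly.
USERS_IN_ROOM_UNORDERED = """
    SELECT state_key
      FROM current_state_events
     WHERE room_id = %s
       AND type = 'm.room.member'
       AND membership = 'join'
"""

# Roughly what an ordered variant looks like: the same lookup joined onto
# `events` so results can be ordered by stream_ordering. Every caller pays
# for the join, even callers that never needed the ordering.
USERS_IN_ROOM_ORDERED = """
    SELECT c.state_key
      FROM current_state_events AS c
      JOIN events AS e USING (event_id)
     WHERE c.room_id = %s
       AND c.type = 'm.room.member'
       AND c.membership = 'join'
     ORDER BY e.stream_ordering
"""
```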
Summarizing some more discussion in backend internal: we don't know exactly which upstream callers are accounting for all of these … calls. This also makes sense, since one of the affected hosts is Element One, which has a lot of bridged appservice users. I suspect we have some more … @clokep pointed out a couple …
I'll work on a PR now to change these over ⏩
… users

`get_local_users_in_room` is way more performant since it looks at a single table (`local_current_membership`) and is looking through way less data since it only worries about the local users in the room instead of everyone in the room across the federation.

Fix #13942

Related to:
- #13605
- #13608
- #13606
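As a rough illustration of the kind of call-site change described above (the surrounding function names are made up for this sketch; only `get_users_in_room` and `get_local_users_in_room` come from the thread, and this is not the actual Synapse change):

```python
# Hypothetical sketch of swapping a get_users_in_room call site over to
# get_local_users_in_room; not the actual Synapse code.

async def appservice_interested_in_room(store, appservice, room_id: str) -> bool:
    # Before (sketch): fetch every member across the federation, then throw
    # most of them away, since an appservice only cares about local users.
    #
    #   users = await store.get_users_in_room(room_id)
    #   local_users = [u for u in users if is_local(u)]
    #
    # After (sketch): read only local_current_membership, a single, much
    # smaller table containing just this server's members.
    local_users = await store.get_local_users_in_room(room_id)
    return any(appservice.is_interested_in_user(user_id) for user_id in local_users)
```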
Spawning from looking into `get_users_in_room` while investigating #13942 (comment). See #13575 (comment) for the original exploration around finding `get_users_in_room` mis-uses.
The host I used for testing also has a WhatsApp bridge with lots of remote users, plus a Signal bridge. The other host mentioned (db0718c0-2480-11e9-83c4-ad579ecfcc33-live) is a rather large host with a Telegram bridge. Sorry I failed to mention these in the original issue description even though appservices came to mind early on; I didn't realize db0718c0-2480-11e9-83c4-ad579ecfcc33-live has a Telegram bridge and thus didn't think there was a connection after all.

One other large host I looked at (8e71e000-3607-11ea-8fb7-3d4a5344b740-live) has no appservices. It shows a slight baseline increase in IOPS, though nothing on the level of the others. This may be why we're not seeing this issue all around in our monitoring stack. However, I want to make the point that non-appservice hosts also look affected by the changes, though to a much smaller degree. It doesn't look like this is affecting apdex, but it may on hosts where the RDS is struggling to provide the extra query capacity.
I have to say I'm leaning towards backing out the change to …

It's also worth noting that Beeper (cc @Fizzadar) have been trying to rip out all the joins onto the events table for performance reasons.

I'd vote we add a separate …
But also I should say: I'm surprised the change has caused this much of a problem, so this is not anyone's fault!
Fixes #13942. Introduced in #13575. Basically, let's only get the ordered set of hosts out of the DB if we need an ordered set of hosts. Since we split the function up, the caching won't be as good, but I think it will still be fine, as e.g. multiple backfill requests for the same room will hit the cache.
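For readers following along, a minimal sketch of the shape of that split (the names, the caching decorator, and the fetch helpers are all illustrative; this is not the actual patch):

```python
# Illustrative sketch of splitting the ordered and unordered paths apart;
# not the actual patch. `cached` stands in for Synapse's caching decorator,
# and the _fetch_* helpers stand in for the underlying SQL.


def cached():
    # No-op stand-in for Synapse's @cached descriptor.
    def wrap(func):
        return func
    return wrap


class RoomMemberStoreSketch:
    @cached()
    async def get_users_in_room(self, room_id: str) -> list[str]:
        # Cheap, unordered current-state lookup with no join onto `events`.
        # This is all that the vast majority of callers actually need.
        return await self._fetch_joined_members(room_id)

    @cached()
    async def get_ordered_hosts_in_room(self, room_id: str) -> list[str]:
        # Only this path pays for the ordering (and whatever join it implies),
        # e.g. when choosing which remote hosts to try first for backfill.
        # It gets its own cache entry, so repeated backfill requests for the
        # same room still hit the cache even though the caches are now split.
        members = await self._fetch_joined_members_ordered(room_id)
        # User IDs look like @localpart:server; derive the ordered,
        # de-duplicated list of servers.
        return list(dict.fromkeys(user_id.split(":", 1)[1] for user_id in members))

    async def _fetch_joined_members(self, room_id: str) -> list[str]:
        raise NotImplementedError("stands in for the unordered DB query")

    async def _fetch_joined_members_ordered(self, room_id: str) -> list[str]:
        raise NotImplementedError("stands in for the ordered DB query")
```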
As Erik mentioned, we (Beeper) also have this issue with RDS and IOPS starvation. Joins seem to be particularly expensive in terms of IOPS (both on indexes and not). I can't find the source right now, but IIRC doing joins effectively acts as an IOPS multiplier, which may be why this change hurt so much (multiplied by the AS paths utilising the function heavily).
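One way to sanity-check the IOPS theory on a given host (assuming the `pg_stat_statements` extension is available on the RDS instance; the DSN is a placeholder) is to rank statements by shared-buffer reads, which roughly tracks which queries are driving disk IO:

```python
# Diagnostic sketch, not Synapse code: list the statements doing the most
# shared-buffer reads, a rough proxy for which queries burn the most IOPS.
import psycopg2

TOP_IO_QUERY = """
    SELECT left(query, 80) AS query,
           calls,
           shared_blks_read,
           shared_blks_hit
      FROM pg_stat_statements
     ORDER BY shared_blks_read DESC
     LIMIT 10
"""

with psycopg2.connect("dbname=synapse") as conn:
    with conn.cursor() as cur:
        cur.execute(TOP_IO_QUERY)
        for query, calls, blks_read, blks_hit in cur.fetchall():
            print(f"{blks_read:>12} blks read  {blks_hit:>12} hit  {calls:>9} calls  {query}")
```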
…de` (#13960)

Spawning from looking into `get_users_in_room` while investigating #13942 (comment). See #13575 (comment) for the original exploration around finding `get_users_in_room` mis-uses.

Related to the following PRs where we also cleaned up some `get_users_in_room` mis-uses:
- #13605
- #13608
- #13606
- #13958
Description
Some EMS-hosted Synapse instances are hammering their database after upgrading from v1.66.0 to v1.68.0. The host concentrated on here is ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live (EMS internal host ID; please check with the EMS team for real hostnames).

The offending query is: …
The background update running at the time was `event_push_backfill_thread_id`, if relevant.

Graphs:
IOPS increase at upgrade. The initial plateau at 4K was due to the database being locked to 4K IOPS. Now it has 10K and has consistently continued hammering the database ~7 hours after the upgrade.
Degraded event send times, especially when constrained to 4K IOPS, which the host had been running fine with for a long time.
Stateres worst-case seems to reflect the database usage; just a side effect of a busy DB?
DB usage for background jobs had a rather massive spike for `notify_interested_appservices_ephemeral` right after the upgrade.
Taking that away from the graph, we see DB usage for background jobs is higher all around since the upgrade.
DB transactions:
Cache eviction seems to indicate we should raise the `get_local_users_in_room` cache size, as it is being evicted a lot by size. However, this has been the case pre-upgrade as well.

Appservice transactions have not changed during this time by a large factor (3 bridges):
A few other affected hosts, found manually:
Time-of-day changes in traffic have been ruled out; all these issues started at the upgrade, with no other changes to the hosting or deployment stack. There are probably more hosts affected by the DB usage increase.
Also discussed in backend internal.
Steps to reproduce
Upgrade from v1.66.0 to v1.68.0.
Homeserver
ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live, 01bbd800-4670-11e9-8324-b54a9efc8abc-live, db0718c0-2480-11e9-83c4-ad579ecfcc33-live
Synapse Version
v1.68.0
Installation Method
Other (please mention below)
Platform
EMS flavour Docker images built from upstream images. Kubernetes cluster.
Relevant log output
Anything else that would be useful to know?
No response