This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Optimize how we calculate
likely_domains
during backfill #13575Optimize how we calculate
likely_domains
during backfill #13575Changes from 11 commits
4e397c5
abd77f7
d9cd3ef
a55f094
396ca2d
cf68485
b90ab98
6ba5ef3
55330c8
aafb356
f32739c
6700cf5
3fb2ade
9a56cbe
42b04c4
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eliminating this option to get domains for a room in favor of
get_current_hosts_in_room
.In a room like
#matrixhq
, getting all 80k state events in the room, then whittling it down in the app is not faster than a quick targeted database query or maybe even luckily hitting the cache.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is adding this join here going to bite us in the bum for callers that don't actually care about the depth ordering?
As an aside, adding an index on
events(event_id, depth)
could be considered to speed up this if it turns out to be a problem, but on the other hand we're not keen on adding more indices to the events tables because it's frequently written.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Time-wise, I think by the nature of the extra complexity, it will take more time but it's not something I optimized for here. If we don't like this trade-off, we could remove the cache-shortcut lookup in
get_current_hosts_in_room
around it (it's kinda doing the same thing anyway looking incurrent_state_events
).Is there another concern about the order itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fwiw, here are some first-run timings. Not sure how much the first one affects the second one 🤷. Doubling the time isn't very good though so I'm up for removing the cache-shortcut lookup (just poke me further). Or we just sort by oldest
current_state_events
rows (not necessarilydepth
order but might be pretty close in practice) and not worry about the slight difference it makes to backfill (this one maybe sounds more interesting to me).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only doubling isn't as bad as I was expecting, so maybe it's fine. Possibly worth checking a graph for how much time is spent on
get_users_in_rooms
and thinking if it would burn us if that time were doubled.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a graph for this? I don't see special metrics annotated in it but 👍 it would be good to look at if we had it. Depending on confidence, we can merge a PR to measure it.
I've looked at the usages and there are some areas where it doesn't feel great to increase the time (calculating if an appservice is interested in an event (this could probably be smarter though anyway to use the user namespace regex directly on the members of the room like we do to get the domains), calculating
join_authorised_via_users_server
in a/make_join
(idk how important this is), newly joined rooms in/sync
, presence). Overall seems kinda fine ⏩I can create separate PRs to fix up the following areas first, then we can merge this after those places are changed.
There are a few areas where it feels like we shouldn't be using
get_users_in_room
and can just use a direct lookup for the single user:synapse/synapse/handlers/events.py
Lines 176 to 177 in a25a370
synapse/synapse/handlers/room.py
Lines 1287 to 1288 in a25a370
synapse/synapse/handlers/message.py
Lines 764 to 765 in a25a370
synapse/synapse/handlers/room_member.py
Lines 1623 to 1624 in a25a370
synapse/synapse/server_notices/server_notices_manager.py
Lines 114 to 115 in a25a370
There are a few places where we should use
get_current_hosts_in_room
instead:synapse/synapse/handlers/directory.py
Lines 86 to 87 in a25a370
synapse/synapse/handlers/directory.py
Lines 290 to 291 in a25a370
synapse/synapse/handlers/presence.py
Lines 2054 to 2055 in a25a370
synapse/synapse/handlers/typing.py
Lines 365 to 366 in a25a370
synapse/synapse/handlers/device.py
Lines 697 to 698 in a25a370
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is some follow-up. I think this PR was rolled out on
matrix.org
(or maybe just the RC was which wouldn't include this?)The average looks about the same. It seems a bit more peakier on total txn time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another follow-up, this time I think the changes are actually deployed to
matrix.org
as indicated by that change in theUP
graph at the top (and/versions
shows a hash that includes it)It looks like it does take significantly more time on average (~100ms vs ~10ms before) but we're doing the query a whole lot more less probably because we fixed up those mis-uses.
https://grafana.matrix.org/d/000000012/synapse?orgId=1&from=1662362659334&to=1662535459334&viewPanel=180
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably has a bigger penalty for other usage patterns, see #13942
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably fixed #13942 via #13958 (another
get_users_in_room
mis-use)Fixed up another mis-use in #13960
And found another mis-use that's tracked by #13967. Since this involves more than just switching it out, I'm not going to tackle. Feel free to pick it up!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ended up as two functions
get_current_hosts_in_room
andget_current_hosts_in_room_ordered
so that we don't have to take the performance hit of the extra join inget_users_in_room
, see #13972There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating the query to order by depth to match what
get_domains_from_state
did before (drop-in replacement).The heuristic of sorting by servers who have been there the longest is good because they're most likely to have anything we ask about.