This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

User directory is performing state resolution, which results in unnecessary CPU usage #9797

Closed
anoadragon453 opened this issue Apr 12, 2021 · 2 comments · Fixed by #9821
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@anoadragon453
Member

On April 11th, 2021 at ~12:00 UTC we saw matrix.org's user directory worker start using 100% CPU consistently, and it continued to do so until it was restarted on April 12th at 16:10 UTC.

It turns out that it was stuck doing state resolution for an IRC room with 123,000+ state events.

It's a little bit surprising that the user directory is doing state resolution at all though, as it should just be listening for membership changes happening on the current_state_deltas_stream, and updating tables used for user directory search accordingly.

In the logs, we see the following repeated multiple times per second:

2021-04-12 00:00:44,506 - synapse.replication.tcp.handler - 496 - INFO - replication_command_handler@7f0b5b2e2268 - Handling 'POSITION events event_persister-2 1939721421 1939721422'
2021-04-12 00:00:44,506 - synapse.replication.tcp.handler - 549 - INFO - process-replication-data-48623630 - Caught up with stream 'events' to 1939721422
2021-04-12 00:00:44,507 - synapse.replication.tcp.handler - 496 - INFO - replication_command_handler@7f0b5b2e2268 - Handling 'POSITION events event_persister-2 1939721422 1939721423'
2021-04-12 00:00:44,507 - synapse.replication.tcp.handler - 549 - INFO - process-replication-data-48623632 - Caught up with stream 'events' to 1939721423
2021-04-12 00:00:44,610 - synapse.state - 576 - INFO - Measure[resolve_state_groups_for_events]@7f09dc222840 - Resolving state for !xxx:domain with groups [596595428, 596513551]
2021-04-12 00:00:44,714 - synapse.state.v1 - 84 - INFO - Measure[state._resolve_events]@7f09dc222d68 - Asking for 104/104 conflicted events
2021-04-12 00:00:44,715 - synapse.state.v1 - 118 - INFO - Measure[state._resolve_events]@7f09dc222d68 - Asking for 3/3 auth events

(Note that we are using Redis replication, even though that code lives in the tcp/handler.py module.)

So it seems that the user directory is listening to the events stream (I think), in addition to the current_state_deltas stream:

max_pos, deltas = await self.store.get_current_state_deltas(
    self.pos, room_max_stream_ordering
)

Ideally the user directory would just accept membership updates from other worker processes without needing to perform state resolution itself in the meantime.
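
As a rough illustration of what "just accepting membership updates" could look like, here is a minimal sketch that keeps a room-to-joined-users map up to date from deltas alone. `StateDelta` and `apply_membership_deltas` are hypothetical names made up for this example, and the delta shape is a simplified assumption, not Synapse's actual row format:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class StateDelta:
    """Hypothetical, simplified stand-in for one current-state delta."""

    room_id: str
    event_type: str
    state_key: str
    # New membership value; None if this is not a membership event
    # or the state entry was removed.
    membership: Optional[str]


def apply_membership_deltas(
    joined: Dict[str, Set[str]], deltas: Iterable[StateDelta]
) -> None:
    """Update the room -> joined users map in place, using only the deltas."""
    for delta in deltas:
        if delta.event_type != "m.room.member":
            continue  # only membership changes matter here
        users = joined.setdefault(delta.room_id, set())
        if delta.membership == "join":
            users.add(delta.state_key)
        else:
            # leave, ban, invite, or removal: the user is not joined
            users.discard(delta.state_key)


joined_users: Dict[str, Set[str]] = {}
apply_membership_deltas(
    joined_users,
    [
        StateDelta("!room:example.org", "m.room.member", "@alice:example.org", "join"),
        StateDelta("!room:example.org", "m.room.topic", "", None),
        StateDelta("!room:example.org", "m.room.member", "@bob:example.org", "join"),
        StateDelta("!room:example.org", "m.room.member", "@bob:example.org", "leave"),
    ],
)
```

The point of the sketch is that nothing in the loop needs the room's state groups or auth chain, so the cost is proportional to the number of deltas rather than to the size of the room's state.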

@turt2live
Member

it appears to have started doing it again despite the restart

@anoadragon453
Member Author

anoadragon453 commented Apr 13, 2021

The state resolution comes from get_current_users_in_room, which the user directory calls a few times and which resolves state in order to get the current users in a room:

logger.debug("calling resolve_state_groups from get_current_users_in_room")
entry = await self.resolve_state_groups_for_events(room_id, latest_event_ids)
return await self.store.get_joined_users_from_state(room_id, entry)

But perhaps we can use get_users_in_room instead, which just pulls from the current_state_events table:

@cached(max_entries=100000, iterable=True)
async def get_users_in_room(self, room_id: str) -> List[str]:
    return await self.db_pool.runInteraction(
        "get_users_in_room", self.get_users_in_room_txn, room_id
    )
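
To show why a table-backed lookup is cheap compared with state resolution, here is a small self-contained sketch using an in-memory SQLite database. The four-column schema and the `joined_users_from_current_state` helper are assumptions made for this example only; they are not Synapse's real schema or the body of get_users_in_room_txn:

```python
import sqlite3
from typing import List


def joined_users_from_current_state(
    conn: sqlite3.Connection, room_id: str
) -> List[str]:
    """Return joined user IDs for a room with a single table lookup."""
    rows = conn.execute(
        "SELECT state_key FROM current_state_events "
        "WHERE type = 'm.room.member' AND room_id = ? AND membership = 'join' "
        "ORDER BY state_key",
        (room_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: one row per current state entry.
conn.execute(
    "CREATE TABLE current_state_events "
    "(room_id TEXT, type TEXT, state_key TEXT, membership TEXT)"
)
conn.executemany(
    "INSERT INTO current_state_events VALUES (?, ?, ?, ?)",
    [
        ("!room:example.org", "m.room.member", "@carol:example.org", "join"),
        ("!room:example.org", "m.room.member", "@alice:example.org", "join"),
        ("!room:example.org", "m.room.member", "@bob:example.org", "leave"),
        ("!room:example.org", "m.room.topic", "", None),
        ("!other:example.org", "m.room.member", "@dave:example.org", "join"),
    ],
)

users = joined_users_from_current_state(conn, "!room:example.org")
```

The query touches only rows for the requested room, regardless of how many state events the room has accumulated over its history.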

@anoadragon453 anoadragon453 added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Apr 14, 2021
anoadragon453 added a commit that referenced this issue Apr 16, 2021
Fixes: #9797.

Should help reduce CPU usage on the user directory, especially when memberships change in rooms with lots of state history.