
Synapse stops sending persistent events (other than join memberships) to application services #17206

Closed
anoadragon453 opened this issue May 16, 2024 · 4 comments

Comments

@anoadragon453
Member

anoadragon453 commented May 16, 2024

Description

There was a failure mode today on a customer's ESS deployment where their Synapse stopped sending PDUs to registered appservices. Ephemeral events (to-device, presence, read receipts) were being sent, and, curiously, membership join events were too. But messages, m.room.create events and membership leave events weren't making it over (likely because join memberships take a separate notifier path). The deployment had no workers at the time.

The synapse_event_processing_lag_by_event metric reported that the appservice_sender label was ~80K events behind (the build-up started ~20 days ago). We tried setting up workers on the deployment, though initially not an appservice worker. As the problem persisted, an appservice worker was configured and started, yet events were still not being sent. The synapse_event_processing_lag_by_event metric was also not being reported by the appservice worker, which means this code was not being reached:

synapse.metrics.event_processing_lag_by_event.labels(
    "appservice_sender"
).observe((now - ts) / 1000)

Looking in the database, we saw that the stream IDs in the application_services_state table (which concern ephemeral events) were fine and up to date. However, the appservice_stream_position table showed a stream ID of ~20K, while the stream_positions table showed the events stream at ~80K. The appservice position had fallen behind and was failing to catch up.
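For reference, the inspection amounted to roughly the following. This is a minimal sketch, assuming direct psycopg2 access and the column names stream_ordering (appservice_stream_position) and stream_name/stream_id (stream_positions); the connection string and column names are assumptions, so check them against your actual schema.

import psycopg2

# Hypothetical read-only check of the two positions discussed above.
conn = psycopg2.connect("dbname=synapse user=synapse")
with conn, conn.cursor() as cur:
    cur.execute("SELECT stream_ordering FROM appservice_stream_position")
    appservice_pos = cur.fetchone()[0]

    cur.execute(
        "SELECT max(stream_id) FROM stream_positions WHERE stream_name = 'events'"
    )
    events_pos = cur.fetchone()[0]

print(f"appservice position: {appservice_pos}")
print(f"events stream head:  {events_pos}")
print(f"lag:                 {events_pos - appservice_pos} events")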

The only thing that worked was to manually update the stream position in appservice_stream_position to match the events row in the stream_positions table and restart the appservice worker. This led us to believe that there is a deadlock case that can occur after falling very far behind.
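For posterity, the manual intervention was something like the sketch below (again with assumed column names, and not an officially supported procedure; take a backup first, and restart the appservice worker afterwards):

import psycopg2

# Hypothetical sketch of the manual fix described above: copy the "events"
# stream position over the stale appservice position.
conn = psycopg2.connect("dbname=synapse user=synapse")
with conn:  # commits on successful exit
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE appservice_stream_position SET stream_ordering = "
            "(SELECT max(stream_id) FROM stream_positions WHERE stream_name = 'events')"
        )
conn.close()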

The code that compares the event stream positions and pulls the necessary events out of the database is (in synapse/handlers/appservice.py):

@wrap_as_background_process("notify_interested_services")
async def _notify_interested_services(self, max_token: RoomStreamToken) -> None:
    with Measure(self.clock, "notify_interested_services"):
        self.is_processing = True
        try:
            upper_bound = -1
            while upper_bound < self.current_max:
                last_token = await self.store.get_appservice_last_pos()
                (
                    upper_bound,
                    event_to_received_ts,
                ) = await self.store.get_all_new_event_ids_stream(
                    last_token, self.current_max, limit=100
                )

                events = await self.store.get_events_as_list(
                    event_to_received_ts.keys(), get_prev_content=True
                )

The suspicion is that either upper_bound or self.current_max was somehow incorrectly set. Some questions we're left with:

  • Why did the appservice_sender stream get stuck in the first place?
  • Why didn't it unwedge itself after restarts of either the main process or (once it was brought into existence) the appservice worker?
  • Why did it require a worker restart, once the DB was updated, to get going again?

It may be possible to reproduce this with a unit test that sets up appropriately out-of-sync stream positions.
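As a rough starting point for such a test, here is a toy model of the catch-up loop quoted above. This is not Synapse code: the store is faked, the numbers are invented, and the "stale current_max" scenario is only a hypothesis.

class FakeStore:
    """Stand-in for the relevant store methods; not Synapse's real store."""

    def __init__(self, appservice_pos: int, events_head: int) -> None:
        self.appservice_pos = appservice_pos  # appservice_stream_position
        self.events_head = events_head        # "events" row in stream_positions

    def get_appservice_last_pos(self) -> int:
        return self.appservice_pos

    def get_all_new_event_ids_stream(self, from_pos: int, to_pos: int, limit: int):
        # Mirrors the real method's contract: events in (from_pos, to_pos],
        # capped at `limit`, returning the new upper bound and the event IDs.
        upper = min(from_pos + limit, to_pos, self.events_head)
        return upper, list(range(from_pos + 1, upper + 1))


def drain(store: FakeStore, current_max: int) -> int:
    """Run the loop shape from _notify_interested_services; return events handled."""
    sent = 0
    upper_bound = -1
    while upper_bound < current_max:
        last_token = store.get_appservice_last_pos()
        upper_bound, event_ids = store.get_all_new_event_ids_stream(
            last_token, current_max, limit=100
        )
        sent += len(event_ids)
        store.appservice_pos = upper_bound  # the real code persists this separately
    return sent


# Healthy: current_max tracks the true head, so the backlog drains (prints 80000).
print(drain(FakeStore(appservice_pos=20_000, events_head=100_000), current_max=100_000))

# Hypothetical stall: if current_max were pinned at the stale position, the loop
# finds nothing to send and stops immediately (prints 0).
print(drain(FakeStore(appservice_pos=20_000, events_head=100_000), current_max=20_000))

A real regression test would presumably seed appservice_stream_position and the events stream at divergent values inside Synapse's own test harness rather than using a fake store like this.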

Synapse Version

Synapse 1.99.0+lts.3 for the full duration of the problem

Database

PostgreSQL

Workers

Multiple workers

Platform

Element Server Suite, so Kubernetes.

Relevant log output

We didn't spot any relevant logs.
@Half-Shot
Member

“and, curiously, membership join events were too.”

I find this especially odd, given the stream position wasn't moving.

Also, we found that the appservice_stream_position table has an unused(?) lock column that we couldn't match to any code in Synapse.

@anoadragon453
Member Author

Potentially a duplicate of matrix-org/synapse#13950.

@Half-Shot
Member

@anoadragon453 did we think this was related to the timing out requests for user queries in the end?

@anoadragon453
Member Author

@Half-Shot yes indeed. For context, this turned out to be due to user lookup requests from Synapse to the application service timing out. Each of these calls to self._check_user_exists has a timeout of 60s:

https://github.com/element-hq/synapse-private/blob/550d760364bf79147046c7c939af59db06ae17f1/synapse/handlers/appservice.py#L179-L187

and this code runs for every event that is meant to be sent to an AS. You can see why the AS's queuer may get backed up as new events are created!
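To put rough numbers on that (illustrative arithmetic only; it assumes every event incurred a full 60s timeout, which we haven't verified):

# Back-of-envelope sketch of the back-pressure when the AS is unreachable:
# each event can cost up to the 60 s user-query timeout, so the sender drains
# at most ~1 event per minute while new events keep arriving.
timeout_s = 60                               # per _check_user_exists call
incoming_per_day = 80_000 // 20              # ~4,000/day, from the ~80K backlog over ~20 days above
drained_per_day = 24 * 60 * 60 // timeout_s  # 1,440/day at one full timeout per event
print(incoming_per_day - drained_per_day)    # backlog grows by ~2,560 events/day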

I think it's best to close this issue, and instead create a new one with suggestions for how to better handle a poor network connection between Synapse and an AS in this case.
