@Half-Shot yes indeed. For context, this turned out to be due to user lookup requests from Synapse to the application service timing out. Each of these calls to self._check_user_exists has a timeout of 60s, and this code runs for every event that is meant to be sent to an AS. You can see why the AS's queuer may get backed up as new events are created!
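To make the impact concrete, here is a toy sketch. The 60s timeout and the ~80K backlog are the figures from this issue; the function names and the unresponsive AS are made up for illustration, so this is not Synapse code.

```python
import asyncio

AS_QUERY_TIMEOUT = 60.0   # per-lookup timeout, as described above
BACKLOG_EVENTS = 80_000   # approximate backlog reported by the metric


async def query_user_on_as(user_id: str) -> bool:
    """Stand-in for the HTTP user query Synapse makes to the appservice."""
    await asyncio.sleep(3600)  # pretend the AS never answers
    return True


async def check_user_exists(user_id: str) -> bool:
    """Shape of the per-event gate: one lookup, bounded by the 60s timeout."""
    try:
        return await asyncio.wait_for(query_user_on_as(user_id), AS_QUERY_TIMEOUT)
    except asyncio.TimeoutError:
        return False  # an unreachable AS costs the full timeout per lookup


async def main() -> None:
    # Calling check_user_exists() once here would block for the full 60s;
    # with one lookup gating every queued event, the worst case is:
    worst_case = BACKLOG_EVENTS * AS_QUERY_TIMEOUT
    print(f"worst case to drain the backlog: {worst_case / 86400:.0f} days")


asyncio.run(main())
```

Even with lookups that eventually succeed, anything close to the timeout per event keeps the queuer permanently behind while new events keep arriving.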
I think it's best to close this issue, and instead create a new one with suggestions for how to better handle a poor network connection between Synapse and an AS in this case.
Description
There was a failure mode today on a customer's ESS deployment where their Synapse stopped sending PDUs to registered appservices. Ephemeral events (to-device, presence, read receipts) were sent, and curiously membership join events were being sent. But messages, m.room.create and membership leave events all weren't making it over (likely as join memberships have a separate notifier path). The deployment had no workers.
The synapse_event_processing_lag_by_event metric reported that the appservice_sender label was ~80K events behind (the build up starting ~20 days ago). We tried setting up workers on the deployment, but not an appservice worker. Then as the problem persisted, an appservice worker was configured and started, yet events were still not being sent. The synapse_event_processing_lag_by_event metric was also not being reported by the appservice worker, which means this code was not being reached (synapse/synapse/handlers/appservice.py, lines 179 to 181 in 23740ea).
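For anyone watching for the same symptom, the lag can be pulled straight out of Prometheus. A minimal sketch, assuming Prometheus scrapes this Synapse; the metric name is the one above, but the "name" label key and any histogram suffixes are assumptions, hence the regex on __name__:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # adjust for your deployment

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={
        "query": '{__name__=~"synapse_event_processing_lag.*", name="appservice_sender"}'
    },
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("__name__"), series["value"])
```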
Looking in the database, we saw that the stream IDs (which concern ephemeral events) in the application_services_state table were fine and up to date. However, looking at the appservice_stream_position table revealed a stream ID of ~20K, while in the stream_positions table one could see the events stream was at ~80K. It had fallen behind, and was failing to catch up.

The only thing that worked was to manually update the stream position in appservice_stream_position to match the events row in the stream_positions table and restart the appservice worker. This led us to believe that there's a deadlock case that's possible upon falling very far behind.
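For reference, the inspection and the manual fix amount to something like the sketch below. The table names are the ones from this issue; the column names (stream_ordering, stream_name, stream_id) are assumptions about the schema, so verify them against the live database before running anything, especially the UPDATE.

```python
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse")  # adjust the DSN

with conn, conn.cursor() as cur:
    # Where the appservice sender thinks the events stream is.
    cur.execute("SELECT * FROM appservice_stream_position")
    print("appservice_stream_position:", cur.fetchall())

    # Where the events stream actually is.
    cur.execute("SELECT * FROM stream_positions WHERE stream_name = 'events'")
    print("stream_positions (events):", cur.fetchall())

    # Per-appservice ephemeral-event positions (these were fine for us).
    cur.execute("SELECT * FROM application_services_state")
    print("application_services_state:", cur.fetchall())

    # The manual fix: advance the appservice position to the head of the
    # events stream, then restart the appservice worker.
    cur.execute(
        "UPDATE appservice_stream_position SET stream_ordering = ("
        "  SELECT MAX(stream_id) FROM stream_positions"
        "  WHERE stream_name = 'events')"
    )
```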
The code that compares the event stream positions and pulls the necessary events out of the database is in synapse/synapse/handlers/appservice.py, lines 121 to 138 in 23740ea.
The suspicion is that either upper_bound or self.current_max were somehow incorrectly set. Some questions we're left with: how did the appservice_sender stream get stuck in the first place? It's possible that this could be reproduced with a unit test with appropriately out of sync stream positions.
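As a starting point for such a test, here is a toy model of that catch-up loop. It is not the code from handlers/appservice.py, just the shape described above (compare the persisted appservice position with the current maximum, pull events in batches, persist the new position), with this incident's out-of-sync condition expressed as an assertion:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeStore:
    events: List[int] = field(default_factory=list)  # stream orderings of events
    appservice_pos: int = 0                          # appservice_stream_position

    def get_new_events_for_appservice(
        self, current_max: int, limit: int
    ) -> Tuple[int, List[int]]:
        batch = [e for e in self.events if self.appservice_pos < e <= current_max][:limit]
        upper_bound = batch[-1] if batch else current_max
        return upper_bound, batch


def notify_interested_services(store: FakeStore, current_max: int, limit: int = 100) -> int:
    """Drain events up to current_max in batches; returns how many were 'sent'."""
    sent = 0
    upper_bound = -1
    while upper_bound < current_max:
        upper_bound, events = store.get_new_events_for_appservice(current_max, limit)
        sent += len(events)
        store.appservice_pos = upper_bound  # persist the new position
    return sent


def test_catches_up_when_far_behind() -> None:
    # Out-of-sync condition from this incident: position ~20K, events at ~80K.
    store = FakeStore(events=list(range(1, 80_001)), appservice_pos=20_000)
    sent = notify_interested_services(store, current_max=80_000)
    assert store.appservice_pos == 80_000
    assert sent == 60_000


test_catches_up_when_far_behind()
```

A real regression test would drive the actual handler and storage functions rather than a fake store, but the assertion is the same: after notifying with positions this far apart, the persisted appservice position should reach the head of the events stream.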
Synapse Version
Synapse 1.99.0+lts.3 for the full duration of the problem
Database
PostgreSQL
Workers
Multiple workers
Platform
Element Server Suite, so Kubernetes.
Relevant log output
We didn't spot any relevant logs.