@Half-Shot yes indeed. For context, this turned out to be due to user lookup requests from Synapse to the application service timing out. Each of these calls to self._check_user_exists has a timeout of 60s, and this code runs for every event that is meant to be sent to an AS. You can see why the AS's queuer may get backed up as new events are created!
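To make the impact concrete, here is a toy sketch. The 60s timeout and the ~80K backlog are the figures from this issue; the function names and the unresponsive AS are made up for illustration, so this is not Synapse code.

```python
import asyncio

AS_QUERY_TIMEOUT = 60.0   # per-lookup timeout, as described above
BACKLOG_EVENTS = 80_000   # approximate backlog reported by the metric


async def query_user_on_as(user_id: str) -> bool:
    """Stand-in for the HTTP user query Synapse makes to the appservice."""
    await asyncio.sleep(3600)  # pretend the AS never answers
    return True


async def check_user_exists(user_id: str) -> bool:
    """Shape of the per-event gate: one lookup, bounded by the 60s timeout."""
    try:
        return await asyncio.wait_for(query_user_on_as(user_id), AS_QUERY_TIMEOUT)
    except asyncio.TimeoutError:
        return False  # an unreachable AS costs the full timeout per lookup


async def main() -> None:
    # Calling check_user_exists() once here would block for the full 60s;
    # with one lookup gating every queued event, the worst case is:
    worst_case = BACKLOG_EVENTS * AS_QUERY_TIMEOUT
    print(f"worst case to drain the backlog: {worst_case / 86400:.0f} days")


asyncio.run(main())
```

Even with lookups that eventually succeed, anything close to the timeout per event keeps the queuer permanently behind while new events keep arriving.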
I think it's best to close this issue, and instead create a new one with suggestions for how to better handle a poor network connection between Synapse and an AS in this case.
Description
There was a failure mode today on a customer's ESS deployment where their Synapse stopped sending PDUs to registered appservices. Ephemeral events (to-device, presence, read receipts) were sent, and curiously membership join events were being sent. But messages, m.room.create and membership leave events all weren't making it over (likely as join memberships have a separate notifier path). The deployment had no workers.
The synapse_event_processing_lag_by_event metric reported that the appservice_sender label was ~80K events behind (the build up starting ~20 days ago). We tried setting up workers on the deployment, but not an appservice worker. Then as the problem persisted, an appservice worker was configured and started, yet events were still not being sent. The synapse_event_processing_lag_by_event metric was also not being reported by the appservice worker, which means this code was not being reached (synapse/synapse/handlers/appservice.py, lines 179 to 181 in 23740ea).
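For anyone watching for the same symptom, the lag can be pulled straight out of Prometheus. A minimal sketch, assuming Prometheus scrapes this Synapse; the metric name is the one above, but the "name" label key and any histogram suffixes are assumptions, hence the regex on __name__:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # adjust for your deployment

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={
        "query": '{__name__=~"synapse_event_processing_lag.*", name="appservice_sender"}'
    },
    timeout=10,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("__name__"), series["value"])
```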
Looking in the database, we saw that the stream IDs (which concern ephemeral events) in the application_services_state table were fine and up to date. However, looking at the appservice_stream_position table revealed a stream ID of ~20K, while in the stream_positions table one could see the events stream was at ~80K. It had fallen behind, and was failing to catch up.

The only thing that worked was to manually update the stream position in appservice_stream_position to match the events row in the stream_positions table and restart the appservice worker. This led us to believe that there's a deadlock case that's possible upon falling very far behind.
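For reference, the inspection and the manual fix amount to something like the sketch below. The table names are the ones from this issue; the column names (stream_ordering, stream_name, stream_id) are assumptions about the schema, so verify them against the live database before running anything, especially the UPDATE.

```python
import psycopg2

conn = psycopg2.connect("dbname=synapse user=synapse")  # adjust the DSN

with conn, conn.cursor() as cur:
    # Where the appservice sender thinks the events stream is.
    cur.execute("SELECT * FROM appservice_stream_position")
    print("appservice_stream_position:", cur.fetchall())

    # Where the events stream actually is.
    cur.execute("SELECT * FROM stream_positions WHERE stream_name = 'events'")
    print("stream_positions (events):", cur.fetchall())

    # Per-appservice ephemeral-event positions (these were fine for us).
    cur.execute("SELECT * FROM application_services_state")
    print("application_services_state:", cur.fetchall())

    # The manual fix: advance the appservice position to the head of the
    # events stream, then restart the appservice worker.
    cur.execute(
        "UPDATE appservice_stream_position SET stream_ordering = ("
        "  SELECT MAX(stream_id) FROM stream_positions"
        "  WHERE stream_name = 'events')"
    )
```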
The code that compares the event stream positions and pulls the necessary events out of the database is in synapse/synapse/handlers/appservice.py, lines 121 to 138 in 23740ea.
The suspicion is that either upper_bound or self.current_max were somehow incorrectly set. Some questions we're left with: how did the appservice_sender stream get stuck in the first place? It's possible that this could be reproduced with a unit test with appropriately out of sync stream positions.
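As a starting point for such a test, here is a toy model of that catch-up loop. It is not the code from handlers/appservice.py, just the shape described above (compare the persisted appservice position with the current maximum, pull events in batches, persist the new position), with this incident's out-of-sync condition expressed as an assertion:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FakeStore:
    events: List[int] = field(default_factory=list)  # stream orderings of events
    appservice_pos: int = 0                          # appservice_stream_position

    def get_new_events_for_appservice(
        self, current_max: int, limit: int
    ) -> Tuple[int, List[int]]:
        batch = [e for e in self.events if self.appservice_pos < e <= current_max][:limit]
        upper_bound = batch[-1] if batch else current_max
        return upper_bound, batch


def notify_interested_services(store: FakeStore, current_max: int, limit: int = 100) -> int:
    """Drain events up to current_max in batches; returns how many were 'sent'."""
    sent = 0
    upper_bound = -1
    while upper_bound < current_max:
        upper_bound, events = store.get_new_events_for_appservice(current_max, limit)
        sent += len(events)
        store.appservice_pos = upper_bound  # persist the new position
    return sent


def test_catches_up_when_far_behind() -> None:
    # Out-of-sync condition from this incident: position ~20K, events at ~80K.
    store = FakeStore(events=list(range(1, 80_001)), appservice_pos=20_000)
    sent = notify_interested_services(store, current_max=80_000)
    assert store.appservice_pos == 80_000
    assert sent == 60_000


test_catches_up_when_far_behind()
```

A real regression test would drive the actual handler and storage functions rather than a fake store, but the assertion is the same: after notifying with positions this far apart, the persisted appservice position should reach the head of the events stream.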
Synapse Version
Synapse 1.99.0+lts.3 for the full duration of the problem
Database
PostgreSQL
Workers
Multiple workers
Platform
Element Server Suite, so Kubernetes.
Relevant log output
We didn't spot any relevant logs.