This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Prevent one broken room from breaking the whole incoming federation process #10605

Closed
MurzNN opened this issue Aug 14, 2021 · 1 comment

Comments


MurzNN commented Aug 14, 2021

Description:

Synapse instances sometimes develop problems with the whole incoming federation process because of a single broken room.

For example, this happened on our ru-matrix.org homeserver: because of one broken event in a room (the problem is described in #10589), incoming federation worked very slowly for years, with delays of many hours for most incoming messages from popular homeservers. E2EE was effectively broken too (key exchanges usually failed due to timeouts), as were calls and everything else that depends on robust federation. The admins spent a lot of time tracking down the source of the problem because of the missing ERROR-level logs (#10597).

This is only one example, but there are many different situations in which one room can become broken, and this then breaks federation for all other rooms as well!

To prevent this, it would be good to implement a workaround in Synapse (and maybe in the Spec too) that skips the broken room and continues syncing federated data for the other, non-broken rooms.
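The proposed behaviour could be sketched roughly as follows. This is a hypothetical illustration, not Synapse code: the function and parameter names (`process_transaction`, `pdus_by_room`, `handle_pdu`) are invented for the example. The idea is simply that failures are contained per room, so one broken room cannot stall the others.

```python
import logging

logger = logging.getLogger(__name__)

def process_transaction(pdus_by_room, handle_pdu):
    """Process incoming federated events grouped by room, isolating failures.

    pdus_by_room: dict mapping room_id -> list of events (PDUs).
    handle_pdu:   callable(room_id, pdu) that processes one event and may raise.
    """
    results = {}
    for room_id, pdus in pdus_by_room.items():
        try:
            for pdu in pdus:
                handle_pdu(room_id, pdu)
            results[room_id] = "ok"
        except Exception as e:
            # A failure in one room is logged and skipped, not propagated,
            # so the remaining rooms in the transaction still get processed.
            logger.error("Skipping broken room %s: %s", room_id, e)
            results[room_id] = "error: %s" % e
    return results
```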

@erikjohnston
Member

We already try and do this by logging and discarding errors we get when processing events in a room we received in a transaction, e.g. here:

try:
    await self._handle_received_pdu(origin, pdu)
    return {}
except FederationError as e:
    logger.warning("Error handling PDU %s: %s", event_id, e)
    return {"error": str(e)}
except Exception as e:
    f = failure.Failure()
    logger.error(
        "Failed to handle PDU %s",
        event_id,
        exc_info=(f.type, f.value, f.getTracebackObject()),  # type: ignore
    )
    return {"error": str(e)}

And as of v1.38.0 we do the majority of the processing of events in the background, to ensure that the /send request returns quickly (cf. #10284).
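The pattern described above, acknowledging a `/send` transaction immediately and finishing the heavy processing afterwards, can be sketched with asyncio. This is a minimal illustration under invented names (`on_send_transaction`, `process_in_background`), not Synapse's actual implementation:

```python
import asyncio

async def handle_pdu(pdu):
    # Stand-in for expensive event processing (state resolution, storage, ...).
    await asyncio.sleep(0.01)

async def process_in_background(pdus):
    for pdu in pdus:
        await handle_pdu(pdu)

async def on_send_transaction(pdus, background_tasks):
    # Queue the heavy work as a background task and return right away,
    # so the sending server is not blocked waiting for full processing.
    task = asyncio.create_task(process_in_background(pdus))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return {"pdus": {}}  # immediate response body

async def main():
    tasks = set()
    resp = await on_send_transaction(["$event1", "$event2"], tasks)
    # In a real server the loop keeps running; here we just drain the queue.
    await asyncio.gather(*tasks)
    return resp
```

The key design point is that the HTTP response no longer waits on event processing, so a slow or broken room only delays its own background task.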

I'm not that surprised there is a bug somewhere in the logic, but we do already have the infrastructure in place to stop one broken room from breaking inbound federation. I'm going to close this for now and try and track the instances where this is not the case as separate bugs, if that is OK @MurzNN ?
