This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Prevent one broken room from breaking the whole incoming federation process #10605

Closed
MurzNN opened this issue Aug 14, 2021 · 1 comment

Comments


MurzNN commented Aug 14, 2021

Description:

Synapse instances sometimes develop problems with the whole incoming federation process because of a single broken room.

For example, this happened on our ru-matrix.org homeserver: because of one broken event in a room (the problem is described in #10589), incoming federation worked very slowly for years, with delays of many hours for most incoming messages from popular homeservers. E2EE was effectively broken too (key exchanges usually failed due to timeouts), as were calls and everything else that depends on robust federation. The admins spent a lot of time tracking down the source of the problem because of the missing ERROR-level logs (#10597).

This is only one example, but there are many different situations in which one room can become broken, and this then breaks federation for all other rooms as well!

To prevent this, it would be good to implement a workaround in Synapse (and maybe in the Spec too) that skips the broken room and continues syncing federated data for the other, non-broken rooms.
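The proposed behaviour could be sketched roughly as follows. This is a hypothetical illustration, not Synapse code: the function and parameter names (`process_transaction`, `pdus_by_room`, `handle_pdu`) are invented for the example. The idea is simply that failures are contained per room, so one broken room cannot stall the others.

```python
import logging

logger = logging.getLogger(__name__)

def process_transaction(pdus_by_room, handle_pdu):
    """Process incoming federated events grouped by room, isolating failures.

    pdus_by_room: dict mapping room_id -> list of events (PDUs).
    handle_pdu:   callable(room_id, pdu) that processes one event and may raise.
    """
    results = {}
    for room_id, pdus in pdus_by_room.items():
        try:
            for pdu in pdus:
                handle_pdu(room_id, pdu)
            results[room_id] = "ok"
        except Exception as e:
            # A failure in one room is logged and skipped, not propagated,
            # so the remaining rooms in the transaction still get processed.
            logger.error("Skipping broken room %s: %s", room_id, e)
            results[room_id] = "error: %s" % e
    return results
```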

@erikjohnston
Member

We already try and do this by logging and discarding errors we get when processing events in a room we received in a transaction, e.g. here:

try:
    await self._handle_received_pdu(origin, pdu)
    return {}
except FederationError as e:
    logger.warning("Error handling PDU %s: %s", event_id, e)
    return {"error": str(e)}
except Exception as e:
    f = failure.Failure()
    logger.error(
        "Failed to handle PDU %s",
        event_id,
        exc_info=(f.type, f.value, f.getTracebackObject()),  # type: ignore
    )
    return {"error": str(e)}

And as of v1.38.0 we do the majority of the processing of events in the background, to ensure that the /send request returns quickly (cf. #10284).
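The pattern described above, acknowledging a `/send` transaction immediately and finishing the heavy processing afterwards, can be sketched with asyncio. This is a minimal illustration under invented names (`on_send_transaction`, `process_in_background`), not Synapse's actual implementation:

```python
import asyncio

async def handle_pdu(pdu):
    # Stand-in for expensive event processing (state resolution, storage, ...).
    await asyncio.sleep(0.01)

async def process_in_background(pdus):
    for pdu in pdus:
        await handle_pdu(pdu)

async def on_send_transaction(pdus, background_tasks):
    # Queue the heavy work as a background task and return right away,
    # so the sending server is not blocked waiting for full processing.
    task = asyncio.create_task(process_in_background(pdus))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return {"pdus": {}}  # immediate response body

async def main():
    tasks = set()
    resp = await on_send_transaction(["$event1", "$event2"], tasks)
    # In a real server the loop keeps running; here we just drain the queue.
    await asyncio.gather(*tasks)
    return resp
```

The key design point is that the HTTP response no longer waits on event processing, so a slow or broken room only delays its own background task.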

I'm not that surprised there is a bug somewhere in the logic, but we do already have the infrastructure in place to stop one broken room from breaking inbound federation. I'm going to close this for now and try and track the instances where this is not the case as separate bugs, if that is OK @MurzNN ?
