Federation catchup doesn't send to_device EDUs until the remote end has caught up #8691

Open
Tracked by #245
matrixbot opened this issue Dec 18, 2023 · 2 comments

Comments

@matrixbot
Collaborator

matrixbot commented Dec 18, 2023

This issue has been migrated from #8691.


Description

When a remote server falls behind on federation, Synapse backs off and starts batching up requests. Usually this isn't too bad, as the remote end will only be 1 or 2 transactions behind. More serious occurrences, however, can put the server behind by hundreds of transactions or thousands of events.

Many of the messages could be encrypted, which means they may be accompanied by to_device EDUs carrying the keys that clients need to decrypt them. If those EDUs aren't sent as part of the catchup transactions, clients may be unable to decrypt the messages, which makes users sad/angry.
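To make the failure mode concrete, here is a toy model of the behaviour described above. It is not Synapse's actual sender code: `drain_catchup_first` and its queue handling are hypothetical, and the only value taken from the federation spec is the limit of 50 PDUs per transaction (the separate EDU limit is ignored for brevity). The sketch assumes EDUs are held back for as long as any PDU backlog remains.

```python
from collections import deque

MAX_PDUS_PER_TXN = 50  # per-transaction PDU limit from the federation spec


def drain_catchup_first(pdus, to_device_edus):
    """Toy model (hypothetical): EDUs are only sent once the PDU backlog is empty."""
    pdu_queue = deque(pdus)
    edu_queue = deque(to_device_edus)
    transactions = []
    while pdu_queue or edu_queue:
        take = min(MAX_PDUS_PER_TXN, len(pdu_queue))
        batch_pdus = [pdu_queue.popleft() for _ in range(take)]
        batch_edus = []
        # Assumption: EDUs are withheld while any PDU backlog remains.
        if not pdu_queue:
            batch_edus = list(edu_queue)
            edu_queue.clear()
        transactions.append({"pdus": batch_pdus, "edus": batch_edus})
    return transactions


txns = drain_catchup_first(
    pdus=[f"pdu{i}" for i in range(120)],
    to_device_edus=["key_a", "key_b", "key_c"],
)
```

With a backlog of 120 PDUs, the keys in this model only go out in the third and final transaction, after every encrypted event has already been delivered.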

Here's an example of this happening in real life:
[image: graph, described below]

For background on this graph: t2bot.io (the server in question) runs 2 federation readers, 1 of which (03) is dedicated to just handling matrix.org's traffic. The other (04) is left to handle any other random server which might exist in the wild.

In the graph, t2bot.io was behind on matrix.org's transactions and thus had a very spiky waveform, because the 50-PDU transactions had to be retried. When it did catch up, it was also met with all the EDUs it had missed, creating a significant spike. Traffic after that is normal.

This has been observed on several catchups already, but was only noticed today (with Synapse 1.22.0). It's unclear whether this also affects prior versions of Synapse, or whether it is specific to matrix.org's federation sender.

Version information

  • Homeserver: t2bot.io

If not matrix.org:

  • Version: 1.22.0 (with minor, unrelated, patches)

  • Install method: pip

  • Platform: Ubuntu 20.04, bare metal
@matrixbot matrixbot changed the title from "Dummy issue" to "Federation catchup doesn't send to_device EDUs until the remote end has caught up" Dec 21, 2023
@matrixbot matrixbot reopened this Dec 21, 2023
@richvdh richvdh added S-Major and removed S-Minor labels Sep 23, 2024
@richvdh
Member

richvdh commented Sep 23, 2024

(From matrix-org/synapse#8691 (comment))

This is a particular problem, because if your server spends a lot of time lagging behind, then you can end up receiving room events but never the e2e keys for those events

This is still a real problem, causing real UTDs (unable-to-decrypt errors). IMHO to-device events should be prioritised ahead of PDUs.
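As a rough illustration of what that prioritisation could look like, here is a minimal sketch. It is not Synapse's implementation: `build_transaction` and the three pending lists are hypothetical names. The only facts taken from the federation spec are the per-transaction limits of 50 PDUs and 100 EDUs. The idea is that to-device EDUs claim the EDU slots first, and that EDUs are included in every transaction, including catch-up ones.

```python
MAX_PDUS_PER_TXN = 50   # federation spec limit
MAX_EDUS_PER_TXN = 100  # federation spec limit


def build_transaction(pending_to_device, pending_other_edus, pending_pdus):
    """Hypothetical builder: fill EDU slots with to-device messages first.

    Removes whatever it takes from the three pending lists.
    """
    # To-device EDUs get first claim on the EDU slots.
    edus = pending_to_device[:MAX_EDUS_PER_TXN]
    del pending_to_device[:len(edus)]

    # Any remaining EDU slots go to other EDUs (typing, receipts, ...).
    room = MAX_EDUS_PER_TXN - len(edus)
    others = pending_other_edus[:room]
    del pending_other_edus[:len(others)]
    edus.extend(others)

    # PDUs are added to the same transaction, backlog or not.
    pdus = pending_pdus[:MAX_PDUS_PER_TXN]
    del pending_pdus[:len(pdus)]

    return {"edus": edus, "pdus": pdus}


to_device = ["key_a", "key_b", "key_c"]
other_edus = [f"edu{i}" for i in range(150)]
pdus = [f"pdu{i}" for i in range(120)]
first_txn = build_transaction(to_device, other_edus, pdus)
```

Here the first transaction already carries all three to-device messages alongside the first 50 PDUs, so the keys are not stuck behind the rest of the backlog.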

@richvdh
Member

richvdh commented Sep 23, 2024

This exacerbates matrix-org/matrix-spec#1123
