Federation catchup doesn't send to_device EDUs until the remote end has caught up #8691

matrixbot · 2023-12-18T07:55:07Z

This issue has been migrated from #8691.

Description

When a remote server falls behind on federation, Synapse backs off and starts batching up requests. Usually this isn't too bad, as the remote end will only be 1 or 2 transactions behind. More serious occurrences, however, can put the server behind by hundreds of transactions or thousands of events.

Many of the messages could be encrypted, which means they may be accompanied by to_device EDUs that clients need in order to decrypt them. If those EDUs aren't sent as part of the catchup transactions, clients may be unable to decrypt the messages, making users sad/angry.

Here's an example of this happening in real life:

For background on this graph: t2bot.io (the server in question) runs 2 federation readers, 1 of which (03) is dedicated to just handling matrix.org's traffic. The other (04) is left to handle any other random server which might exist in the wild.

In the graph, t2bot.io was behind on matrix.org's transactions and thus had a very spiky waveform due to the 50-PDU transactions having to be retried. When it did catch up, it was also hit with all the EDUs it had missed, creating a significant spike. Traffic after that is normal.

This has been observed on several catchups already, but was only noticed today (with Synapse 1.22.0). It's unclear whether this is an issue in prior versions of Synapse, or an issue specific to matrix.org's federation sender.

Version information

Homeserver: t2bot.io

If not matrix.org:

Version: 1.22.0 (with minor, unrelated, patches)
Install method: pip

Platform: Ubuntu 20.04, bare metal

richvdh · 2024-09-23T18:32:02Z

(From matrix-org/synapse#8691 (comment))

This is a particular problem, because if your server spends a lot of time lagging behind, then you can end up receiving room events but never the e2e keys for those events.

This is still a real problem, causing real UTDs. IMHO to-device events should be prioritised ahead of PDUs.
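As a rough illustration of what prioritising to-device events could look like, here is a minimal sketch of a per-destination queue that always drains pending to_device EDUs into the next transaction, even while a PDU backlog is still being worked through. The `DestinationQueue` class and its field names are hypothetical and are not Synapse's actual sender code; the only values taken from elsewhere are the per-transaction limits of 50 PDUs and 100 EDUs from the Matrix server-server spec.

```python
from collections import deque
from dataclasses import dataclass, field

# Per-transaction limits from the Matrix server-server spec.
MAX_PDUS = 50
MAX_EDUS = 100


@dataclass
class DestinationQueue:
    """Hypothetical per-destination outbound queue (not Synapse's real class)."""

    pdus: deque = field(default_factory=deque)
    to_device_edus: deque = field(default_factory=deque)
    other_edus: deque = field(default_factory=deque)

    def next_transaction(self) -> dict:
        """Build the next transaction, taking to_device EDUs before anything else."""
        edus = []
        # to_device EDUs claim EDU slots first, so keys are not starved.
        while self.to_device_edus and len(edus) < MAX_EDUS:
            edus.append(self.to_device_edus.popleft())
        # Remaining EDU slots go to other EDUs (typing, receipts, ...).
        while self.other_edus and len(edus) < MAX_EDUS:
            edus.append(self.other_edus.popleft())
        # PDUs have their own limit, so sending EDUs does not displace them.
        pdus = []
        while self.pdus and len(pdus) < MAX_PDUS:
            pdus.append(self.pdus.popleft())
        return {"pdus": pdus, "edus": edus}
```

Because PDUs and EDUs have separate limits within a transaction, a scheme along these lines would not slow down PDU catchup; it only ensures that pending to_device EDUs travel with the earliest transactions rather than after the backlog clears.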

richvdh · 2024-09-23T18:33:18Z

This exacerbates matrix-org/matrix-spec#1123

matrixbot closed this as completed Dec 18, 2023

matrixbot changed the title ~~Dummy issue~~ Federation catchup doesn't send to_device EDUs until the remote end has caught up Dec 21, 2023

matrixbot added A-Federation S-Minor O-Occasional T-Defect labels Dec 21, 2023

matrixbot reopened this Dec 21, 2023

richvdh added S-Major and removed S-Minor labels Sep 23, 2024

richvdh mentioned this issue Sep 23, 2024

Unable To Decrypt meta issue element-hq/element-meta#245

Open

82 tasks
