replication timeouts due to message retention purge jobs #16489

Open
matrixbot opened this issue Dec 21, 2023 · 1 comment


matrixbot commented Dec 21, 2023

This issue has been migrated from #16489.


Description

Presumably due to matrix-org/synapse#13632, the master process cannot handle replication requests from workers because of the load from purge jobs. It keeps logging updates on the purge job states while clients can no longer connect.

Steps to reproduce

  • enable message retention
  • run a large instance (possibly required, not confirmed)
  • wait for the scheduled job to execute

Homeserver

tchncs.de

Synapse Version

1.94.0

Installation Method

pip (from PyPI)

Database

PostgreSQL

Workers

Multiple workers

Platform

Debian GNU/Linux 12 (bookworm), dedicated

Configuration

draupnir module, presence, retention

Relevant log output

synapse.replication.tcp.client - 352 - INFO - _process_incoming_pdus_in_room_inner-124023-$fbrT_6mck678v_gNV527V0f5Jp4kvbDiQVSeHOmiN2E - Finished waiting for repl stream 'events' to reach 361593234 (event_persister1)
synapse.http.client - 923 - INFO - PUT-890470 - Received response to POST synapse-replication://master/_synapse/replication/fed_send_edu/m.receipt/IjFSBKBxIa: 200
synapse.replication.tcp.client - 332 - INFO - PUT-890470 - Waiting for repl stream 'caches' to reach 416737455 (master); currently at: 416710210
synapse.replication.tcp.client - 342 - WARNING - PUT-890464 - Timed out waiting for repl stream 'caches' to reach 416737417 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.tcp.client - 342 - WARNING - PUT-890470 - Timed out waiting for repl stream 'caches' to reach 416737422 (master); currently at: 416710510
synapse.replication.http._base - 300 - WARNING - GET-2559861 - presence_set_state request timed out; retrying
synapse.replication.http._base - 312 - WARNING - PUT-899550 - fed_send_edu request connection failed; retrying in 1s: ConnectError(<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>)
synapse.http.client - 932 - INFO - PUT-901284 - Error sending request to  POST synapse-replication://master/_synapse/replication/fed_send_edu/m.presence/WCoECfmCdH: ConnectError [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.
]
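
To gauge how far behind a worker is, the "Waiting for" and "Timed out waiting for" lines above can be reduced to a gap between the target and current stream positions. The following is a minimal sketch, assuming the log lines keep exactly the wording quoted above; the `replication_lag` helper and its regular expression are hypothetical and not part of Synapse. The gap is measured in stream positions, not in time.

```python
import re

# Hypothetical pattern, written against the log lines quoted in this issue.
# It matches both "Waiting for ..." and "Timed out waiting for ..." messages.
PATTERN = re.compile(
    r"repl stream '(?P<stream>[^']+)' to reach (?P<target>\d+) "
    r"\((?P<writer>[^)]+)\); currently at: (?P<current>\d+)"
)


def replication_lag(line):
    """Return (stream, writer, gap) for a matching log line, else None.

    gap is the number of stream positions the worker still has to
    receive before the request it is handling can proceed.
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    gap = int(match["target"]) - int(match["current"])
    return match["stream"], match["writer"], gap


if __name__ == "__main__":
    sample = (
        "synapse.replication.tcp.client - 342 - WARNING - PUT-890464 - "
        "Timed out waiting for repl stream 'caches' to reach 416737417 "
        "(master); currently at: 416710510"
    )
    print(replication_lag(sample))
```

Running this over a whole log file and watching whether the gap for the 'caches' stream keeps growing while purge jobs run would be one way to check whether the master is falling behind rather than the workers.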
matrixbot changed the title from "Dummy issue" to "replication timeouts due to message retention purge jobs" on Dec 22, 2023
matrixbot reopened this on Dec 22, 2023


Dominion0815 commented Feb 19, 2024

Same here with a Docker + worker setup on Synapse v1.101.0.
Disabling retention does not seem to be enough at the moment; the problem reappears after a while.

Update:
I have set retention.enabled to false, and still every night at 5 am, after a PostgreSQL backup, I get the error
"Timed out waiting for repl stream".

Update 2:
OK, I also get the error in the middle of the day.
Can someone please take care of this?
