This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
Delete room API gets stuck in a very slow database query and (practically) never completes #15635
Labels
A-Database
DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db
A-Performance
Performance, both client-facing and admin-facing
O-Uncommon
Most users are unlikely to come across this or unexpected workflow
S-Critical
Blocks development, potential data loss, more than 25% of users possibly affected, no workarounds.
X-Regression
Something broke which worked on a previous release
Description
Attempting to delete a room using the delete room API gets stuck inside a PostgreSQL query deleting from the `events` table:

The relevant PostgreSQL backend process becomes pinned at 100% CPU with almost no I/O for a period of time disproportionate to the number of affected rows (>30 minutes for 70k-ish rows on a high-end workstation-grade machine; I ran out of patience before the query ever completed).
Steps to reproduce
1. Attempt to delete a room (`!HbIRFiVYbAwQBbobIc:matrix.org`) using the delete room API
2. Observe the `DELETE FROM events ...` query running for a disproportionately long time
3. Terminate the stuck query: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = 'synapse' AND state = 'active' AND query LIKE 'DELETE FROM events %';`
Homeserver
intelfx.name
Synapse Version
{"server_version":"1.83.0","python_version":"3.11.3"}
Installation Method
Other (please mention below)
Database
PostgreSQL 15.3 (single, never ported from SQLite, restored from backups in the past)
Workers
Multiple workers
Platform
Configuration
`room_memberships` is ~1GB
Relevant log output
Synapse log:
(repeated queries at the end are related to polling from my purge room script)
pg_stat_activity:
PostgreSQL logs:
After terminating the query, PostgreSQL logs are enlightening as to why this might happen:
Anything else that would be useful to know?
Judging by PostgreSQL logs (the context is reproducible across multiple purge attempts), the database spends most of the time verifying foreign key constraints for various tables referencing `events.stream_ordering`:

It looks like none of these tables are indexed on `event_stream_ordering`, whereas the `SELECT 1 FROM ONLY ... FOR KEY SHARE OF x` query seems to be a part of how the FK constraints are actually implemented in PostgreSQL.

Excerpt of the `EXPLAIN ANALYZE` for the problematic query (note execution time for `LIMIT 10`):

This exact issue has also been spotted in the wild on pgsql-general: https://www.postgresql.org/message-id/CAA-aLv78noHZ2_nFyxd3zxoRPvq6Gm2enKpRuoxm56PtALU3Bw@mail.gmail.com
Creating the necessary indices by hand fixes the issue:
All in all, looks like a regression from #15356.
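To illustrate the mechanism described above, here is a minimal, self-contained sketch. It assumes an in-memory SQLite database as a stand-in for PostgreSQL, purely so it runs without a server. The two-table schema and the index name `room_memberships_event_stream_ordering_idx` are hypothetical, loosely modelled on the names mentioned in this report; this is not Synapse's actual schema or migration. It compares the query plan for the kind of per-row lookup a foreign key check performs on the referencing table, before and after that column is indexed.

```python
import sqlite3

# ASSUMPTION: SQLite is only a stand-in for PostgreSQL so the sketch is
# self-contained. The schema below is a simplified, hypothetical model.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        event_id TEXT PRIMARY KEY,
        stream_ordering INTEGER NOT NULL UNIQUE
    );
    CREATE TABLE room_memberships (
        event_id TEXT NOT NULL,
        event_stream_ordering INTEGER REFERENCES events (stream_ordering)
    );
    """
)


def fk_lookup_plan(connection):
    """Return the plan steps for a lookup on the referencing column.

    This mimics the shape of the check run for each deleted parent row:
    "is there any child row still pointing at this stream_ordering?"
    """
    rows = connection.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT 1 FROM room_memberships WHERE event_stream_ordering = 1"
    ).fetchall()
    # The last column of each row is the human-readable plan step.
    return [row[-1] for row in rows]


plan_without_index = fk_lookup_plan(conn)

# Hypothetical index name; the point is only that the referencing column
# gets an index of its own.
conn.execute(
    "CREATE INDEX room_memberships_event_stream_ordering_idx "
    "ON room_memberships (event_stream_ordering)"
)
plan_with_index = fk_lookup_plan(conn)

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

Without the index, the plan is a full scan of the referencing table, repeated once per deleted `events` row. With it, the plan becomes an index search. The exact plan wording varies between SQLite versions, and PostgreSQL's planner output looks different, but the scan-versus-index-lookup distinction is the same one at play in this report.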