Deleting events causes database corruption #13476

richvdh · 2022-08-08T12:25:07Z

Deleting events - whether you do it via the purge history admin API, the delete room admin API, or by enabling retention on your server - has potential to corrupt the database.

In short, the delete operations can race against other operations (notably event persistence), leading the database to get into an inconsistent state. The main reason is that the delete operations do not invalidate the in-memory caches, but in general there are a bunch of race conditions.

Examples include:

Error backfilling in room for 10 minutes after purging history #9864
completely broken room after purge_room/rejoin #11521
non-existent event-forward-extremities (Forward extremity event missing from database, leading to No state group for unknown or outlier event #12507 (comment)), or outlier forward extremities ("No state group for unknown or outlier event" error in federation #13026). Also ERROR - POST-758856 - Failed handle request via 'RoomMembershipRestServlet' – when trying to leave room #13524.

The text was updated successfully, but these errors were encountered:

richvdh · 2022-08-08T12:30:26Z

Other related issues:

POST /read_markers fails after history purge if cache is cleared #11738
Push notifications stop working when a history purge deletes the event a read receipt is pointing at #12880 (though I don't think that's database corruption, it's just stuff not working)

dklimpel · 2022-08-08T12:52:50Z

Is it a solution to delete the entire cache before deleting? And would that make sense?

dklimpel · 2022-08-08T13:04:25Z

Is this related to?

state_groups (& state_groups_state & state_group_edges) are not fully purged alongside the rooms #12821
My HS has accumulated thousands of unreferenced state groups #3364

richvdh · 2022-08-08T18:54:32Z

Is it a solution to delete the entire cache before deleting? And would that make sense?

Sadly that's not sufficient, since the cache might be immediately repopulated before the delete actually happens.

Is this related to?

I wouldn't say so. Those are the history purge deleting less stuff than they should; they do not cause inconsistency between Synapse and the database and cannot cause incorrect data in the database.

richvdh · 2022-08-15T18:30:18Z

the delete operations can race against other operations (notably event persistence)

Incidentally: I think purging a complete room with no local members should be safe from races, but there are still cache coherency bugs (#11521).

rettichschnidi · 2022-08-18T13:17:49Z

I am a homeserver admin affected by this - database is ever growing and deleting 90+ days old messages is a requirement (privacy aware NGO). Anything I can do to help getting this issue resolved?

anoadragon453 · 2022-10-18T08:49:30Z

Something like #13916 would help in the case of correctly invalidating the EventsWorkerStore._get_event_cache and EventsWorkerStore._event_ref caches upon fully purging a room. Currently, those caches aren't keyed by room ID, and thus invalidating all event entries related to a room in those caches is not possible.

This leads to coherency bugs such as #11521.

Dan-Sun · 2022-10-28T01:06:53Z

We had a massive increase in the database in the last 15 days, about 50 GB. Now it's over 102 GB with about 200 users (20-30 really use it).

We had retention enabled (before the massive increase) to minimize the history for the user and have more storage at the end.

Now we are having trouble getting rid of some rooms, especially the first three:

              room_id            | num_rows  
----------------------------------------------------------
 Matrix HQ:matrix.org		 |   400673076
 BlackArch:matrix.org		 |     7139888
 Arch Linux:archlinux.org	 |     4855957

The #matrix:matrix.org room is deleted and not listed via admin api anymore but still in the database. Any chance to get rid off the few rows? :)

Is this problem now related to this issue here? We have blocked the matrix room for our users until we can solve the problem.

4nd3r · 2022-10-28T10:02:26Z

Little offtopic, but I wonder how matrix.org handles this massive amount of data? Constantly increasing storage requirement is the main reason I don't "idle" in huge public rooms using my HS account. I really hope that core team prioritizes this issue, because after many years, storage requirement starts to get out of hand 😢

DMRobertson · 2022-10-28T10:40:42Z

The #matrix:matrix.org room is deleted and not listed via admin api anymore but still in the database.

@Dan-Sun How exactly did you delete the room?

Dan-Sun · 2022-10-28T19:02:26Z

@DMRobertson
I tried API v1 and 2 (also using force_purge) and both gave me that there were still users in that room that could not be kicked out. One of them was a deactivated user. So I reactivated this user and tried to leave the room manually and via API, which ended with an internal server error.
I asked Travis to try to kick the user for me. I don't know what actually happened in the end, but the user was no longer in the room and I ran API v2 again. It quickly finished the process. And that was really about it. Right now, I'm running the API v2 command every 2 seconds since yesterday. Sometimes it shows that it is still purging, but the row number doesn't change.

We've so far 7 rooms which are still in the database but not via API listed.

Leaving API error

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 512, in _async_render
    callback_return = await raw_callback_return
  File "/usr/lib/python3.10/site-packages/synapse/rest/client/room.py", line 995, in on_POST
    await self.room_member_handler.update_membership(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
    result = await self.update_membership_locked(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
    return await self._local_membership_update(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
    result_event = await self.event_creation_handler.handle_new_client_event(
  File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
    r = await func(self, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
    result, _ = await make_deferred_yieldable(
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
    event = await self.persist_and_notify_client_events(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
    ) = await self._storage_controllers.persistence.persist_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
    res = await self._get_new_state_after_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
    old_state_groups = {
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: '$9n5rfTl0fNUdVLztun-SdVdWd_zSYjKhgR60p7SUxQs'

Delete room error

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room.py", line 1879, in shutdown_room
    _, stream_id = await self.room_member_handler.update_membership(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
    result = await self.update_membership_locked(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
    return await self._local_membership_update(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
    result_event = await self.event_creation_handler.handle_new_client_event(
  File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
    r = await func(self, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
    result, _ = await make_deferred_yieldable(
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
    event = await self.persist_and_notify_client_events(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
    ) = await self._storage_controllers.persistence.persist_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
    res = await self._get_new_state_after_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
    old_state_groups = {
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: 'KEY'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/handlers/pagination.py", line 677, in _shutdown_and_purge_room
    raise SynapseError(
synapse.api.errors.SynapseError: 400: Users are still joined to this room

At the moment we're having these errors:

2022-10-25 01:49:48,161 - synapse.federation.federation_server - 1216 - ERROR - _process_incoming_pdus_in_room_inner-5694 - Failed to handle PDU $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY
Traceback (most recent call last):
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 276, in on_receive_pdu
       await self._get_missing_events_for_pdu(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 752, in _get_missing_events_for_pdu
       await self._process_pulled_events(origin, missing_events, backfilled=False)
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 800, in _process_pulled_events
       await self._process_pulled_event(origin, ev, backfilled=backfilled)
    File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 875, in _process_pulled_event 
       await self._process_received_pdu(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 1281, in _process_received_pdu
       await self._run_push_actions_and_persist_event(event, context, backfilled)
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2125, in _run_push_actions_and_persist_event
       await self.persist_events_and_notify(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2186, in persist_events_and_notify
       ) = await self._storage_controllers.persistence.persist_events(  
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
       ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
     File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
       raise dfe.subFailure.value from None
     File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
       result = context.run(
     File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
       return g.throw(self.type, self.value, self.tb)
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
       return await self._event_persist_queue.add_to_queue(
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
       res = await make_deferred_yieldable(end_item.deferred.observe())
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
       ret = await self._per_item_callback(room_id, item.task)
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
       return await self._persist_event_batch(room_id, task)
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
       res = await self._get_new_state_after_events(
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
       old_state_groups = {
     File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
       event_id_to_state_group[evid] for evid in old_latest_event_ids
   KeyError: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'
   
   The above exception was the direct cause of the following exception:
   
   Traceback (most recent call last):
     File "/usr/lib/python3.10/site-packages/synapse/federation/federation_server.py", line 1206, in _process_incoming_pdus_in_room_inner
       await self._federation_event_handler.on_receive_pdu(
     File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 280, in on_receive_pdu
       raise Exception(
   Exception: Error fetching missing prev_events for $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'

And a lot of those

2022-10-25 01:56:11,270 - synapse.handlers.device - 1129 - WARNING - _maybe_retry_device_resync-114 - Failed to handle device list update for @USER: 403: Forbidden     
2022-10-25 01:56:11,388 - synapse.http.matrixfederationclient - 709 - WARNING - federation_transaction_transmission_loop-254477 - {PUT-O-2497351} [otherServer] Request failed: PUT matrix://otherServer/_matrix/federation/v1/s        end/1666633335267: HttpResponseException('403: Forbidden')

steadfasterX · 2023-02-02T11:14:27Z

Dunno if that would be technically possible but if we could have a maintenance mode (which takes the above into account and so stops almost all communication, federation etc) the retention tasks could be done safely, no?

I would greatly accept these regular "downtimes" to avoid the mentioned issues but disabling retention is impossible for me..

Deleting events causes database corruption, see e.g. matrix-org/synapse#11738 and matrix-org/synapse#13476.

erikjohnston · 2023-05-17T08:54:32Z

I think we can mitigate this a lot by taking the following steps:

Drop all caches related to events when we purge history. We should also purge state caches when we delete a room. c.f. Clear event caches when we purge history #15609 for WIP PR.
Before inserting into event_forward_extremities/event_backward_extremities/etc tables, check that the events still exist in the DB (by doing another select). We already do this when inserting new state groups for example.
- We should ideally go through the list of tables that have an event ID column and do the same thing there.
- If the check fails we should raise an error. This will likely cause us to drop the events entirely, for clients means returning an error, for federation means dropping them.
- These checks could also take the form of adding a foreign key constraint to the relevant columns
- Add foreign key constraint to event_forward_extremities. #15751
Add a cross-worker read/write style lock to ensure that we don't purge history while: 1) creating/persisting an event, 2) backfilling, 3) handling new inbound federation events. Note the cross-worker locks are best effort, so need to be combined with check above to be safe. This would take two steps:
- Add support for waiting on locks to current mechanism. Easiest way is to sit in a loop that tries to acquire the lock every e.g. 100ms, and add a hook to notifier/replication to notify when locks are dropped.
- Add support for read/write locks (trying to ensure reads are fast). This would likely need a different table to the existing lock mechanism.
Fix POST /read_markers to handle the case where existing read markers point to deleted events. We try and read the event to stop clients from replacing read markers with older events, so its probably safe to assume that if we no longer have the existing event then its sufficiently old that we skip the check. Handle missing previous read marker event #15464
Fix push after purging history Push notifications stop working when a history purge deletes the event a read receipt is pointing at #12880. There may be a similar issue with calculating notification counts.
- Add test to ensure push/notifications work after room purge sytest#1361 - Already done, we just need to add a regression test

Note: there are likely more races here that we don't know about, and even if we didn't I don't think implementing all the above steps will necessary fix all the known brokenness. It should massively reduce the frequency of the problems though.

This should help a little with #13476 --------- Co-authored-by: Patrick Cloke <[email protected]>

mcalinghee · 2023-06-08T13:03:49Z

Hi,

I understand the change #15609 will help in terms of reducing the race condition on the DB (correct me if I am wrong)

When upgrading to synapse 1.83.0, we have faced the situation where the DELETE statement in the event table were blocked whereas it was not the case before with synapse 1.59.0

Having in mind that the purge jobs can cause database corruptions, we are disabling the retention policy.

Do you advice that we can re-activate the retention policy after #15609 ?

samuel-p · 2023-07-15T09:05:30Z

I just stumbled across this issue this morning, when I had some outages on my server. Is there anything I can do to find the affected rooms and/or to fix the database again - except restoring a backup?

…eletion (#15791) c.f. #13476

toshanmugaraj · 2023-08-01T07:55:53Z

Instead of deleting from the DB, can it be just marked as purged and filtered out while trying to access through API call. Deactivated events, something similar to deactivated users. Pagination should stop once it reaches the purged event in a timeline.

anoadragon453 · 2023-08-01T10:24:48Z

@toshanmugaraj what you're describing are effectively "redacted" events. These are events that maintain their position in the room's Directed Acyclic Graph (DAG), yet the actual message content and other sensitive data have been removed.

Taking this approach to removing data indeed sidesteps the problems in this issue, but this is issue is talking about deleting events entirely -- something you may still want to do to save on disk space. The message retention feature in Synapse does currently delete events entirely. That makes sense to do for the space saving benefit, but is overkill if you're just looking to remove sensitive data from your database.

There is currently work being done by the team to close this issue, meaning that even deleting an event entirely shouldn't cause database corruption. It's a difficult thing to test for success however, as the problem is mainly caused by race conditions.

@samuel-p It's difficult to give general advice on the subject, as everyone's case is different. You try can deleting the affected room entirely, then restarting your server to clear caches. You can then try to rejoin the room (if it was a room with other homeservers in it). Any power levels your users had in the room will remain in-tact.

ktpx · 2023-08-25T23:12:20Z

I have this issue, and i see no way to fix this in the above.

clokep · 2023-09-25T17:29:38Z

As @DMRobertson announced earlier this month:

In other news, for the last few releases Erik has been working to fix database corruption problems relating to room retention and purging historical events. He put out a call for testing in the Synapse Announcements room last Friday. It is not risk-free, but if anyone is feeling bold and is prepared to handle potential data loss, we would appreciate testing and feedback.

If you're following this issue then do please give it a test and let us know if you have issues.

@erikjohnston said:

I'm fairly confident that the bugs we know about have been fixed

the problem is all the races we don't know about

OTOH, if no one is complaining then hopefully that is good enough

steadfasterX · 2023-09-26T07:52:03Z

how to test?

schildbach · 2023-10-16T08:55:50Z

I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed:

synapse_1   | 2023-10-16 08:31:45,686 - synapse.federation.federation_server - 1258 - ERROR - _process_incoming_pdus_in_room_inner-558 - Failed to handle PDU $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4
synapse_1   | Traceback (most recent call last):
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 285, in on_receive_pdu
synapse_1   |     await self._get_missing_events_for_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 804, in _get_missing_events_for_pdu
synapse_1   |     await self._process_pulled_events(origin, missing_events, backfilled=False)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 937, in _process_pulled_events
synapse_1   |     await _process_new_pulled_events(fresh_events)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 894, in _process_new_pulled_events
synapse_1   |     await self._process_pulled_event(origin, ev, backfilled=backfilled)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1000, in _process_pulled_event
synapse_1   |     await self._process_received_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1441, in _process_received_pdu
synapse_1   |     await self._run_push_actions_and_persist_event(event, context, backfilled)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2215, in _run_push_actions_and_persist_event
synapse_1   |     await self.persist_events_and_notify(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2276, in persist_events_and_notify
synapse_1   |     ) = await self._storage_controllers.persistence.persist_events(
synapse_1   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 421, in persist_events
synapse_1   |     ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
synapse_1   |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/util/async_helpers.py", line 307, in yieldable_gather_results
synapse_1   |     raise dfe.subFailure.value from None
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1993, in _inlineCallbacks
synapse_1   |     result = context.run(
synapse_1   |              ^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
synapse_1   |     return g.throw(self.type, self.value, self.tb)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 416, in enqueue
synapse_1   |     return await self._event_persist_queue.add_to_queue(
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 239, in add_to_queue
synapse_1   |     res = await make_deferred_yieldable(end_item.deferred.observe())
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 282, in handle_queue_loop
synapse_1   |     ret = await self._per_item_callback(room_id, item.task)
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 362, in _process_event_persist_queue_task
synapse_1   |     return await self._persist_event_batch(room_id, task)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 680, in _persist_event_batch
synapse_1   |     res = await self._get_new_state_after_events(
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 880, in _get_new_state_after_events
synapse_1   |     old_state_groups = {
synapse_1   |                        ^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 881, in <setcomp>
synapse_1   |     event_id_to_state_group[evid] for evid in old_latest_event_ids
synapse_1   |     ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
synapse_1   | KeyError: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'
synapse_1   | 
synapse_1   | The above exception was the direct cause of the following exception:
synapse_1   | 
synapse_1   | Traceback (most recent call last):
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1248, in _process_incoming_pdus_in_room_inner
synapse_1   |     await self._federation_event_handler.on_receive_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 289, in on_receive_pdu
synapse_1   |     raise Exception(
synapse_1   | Exception: Error fetching missing prev_events for $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'

Wenri · 2023-10-18T04:37:22Z

I also got the missing prev_events error after purge history API calls. Running Synapse 1.93.0. It seems the problem still exists. I posted a message in Synapse Admins, but I guess it got buried in other messages.

The first error in the log after the purge history is:

2023-10-08 12:39:42,836 - synapse.federation.federation_server - 1260 - ERROR - _process_incoming_pdus_in_room_inner-10514 - Failed to handle PDU $VLE4qs4mVWxD2Wup11xH0lqXINR9nb8CFo_Oys4GpgM
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1250, in _process_incoming_pdus_in_room_inner
    await self._federation_event_handler.on_receive_pdu(
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 328, in on_receive_pdu
    await self._process_received_pdu(origin, pdu, context)
...
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 205, in commit
    self.conn.commit()
psycopg2.errors.ForeignKeyViolation: insert or update on table "event_forward_extremities" violates foreign key constraint "event_forward_extremities_event_id"
DETAIL:  Key (event_id)=($0WM-Av1vnFc_yJFA-X_d3MWqJY5mpW0mAomWXtkcm8w) is not present in table "events".

Maybe It is time to reopen this issue.

schildbach · 2023-10-18T10:50:27Z

I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed…

I remedied my problem by force-removing the affected room. There is a certain chance this particular event is a left-over from a time where the issue wasn't yet fixed.

I'll let you know if it crops up again.

schildbach · 2023-12-03T23:20:41Z

It just cropped up again, after the usual database maintenance.

This was referenced Aug 8, 2022

retention comes with insufficient warnings #13475

Closed

Forward extremity event missing from database, leading to No state group for unknown or outlier event #12507

Closed

richvdh added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Aug 8, 2022

richvdh mentioned this issue Aug 8, 2022

"No state group for unknown or outlier event" error in federation #13026

Closed

H-Shay added the X-Needs-Discussion label Aug 9, 2022

squahtx mentioned this issue Aug 11, 2022

Add a warning to retention documentation regarding the possibility of database corruption #13497

Merged

erikjohnston removed the X-Needs-Discussion label Aug 11, 2022

richvdh mentioned this issue Aug 15, 2022

ERROR - POST-758856 - Failed handle request via 'RoomMembershipRestServlet' – when trying to leave room #13524

Closed

DMRobertson added O-Occasional Affects or can be seen by some users regularly or most users rarely A-Admin-API A-Message-Retention-Policies Issues related to Synapse's support of MSC1763 (message retention policies) labels Aug 24, 2022

anoadragon453 mentioned this issue Oct 12, 2022

Remove _get_events_cache check optimisation from _have_seen_events_dict #14161

Merged

anoadragon453 mentioned this issue Oct 18, 2022

Avoid checking the event cache when backfilling events #14164

Merged

aaronraimist mentioned this issue Nov 8, 2022

Error during MatrixClient request GET /_matrix/client/r0/joined_rooms: 401 Unauthorized spantaleev/matrix-docker-ansible-deploy#2243

Open

babolivier mentioned this issue Nov 21, 2022

MSC1763: Proposal for specifying configurable message retention periods matrix-org/matrix-spec-proposals#1763

Open

DMRobertson mentioned this issue Nov 29, 2022

An event was sent twice down /sync (and down application service txns?) consistently across clients; ghost notifications #14539

Open

archlinux-github pushed a commit to archlinux/infrastructure that referenced this issue Feb 16, 2023

matrix: Disable retention

dff25b3

Deleting events causes database corruption, see e.g. matrix-org/synapse#11738 and matrix-org/synapse#13476.

MadLittleMods mentioned this issue May 19, 2023

Add automatic purge after all users forget a room #15488

Merged

4 tasks

erikjohnston added a commit that referenced this issue Jun 8, 2023

Clear event caches when we purge history (#15609)

c485ed1

This should help a little with #13476 --------- Co-authored-by: Patrick Cloke <[email protected]>

This was referenced Jun 9, 2023

Add foreign key constraint to event_forward_extremities. #15751

Merged

Add ability to wait for locks and add locks to purge history / room deletion #15791

Merged

MadLittleMods mentioned this issue Jun 23, 2023

Add basic read/write lock #15782

Merged

MadLittleMods added the A-Corruption Things that have led to unexpected state in Synapse or the database label Jun 26, 2023

MadLittleMods mentioned this issue Jun 26, 2023

Unable to leave room. KeyError #15796

Closed

erikjohnston mentioned this issue Jul 26, 2023

Add test to ensure push/notifications work after room purge matrix-org/sytest#1361

Merged

erikjohnston added a commit that referenced this issue Jul 31, 2023

Add ability to wait for locks and add locks to purge history / room d…

ae55cc1

…eletion (#15791) c.f. #13476

This was referenced Aug 3, 2023

[500] Internal Server Error when trying to leave room #15967

Closed

Make the "don't use retention" warning less subtle #16062

Closed

clokep mentioned this issue Sep 25, 2023

Remove warnings from the docs about using message retention. #16382

Merged

clokep linked a pull request Sep 25, 2023 that will close this issue

Remove warnings from the docs about using message retention. #16382

Merged

clokep closed this as completed in #16382 Sep 28, 2023

This was referenced Jan 30, 2024

Unable to leave a certrain room (KeyError). element-hq/synapse#9260

Open

Issue joining rooms element-hq/synapse#15863

Open

KeyError on recent background updates element-hq/synapse#16054

Open

NicolasBuquet mentioned this issue Mar 13, 2024

Configure per-room retention period for messages tchapgouv/tchap-ios#239

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deleting events causes database corruption #13476

Deleting events causes database corruption #13476

richvdh commented Aug 8, 2022 •

edited

Loading

richvdh commented Aug 8, 2022

dklimpel commented Aug 8, 2022

dklimpel commented Aug 8, 2022

richvdh commented Aug 8, 2022

richvdh commented Aug 15, 2022

rettichschnidi commented Aug 18, 2022

anoadragon453 commented Oct 18, 2022

Dan-Sun commented Oct 28, 2022 •

edited

Loading

4nd3r commented Oct 28, 2022

DMRobertson commented Oct 28, 2022

Dan-Sun commented Oct 28, 2022 •

edited

Loading

steadfasterX commented Feb 2, 2023

erikjohnston commented May 17, 2023 •

edited by DMRobertson

Loading

mcalinghee commented Jun 8, 2023

samuel-p commented Jul 15, 2023

toshanmugaraj commented Aug 1, 2023 •

edited

Loading

anoadragon453 commented Aug 1, 2023

ktpx commented Aug 25, 2023

clokep commented Sep 25, 2023 •

edited

Loading

steadfasterX commented Sep 26, 2023

schildbach commented Oct 16, 2023

Wenri commented Oct 18, 2023

schildbach commented Oct 18, 2023

schildbach commented Dec 3, 2023

Deleting events causes database corruption #13476

Deleting events causes database corruption #13476

Comments

richvdh commented Aug 8, 2022 • edited Loading

richvdh commented Aug 8, 2022

dklimpel commented Aug 8, 2022

dklimpel commented Aug 8, 2022

richvdh commented Aug 8, 2022

richvdh commented Aug 15, 2022

rettichschnidi commented Aug 18, 2022

anoadragon453 commented Oct 18, 2022

Dan-Sun commented Oct 28, 2022 • edited Loading

4nd3r commented Oct 28, 2022

DMRobertson commented Oct 28, 2022

Dan-Sun commented Oct 28, 2022 • edited Loading

Leaving API error

Delete room error

steadfasterX commented Feb 2, 2023

erikjohnston commented May 17, 2023 • edited by DMRobertson Loading

mcalinghee commented Jun 8, 2023

samuel-p commented Jul 15, 2023

toshanmugaraj commented Aug 1, 2023 • edited Loading

anoadragon453 commented Aug 1, 2023

ktpx commented Aug 25, 2023

clokep commented Sep 25, 2023 • edited Loading

steadfasterX commented Sep 26, 2023

schildbach commented Oct 16, 2023

Wenri commented Oct 18, 2023

schildbach commented Oct 18, 2023

schildbach commented Dec 3, 2023

richvdh commented Aug 8, 2022 •

edited

Loading

Dan-Sun commented Oct 28, 2022 •

edited

Loading

Dan-Sun commented Oct 28, 2022 •

edited

Loading

erikjohnston commented May 17, 2023 •

edited by DMRobertson

Loading

toshanmugaraj commented Aug 1, 2023 •

edited

Loading

clokep commented Sep 25, 2023 •

edited

Loading