This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Deleting events causes database corruption #13476

Closed
richvdh opened this issue Aug 8, 2022 · 24 comments · Fixed by #16382
Labels
A-Admin-API
A-Corruption: Things that have led to unexpected state in Synapse or the database
A-Message-Retention-Policies: Issues related to Synapse's support of MSC1763 (message retention policies)
O-Occasional: Affects or can be seen by some users regularly or most users rarely
roadmap
S-Major: Major functionality / product severely impaired, no satisfactory workaround.
T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Member

richvdh commented Aug 8, 2022

Deleting events - whether you do it via the purge history admin API, the delete room admin API, or by enabling retention on your server - has potential to corrupt the database.
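
For reference, a minimal sketch of how the first two deletion paths are typically invoked against the admin API (a sketch only: the homeserver URL, access token and room ID are placeholders, and the request shapes follow my reading of the Synapse admin API docs):

import requests

# Placeholders: substitute a real admin access token, homeserver URL and room ID.
HOMESERVER = "https://matrix.example.org"
HEADERS = {"Authorization": "Bearer <admin_access_token>"}
ROOM_ID = "!someroom:example.org"

# Purge history admin API: delete events in a room up to a timestamp (in ms).
# Local events are only deleted if delete_local_events is set to true.
resp = requests.post(
    f"{HOMESERVER}/_synapse/admin/v1/purge_history/{ROOM_ID}",
    headers=HEADERS,
    json={"purge_up_to_ts": 1659916800000, "delete_local_events": False},
)
print(resp.json())  # should contain a purge_id for polling the purge status

# Delete room admin API (v2): kick local users, optionally block, then purge.
resp = requests.delete(
    f"{HOMESERVER}/_synapse/admin/v2/rooms/{ROOM_ID}",
    headers=HEADERS,
    json={"block": False, "purge": True},
)
print(resp.json())  # should contain a delete_id for polling the deletion status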

In short, the delete operations can race against other operations (notably event persistence), causing the database to end up in an inconsistent state. The main culprit is that the delete operations do not invalidate the in-memory caches, but more generally there are a number of race conditions.

Examples include:

@richvdh richvdh added the S-Major and T-Defect labels on Aug 8, 2022
@richvdh
Member Author

richvdh commented Aug 8, 2022

@dklimpel
Contributor

dklimpel commented Aug 8, 2022

Would it be a solution to delete the entire cache before deleting? And would that make sense?

@richvdh
Member Author

richvdh commented Aug 8, 2022

Would it be a solution to delete the entire cache before deleting? And would that make sense?

Sadly that's not sufficient, since the cache might be immediately repopulated before the delete actually happens.
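
To sketch the race in question (a schematic toy example, not Synapse's actual cache or storage code): if the purge invalidates the cache before the underlying rows are gone, a concurrent read can repopulate the cache from the not-yet-deleted rows, leaving a stale entry behind once the delete completes.

import asyncio

# Toy cache + "database" illustrating the invalidate-then-delete race.
db = {"$event1": {"room_id": "!room:example.org", "content": "hello"}}
cache = {}

async def get_event(event_id):
    # Read path: consult the cache, fall back to the DB and repopulate the cache.
    if event_id in cache:
        return cache[event_id]
    await asyncio.sleep(0)       # simulate an async DB round-trip
    event = db.get(event_id)
    if event is not None:
        cache[event_id] = event  # repopulation happens here
    return event

async def purge_event(event_id):
    cache.pop(event_id, None)    # invalidate first...
    await asyncio.sleep(0.01)    # ...but the DELETE has not happened yet
    db.pop(event_id, None)       # now delete from the "database"

async def main():
    await asyncio.gather(purge_event("$event1"), get_event("$event1"))
    # The event is gone from the DB, but the cache still says it exists.
    print("db:", db.get("$event1"), "cache:", cache.get("$event1"))

asyncio.run(main())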

Is this related to?

I wouldn't say so. Those are cases of the history purge deleting less than it should; they do not cause inconsistency between Synapse and the database, and cannot result in incorrect data in the database.

@richvdh
Member Author

richvdh commented Aug 15, 2022

the delete operations can race against other operations (notably event persistence)

Incidentally: I think purging a complete room with no local members should be safe from races, but there are still cache coherency bugs (#11521).

@rettichschnidi
Contributor

I am a homeserver admin affected by this - the database is ever-growing, and deleting messages older than 90 days is a requirement (we are a privacy-aware NGO). Is there anything I can do to help get this issue resolved?

@DMRobertson DMRobertson added the O-Occasional, A-Admin-API and A-Message-Retention-Policies labels on Aug 24, 2022
@anoadragon453
Member

Something like #13916 would help with correctly invalidating the EventsWorkerStore._get_event_cache and EventsWorkerStore._event_ref caches when fully purging a room. Currently, those caches aren't keyed by room ID, so it is not possible to invalidate all event entries related to a room.

This leads to coherency bugs such as #11521.
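
To illustrate the keying problem (a toy sketch with made-up structures, not the actual EventsWorkerStore internals): with a cache keyed only by event ID, purging a room means either scanning every cached entry or dropping the entire cache, whereas a secondary room-to-events index makes targeted invalidation cheap.

# Toy example: why per-room invalidation is awkward when the cache is keyed
# only by event ID. Names here are invented, not Synapse internals.
event_cache = {
    "$a": {"room_id": "!big:example.org", "content": "..."},
    "$b": {"room_id": "!small:example.org", "content": "..."},
}

def invalidate_room_by_scan(room_id):
    # Without a room index the only options are a full scan or clearing everything.
    for event_id in [eid for eid, ev in event_cache.items() if ev["room_id"] == room_id]:
        del event_cache[event_id]

# With a secondary index, invalidation becomes a handful of lookups instead.
room_index = {"!big:example.org": {"$a"}, "!small:example.org": {"$b"}}

def invalidate_room_by_index(room_id):
    for event_id in room_index.pop(room_id, set()):
        event_cache.pop(event_id, None)

invalidate_room_by_index("!big:example.org")
print(event_cache)  # only events from the other room remain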

@Dan-Sun

Dan-Sun commented Oct 28, 2022

We had a massive increase in database size over the last 15 days, about 50 GB. It is now over 102 GB, with about 200 users (20-30 of whom really use it).

We had retention enabled (before the massive increase) to keep users' history small and save storage.

Now we are having trouble getting rid of some rooms, especially the first three:

 room_id                   |  num_rows
---------------------------+------------
 Matrix HQ:matrix.org      |  400673076
 BlackArch:matrix.org      |    7139888
 Arch Linux:archlinux.org  |    4855957
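
(For reference, per-room counts like these can be produced with a query along the following lines; a sketch assuming direct PostgreSQL access to the Synapse database and that the counts come from the events table. It returns raw room IDs rather than display names.)

import psycopg2

# Sketch: count rows per room in Synapse's events table.
# Connection parameters are placeholders.
conn = psycopg2.connect("dbname=synapse user=synapse_user host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT room_id, count(*) AS num_rows
        FROM events
        GROUP BY room_id
        ORDER BY num_rows DESC
        LIMIT 10
        """
    )
    for room_id, num_rows in cur.fetchall():
        print(room_id, num_rows)
conn.close()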

The #matrix:matrix.org room has been deleted and is no longer listed via the admin API, but it is still in the database. Any chance to get rid of the few rows? :)

Is this problem related to this issue here? We have blocked the Matrix HQ room for our users until we can solve the problem.

@4nd3r
Contributor

4nd3r commented Oct 28, 2022

A little off-topic, but I wonder how matrix.org handles this massive amount of data. The constantly increasing storage requirement is the main reason I don't "idle" in huge public rooms with my HS account. I really hope the core team prioritizes this issue, because after many years the storage requirement is starting to get out of hand 😢

@DMRobertson
Contributor

The #matrix:matrix.org room has been deleted and is no longer listed via the admin API, but it is still in the database.

@Dan-Sun How exactly did you delete the room?

@Dan-Sun

Dan-Sun commented Oct 28, 2022

@DMRobertson
I tried the delete room API v1 and v2 (also using force_purge), and both told me there were still users in that room who could not be kicked out. One of them was a deactivated user. So I reactivated this user and tried to leave the room manually and via the API, which ended with an internal server error.
I asked Travis to try to kick the user for me. I don't know what actually happened in the end, but the user was no longer in the room, so I ran API v2 again. It finished the process quickly, and that was about it. Since yesterday I have been running the API v2 command every 2 seconds. Sometimes it shows that it is still purging, but the row count doesn't change.
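
(As an aside, rather than re-issuing the delete, the v2 room deletion API also exposes a status endpoint that can be polled; a sketch assuming the delete_status endpoint described in the admin API docs, with placeholder homeserver, token and room ID:)

import time
import requests

HOMESERVER = "https://matrix.example.org"
HEADERS = {"Authorization": "Bearer <admin_access_token>"}
ROOM_ID = "!someroom:example.org"

# Poll the v2 delete-status endpoint instead of re-running the DELETE itself.
while True:
    resp = requests.get(
        f"{HOMESERVER}/_synapse/admin/v2/rooms/{ROOM_ID}/delete_status",
        headers=HEADERS,
    )
    results = resp.json().get("results", [])
    print(results)  # each entry should have a delete_id and a status such as "purging"
    if results and all(r.get("status") in ("complete", "failed") for r in results):
        break
    time.sleep(10)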

So far we have 7 rooms which are still in the database but no longer listed via the API.

Leaving API error

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 306, in _async_render_wrapper
    callback_return = await self._async_render(request)
  File "/usr/lib/python3.10/site-packages/synapse/http/server.py", line 512, in _async_render
    callback_return = await raw_callback_return
  File "/usr/lib/python3.10/site-packages/synapse/rest/client/room.py", line 995, in on_POST
    await self.room_member_handler.update_membership(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
    result = await self.update_membership_locked(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
    return await self._local_membership_update(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
    result_event = await self.event_creation_handler.handle_new_client_event(
  File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
    r = await func(self, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
    result, _ = await make_deferred_yieldable(
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
    event = await self.persist_and_notify_client_events(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
    ) = await self._storage_controllers.persistence.persist_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
    res = await self._get_new_state_after_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
    old_state_groups = {
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: '$9n5rfTl0fNUdVLztun-SdVdWd_zSYjKhgR60p7SUxQs'

Delete room error

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room.py", line 1879, in shutdown_room
    _, stream_id = await self.room_member_handler.update_membership(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 575, in update_membership
    result = await self.update_membership_locked(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 1032, in update_membership_locked
    return await self._local_membership_update(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/room_member.py", line 437, in _local_membership_update
    result_event = await self.event_creation_handler.handle_new_client_event(
  File "/usr/lib/python3.10/site-packages/synapse/util/metrics.py", line 113, in measured_func
    r = await func(self, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1382, in handle_new_client_event
    result, _ = await make_deferred_yieldable(
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1479, in _persist_events
    event = await self.persist_and_notify_client_events(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/message.py", line 1871, in persist_and_notify_client_events
    ) = await self._storage_controllers.persistence.persist_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
    res = await self._get_new_state_after_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
    old_state_groups = {
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: 'KEY'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/handlers/pagination.py", line 677, in _shutdown_and_purge_room
    raise SynapseError(
synapse.api.errors.SynapseError: 400: Users are still joined to this room

At the moment we're having these errors:

2022-10-25 01:49:48,161 - synapse.federation.federation_server - 1216 - ERROR - _process_incoming_pdus_in_room_inner-5694 - Failed to handle PDU $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 276, in on_receive_pdu
    await self._get_missing_events_for_pdu(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 752, in _get_missing_events_for_pdu
    await self._process_pulled_events(origin, missing_events, backfilled=False)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 800, in _process_pulled_events
    await self._process_pulled_event(origin, ev, backfilled=backfilled)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 875, in _process_pulled_event
    await self._process_received_pdu(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 1281, in _process_received_pdu
    await self._run_push_actions_and_persist_event(event, context, backfilled)
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2125, in _run_push_actions_and_persist_event
    await self.persist_events_and_notify(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 2186, in persist_events_and_notify
    ) = await self._storage_controllers.persistence.persist_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 414, in persist_events
    ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
  File "/usr/lib/python3.10/site-packages/synapse/util/async_helpers.py", line 291, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/lib/python3.10/site-packages/twisted/internet/defer.py", line 1692, in _inlineCallbacks
    result = context.run(
  File "/usr/lib/python3.10/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 409, in enqueue
    return await self._event_persist_queue.add_to_queue(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 240, in add_to_queue
    res = await make_deferred_yieldable(end_item.deferred.observe())
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 283, in handle_queue_loop
    ret = await self._per_item_callback(room_id, item.task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 355, in _process_event_persist_queue_task
    return await self._persist_event_batch(room_id, task)
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 673, in _persist_event_batch
    res = await self._get_new_state_after_events(
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 878, in _get_new_state_after_events
    old_state_groups = {
  File "/usr/lib/python3.10/site-packages/synapse/storage/controllers/persist_events.py", line 879, in <setcomp>
    event_id_to_state_group[evid] for evid in old_latest_event_ids
KeyError: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/synapse/federation/federation_server.py", line 1206, in _process_incoming_pdus_in_room_inner
    await self._federation_event_handler.on_receive_pdu(
  File "/usr/lib/python3.10/site-packages/synapse/handlers/federation_event.py", line 280, in on_receive_pdu
    raise Exception(
Exception: Error fetching missing prev_events for $5ot2eqxV4FTQQuHYwchhZHvNraGc-In7uL3hDH0PUKY: '$NK5I-eXXgK-2xm85ZGewUnFBFWZC5YkA3iacyFuCvqs'

And a lot of these:

2022-10-25 01:56:11,270 - synapse.handlers.device - 1129 - WARNING - _maybe_retry_device_resync-114 - Failed to handle device list update for @USER: 403: Forbidden     
2022-10-25 01:56:11,388 - synapse.http.matrixfederationclient - 709 - WARNING - federation_transaction_transmission_loop-254477 - {PUT-O-2497351} [otherServer] Request failed: PUT matrix://otherServer/_matrix/federation/v1/send/1666633335267: HttpResponseException('403: Forbidden')

@steadfasterX

I don't know if it would be technically possible, but if we could have a maintenance mode (which takes the above into account and stops almost all communication, federation, etc.), the retention tasks could be done safely, no?

I would gladly accept these regular "downtimes" to avoid the issues mentioned above, but disabling retention is not an option for me.

archlinux-github pushed a commit to archlinux/infrastructure that referenced this issue Feb 16, 2023
Deleting events causes database corruption, see e.g.
matrix-org/synapse#11738 and
matrix-org/synapse#13476.
@erikjohnston
Member

erikjohnston commented May 17, 2023

I think we can mitigate this a lot by taking the following steps:

  • Drop all caches related to events when we purge history. We should also purge state caches when we delete a room. See the WIP PR: Clear event caches when we purge history #15609.
  • Before inserting into the event_forward_extremities/event_backward_extremities/etc. tables, check that the events still exist in the DB (by doing another select); see the sketch after this list. We already do this when inserting new state groups, for example.
    • We should ideally go through the list of tables that have an event ID column and do the same thing there.
    • If the check fails we should raise an error. This will likely cause us to drop the events entirely: for clients that means returning an error, for federation it means dropping them.
    • These checks could also take the form of adding a foreign key constraint to the relevant columns.
    • Add foreign key constraint to event_forward_extremities. #15751
  • Add a cross-worker read/write style lock to ensure that we don't purge history while: 1) creating/persisting an event, 2) backfilling, 3) handling new inbound federation events. Note that the cross-worker locks are best effort, so they need to be combined with the check above to be safe. This would take two steps:
    • Add support for waiting on locks to the current mechanism. The easiest way is to sit in a loop that tries to acquire the lock every e.g. 100ms, and add a hook to notifier/replication to notify when locks are dropped.
    • Add support for read/write locks (trying to ensure reads are fast). This would likely need a different table to the existing lock mechanism.
  • Fix POST /read_markers to handle the case where existing read markers point to deleted events. We try to read the event to stop clients from replacing read markers with older events, so it's probably safe to assume that if we no longer have the existing event then it's sufficiently old that we can skip the check. Handle missing previous read marker event #15464
  • Fix push after purging history: Push notifications stop working when a history purge deletes the event a read receipt is pointing at (#12880). There may be a similar issue with calculating notification counts.

Note: there are likely more races here that we don't know about, and even if there weren't, I don't think implementing all of the above steps will necessarily fix all the known brokenness. It should massively reduce the frequency of the problems, though.
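
A sketch of the check-before-insert idea from the second bullet above (illustrative only: the table and constraint names follow Synapse's schema and the foreign-key error seen later in this thread, but this is not the actual implementation from #15751):

# Guarding an extremity insert against already-purged events, shown with
# psycopg2-style parameter binding. Not Synapse's actual code.
GUARDED_INSERT = """
    INSERT INTO event_forward_extremities (event_id, room_id)
    SELECT %(event_id)s, %(room_id)s
    WHERE EXISTS (SELECT 1 FROM events WHERE event_id = %(event_id)s)
"""

def insert_forward_extremity(cur, event_id: str, room_id: str) -> None:
    cur.execute(GUARDED_INSERT, {"event_id": event_id, "room_id": room_id})
    if cur.rowcount == 0:
        # The referenced event has been purged; raising here lets the caller
        # drop the work (an error for clients, a dropped PDU for federation).
        raise RuntimeError(f"event {event_id} no longer exists; refusing to insert")

# The same guarantee can instead be enforced declaratively with a foreign key
# constraint on the same column (the direction #15751 takes); for example:
ADD_FK = """
    ALTER TABLE event_forward_extremities
        ADD CONSTRAINT event_forward_extremities_event_id
        FOREIGN KEY (event_id) REFERENCES events (event_id)
"""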

erikjohnston added a commit that referenced this issue Jun 8, 2023
This should help a little with #13476

---------

Co-authored-by: Patrick Cloke <[email protected]>
@mcalinghee
Contributor

Hi,

I understand that the change in #15609 will help reduce the race conditions on the DB (correct me if I am wrong).

When upgrading to Synapse 1.83.0, we faced a situation where the DELETE statements on the events table were blocked, which was not the case before with Synapse 1.59.0.

Bearing in mind that the purge jobs can cause database corruption, we are disabling the retention policy.

Would you advise that we can re-activate the retention policy after #15609?

@samuel-p
Contributor

I just stumbled across this issue this morning when I had some outages on my server. Is there anything I can do to find the affected rooms and/or fix the database again, other than restoring a backup?

@toshanmugaraj

toshanmugaraj commented Aug 1, 2023

Instead of deleting from the DB, could events just be marked as purged and filtered out when accessed through an API call? "Deactivated events", something similar to deactivated users. Pagination would stop once it reaches a purged event in a timeline.

@anoadragon453
Member

@toshanmugaraj what you're describing are effectively "redacted" events. These are events that maintain their position in the room's Directed Acyclic Graph (DAG), yet the actual message content and other sensitive data have been removed.

Taking this approach to removing data indeed sidesteps the problems in this issue, but this issue is about deleting events entirely, something you may still want to do to save on disk space. The message retention feature in Synapse does currently delete events entirely. That makes sense for the space-saving benefit, but is overkill if you're just looking to remove sensitive data from your database.
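
For comparison, redaction goes through the standard client-server API rather than the admin API; a minimal sketch (endpoint per the Matrix client-server spec, with placeholder homeserver, token, room and event IDs):

import uuid
import requests
from urllib.parse import quote

HOMESERVER = "https://matrix.example.org"
HEADERS = {"Authorization": "Bearer <user_access_token>"}
ROOM_ID = "!someroom:example.org"
EVENT_ID = "$someevent"

# Redact an event: its content is stripped, but it keeps its place in the DAG.
txn_id = uuid.uuid4().hex
resp = requests.put(
    f"{HOMESERVER}/_matrix/client/v3/rooms/{quote(ROOM_ID)}/redact/{quote(EVENT_ID)}/{txn_id}",
    headers=HEADERS,
    json={"reason": "removing sensitive content"},
)
print(resp.json())  # should contain the event_id of the redaction event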

There is currently work being done by the team to close this issue, meaning that even deleting an event entirely shouldn't cause database corruption. It's a difficult thing to test for success however, as the problem is mainly caused by race conditions.

@samuel-p It's difficult to give general advice on the subject, as everyone's case is different. You can try deleting the affected room entirely, then restarting your server to clear caches. You can then try to rejoin the room (if it was a room with other homeservers in it). Any power levels your users had in the room will remain intact.
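
One way to look for rooms in this state is to check for forward extremities that reference events which no longer exist; a sketch assuming direct read access to the PostgreSQL database (table names per Synapse's schema, meant as a diagnostic starting point rather than an official procedure):

import psycopg2

# Sketch: find rooms whose forward extremities point at events that are no
# longer present in the events table, a symptom of the corruption discussed here.
# Connection parameters are placeholders.
conn = psycopg2.connect("dbname=synapse user=synapse_user host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT f.room_id, f.event_id
        FROM event_forward_extremities AS f
        LEFT JOIN events AS e ON e.event_id = f.event_id
        WHERE e.event_id IS NULL
        """
    )
    for room_id, event_id in cur.fetchall():
        print("dangling forward extremity", event_id, "in", room_id)
conn.close()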

@ktpx

ktpx commented Aug 25, 2023

I have this issue, and I see no way to fix it in the above.

@clokep
Member

clokep commented Sep 25, 2023

As @DMRobertson announced earlier this month:

In other news, for the last few releases Erik has been working to fix database corruption problems relating to room retention and purging historical events. He put out a call for testing in the Synapse Announcements room last Friday. It is not risk-free, but if anyone is feeling bold and is prepared to handle potential data loss, we would appreciate testing and feedback.

If you're following this issue then do please give it a test and let us know if you have issues.


@erikjohnston said:

I'm fairly confident that the bugs we know about have been fixed

the problem is all the races we don't know about

OTOH, if no one is complaining then hopefully that is good enough

@clokep clokep linked a pull request Sep 25, 2023 that will close this issue
@steadfasterX

how to test?

@schildbach

I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed:

synapse_1   | 2023-10-16 08:31:45,686 - synapse.federation.federation_server - 1258 - ERROR - _process_incoming_pdus_in_room_inner-558 - Failed to handle PDU $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4
synapse_1   | Traceback (most recent call last):
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 285, in on_receive_pdu
synapse_1   |     await self._get_missing_events_for_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 804, in _get_missing_events_for_pdu
synapse_1   |     await self._process_pulled_events(origin, missing_events, backfilled=False)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 937, in _process_pulled_events
synapse_1   |     await _process_new_pulled_events(fresh_events)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 894, in _process_new_pulled_events
synapse_1   |     await self._process_pulled_event(origin, ev, backfilled=backfilled)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1000, in _process_pulled_event
synapse_1   |     await self._process_received_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 1441, in _process_received_pdu
synapse_1   |     await self._run_push_actions_and_persist_event(event, context, backfilled)
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2215, in _run_push_actions_and_persist_event
synapse_1   |     await self.persist_events_and_notify(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 2276, in persist_events_and_notify
synapse_1   |     ) = await self._storage_controllers.persistence.persist_events(
synapse_1   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/logging/opentracing.py", line 916, in _wrapper
synapse_1   |     return await func(*args, **kwargs)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 421, in persist_events
synapse_1   |     ret_vals = await yieldable_gather_results(enqueue, partitioned.items())
synapse_1   |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/util/async_helpers.py", line 307, in yieldable_gather_results
synapse_1   |     raise dfe.subFailure.value from None
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1993, in _inlineCallbacks
synapse_1   |     result = context.run(
synapse_1   |              ^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/twisted/python/failure.py", line 518, in throwExceptionIntoGenerator
synapse_1   |     return g.throw(self.type, self.value, self.tb)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 416, in enqueue
synapse_1   |     return await self._event_persist_queue.add_to_queue(
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 239, in add_to_queue
synapse_1   |     res = await make_deferred_yieldable(end_item.deferred.observe())
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 282, in handle_queue_loop
synapse_1   |     ret = await self._per_item_callback(room_id, item.task)
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 362, in _process_event_persist_queue_task
synapse_1   |     return await self._persist_event_batch(room_id, task)
synapse_1   |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 680, in _persist_event_batch
synapse_1   |     res = await self._get_new_state_after_events(
synapse_1   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 880, in _get_new_state_after_events
synapse_1   |     old_state_groups = {
synapse_1   |                        ^
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/controllers/persist_events.py", line 881, in <setcomp>
synapse_1   |     event_id_to_state_group[evid] for evid in old_latest_event_ids
synapse_1   |     ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
synapse_1   | KeyError: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'
synapse_1   | 
synapse_1   | The above exception was the direct cause of the following exception:
synapse_1   | 
synapse_1   | Traceback (most recent call last):
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1248, in _process_incoming_pdus_in_room_inner
synapse_1   |     await self._federation_event_handler.on_receive_pdu(
synapse_1   |   File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 289, in on_receive_pdu
synapse_1   |     raise Exception(
synapse_1   | Exception: Error fetching missing prev_events for $RcbEyjzcF6ie0jTUyFILm4eN3iOzO72VpCdsc3ezSp4: '$Kk0mN4xEMgXaJ0AciM_TNMG1h8Eo1ZBLeRLAuqd0tEk'

@Wenri

Wenri commented Oct 18, 2023

I also got the missing prev_events error after purge history API calls. Running Synapse 1.93.0. It seems the problem still exists. I posted a message in Synapse Admins, but I guess it got buried in other messages.

The first error in the log after the purge history is:

2023-10-08 12:39:42,836 - synapse.federation.federation_server - 1260 - ERROR - _process_incoming_pdus_in_room_inner-10514 - Failed to handle PDU $VLE4qs4mVWxD2Wup11xH0lqXINR9nb8CFo_Oys4GpgM
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/synapse/federation/federation_server.py", line 1250, in _process_incoming_pdus_in_room_inner
    await self._federation_event_handler.on_receive_pdu(
  File "/usr/local/lib/python3.11/site-packages/synapse/handlers/federation_event.py", line 328, in on_receive_pdu
    await self._process_received_pdu(origin, pdu, context)
...
  File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 205, in commit
    self.conn.commit()
psycopg2.errors.ForeignKeyViolation: insert or update on table "event_forward_extremities" violates foreign key constraint "event_forward_extremities_event_id"
DETAIL:  Key (event_id)=($0WM-Av1vnFc_yJFA-X_d3MWqJY5mpW0mAomWXtkcm8w) is not present in table "events".

Maybe it is time to reopen this issue.

@schildbach

I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed…

I remedied my problem by force-removing the affected room. There is a certain chance this particular event is a leftover from a time when the issue wasn't yet fixed.

I'll let you know if it crops up again.

@schildbach

It just cropped up again, after the usual database maintenance.
