Deleting events causes database corruption #13476
Comments
Other related issues:
|
Would it be a solution to delete the entire cache before deleting? Would that make sense? |
Sadly that's not sufficient, since the cache might be immediately repopulated before the delete actually happens.
I wouldn't say so. Those are the history purge deleting less stuff than they should; they do not cause inconsistency between Synapse and the database and cannot cause incorrect data in the database. |
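To illustrate the point about the cache being repopulated, here is a minimal sketch of the ordering problem. `CachedStore` and both helper functions are hypothetical names invented for this example and are not Synapse code; the `concurrent_read` callback stands in for another request that lands between the two steps.

```python
class CachedStore:
    """Toy read-through cache in front of a dict that plays the database."""

    def __init__(self, rows):
        self.db = dict(rows)
        self.cache = {}

    def get(self, key):
        # Populate the cache on a miss, then always answer from the cache.
        if key not in self.cache and key in self.db:
            self.cache[key] = self.db[key]
        return self.cache.get(key)


def clear_then_delete(store, key, concurrent_read):
    store.cache.clear()
    concurrent_read()  # a reader repopulates the cache in the gap
    del store.db[key]


def delete_then_invalidate(store, key, concurrent_read):
    concurrent_read()
    del store.db[key]
    store.cache.pop(key, None)  # invalidate only after the row is gone


stale = CachedStore({"$ev": "body"})
clear_then_delete(stale, "$ev", lambda: stale.get("$ev"))

clean = CachedStore({"$ev": "body"})
delete_then_invalidate(clean, "$ev", lambda: clean.get("$ev"))
```

After this runs, `stale.get("$ev")` still returns the deleted row while `clean.get("$ev")` returns `None`. Even the second ordering is only safe in this single-threaded toy; with real concurrency a reader that fetched the row before the delete could still write it back into the cache afterwards.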
Incidentally: I think purging a complete room with no local members should be safe from races, but there are still cache coherency bugs (#11521). |
I am a homeserver admin affected by this - the database is ever-growing and deleting 90+ days old messages is a requirement (privacy aware NGO). Anything I can do to help get this issue resolved? |
Something like #13916 would help in the case of correctly invalidating the This leads to coherency bugs such as #11521. |
We had a massive increase in the database in the last 15 days, about 50 GB. Now it's over 102 GB with about 200 users (20-30 really use it). We had retention enabled (before the massive increase) to minimize the history for the user and have more storage at the end. Now we are having trouble getting rid of some rooms, especially the first three:
The #matrix:matrix.org room is deleted and not listed via the admin API anymore, but it is still in the database. Any chance to get rid of those few rows? :) Is this problem related to this issue here? We have blocked the matrix room for our users until we can solve the problem. |
A little off-topic, but I wonder how matrix.org handles this massive amount of data? The constantly increasing storage requirement is the main reason I don't "idle" in huge public rooms using my HS account. I really hope that the core team prioritizes this issue, because after many years, the storage requirement starts to get out of hand 😢 |
@Dan-Sun How exactly did you delete the room? |
@DMRobertson So far we have 7 rooms which are still in the database but not listed via the API. Leaving API error
Delete room error
At the moment we're having these errors:
And a lot of those
|
Dunno if that would be technically possible, but if we could have a maintenance mode (which takes the above into account and so stops almost all communication, federation etc) the retention tasks could be done safely, no? I would gladly accept these regular "downtimes" to avoid the mentioned issues, but disabling retention is impossible for me. |
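The maintenance-mode idea above amounts to making purges and writes mutually exclusive. A minimal sketch of that idea, assuming a single process and using a hypothetical `PurgeGate` class (not a Synapse API), could look like this:

```python
import threading
from contextlib import contextmanager


class PurgeGate:
    """Hypothetical gate: writers wait while a purge holds the lock."""

    def __init__(self):
        self._lock = threading.Lock()

    @contextmanager
    def purging(self):
        with self._lock:
            yield

    def persist(self, db, event_id):
        with self._lock:
            db.add(event_id)


def demo():
    gate, db, order = PurgeGate(), set(), []

    def writer():
        gate.persist(db, "$new")
        order.append("persist")

    with gate.purging():
        t = threading.Thread(target=writer)
        t.start()
        t.join(timeout=0.1)  # the writer is blocked, so this times out
        order.append("purge done")
    t.join()  # the lock is free now, so the writer finishes
    return order, db
```

In this sketch the write always completes after the purge has finished. A deployment with several worker processes would need a lock shared across processes, which this example does not attempt.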
Deleting events causes database corruption, see e.g. matrix-org/synapse#11738 and matrix-org/synapse#13476.
I think we can mitigate this a lot by taking the following steps:
Note: there are likely more races here that we don't know about, and even if there weren't, I don't think implementing all the above steps will necessarily fix all the known brokenness. It should massively reduce the frequency of the problems though. |
This should help a little with #13476 --------- Co-authored-by: Patrick Cloke <[email protected]>
Hi, I understand the change #15609 will help in terms of reducing the race condition on the DB (correct me if I am wrong). When upgrading to synapse 1.83.0, we have faced the situation where the Having in mind that the purge jobs can cause database corruptions, we are disabling the retention policy. Would you advise that we can re-activate the retention policy after #15609? |
I just stumbled across this issue this morning, when I had some outages on my server. Is there anything I can do to find the affected rooms and/or to fix the database again - except restoring a backup? |
Instead of deleting from the DB, can it be just marked as purged and filtered out while trying to access through API call. Deactivated events, something similar to deactivated users. Pagination should stop once it reaches the purged event in a timeline. |
@toshanmugaraj what you're describing are effectively "redacted" events. These are events that maintain their position in the room's Directed Acyclic Graph (DAG), yet the actual message content and other sensitive data have been removed. Taking this approach to removing data indeed sidesteps the problems in this issue, but this issue is talking about deleting events entirely -- something you may still want to do to save on disk space. The message retention feature in Synapse does currently delete events entirely. That makes sense to do for the space saving benefit, but is overkill if you're just looking to remove sensitive data from your database. There is currently work being done by the team to close this issue, meaning that even deleting an event entirely shouldn't cause database corruption. It's a difficult thing to test for success however, as the problem is mainly caused by race conditions.
@samuel-p It's difficult to give general advice on the subject, as everyone's case is different. You can try deleting the affected room entirely, then restarting your server to clear caches. You can then try to rejoin the room (if it was a room with other homeservers in it). Any power levels your users had in the room will remain intact. |
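The difference between redacting and deleting described above can be sketched with a toy event graph. The event shape and the `toy_redact` and `missing_parents` helpers are assumptions made for this example; the real redaction rules in the Matrix specification are more detailed than blanking `content`.

```python
def toy_redact(event):
    """Strip the payload but keep the fields that place the event in the DAG."""
    return {
        "event_id": event["event_id"],
        "prev_events": list(event["prev_events"]),
        "content": {},  # message body and other sensitive data removed
    }


def missing_parents(events):
    """Return (child, parent) pairs whose parent is absent from `events`."""
    return sorted(
        (eid, parent)
        for eid, ev in events.items()
        for parent in ev["prev_events"]
        if parent not in events
    )


def build_room():
    return {
        "$1": {"event_id": "$1", "prev_events": [], "content": {"body": "hi"}},
        "$2": {"event_id": "$2", "prev_events": ["$1"], "content": {"body": "secret"}},
        "$3": {"event_id": "$3", "prev_events": ["$2"], "content": {"body": "bye"}},
    }


redacted = build_room()
redacted["$2"] = toy_redact(redacted["$2"])  # content gone, links intact

purged = build_room()
del purged["$2"]  # row gone, "$3" now points at nothing
```

Here `missing_parents(redacted)` is empty, while `missing_parents(purged)` reports that `$3` refers to the deleted `$2`. The toy only shows why removing rows needs more care than blanking content; it says nothing about how a real purge treats the boundary of the deleted range.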
I have this issue, and I see no way to fix it in the above. |
As @DMRobertson announced earlier this month:
If you're following this issue then do please give it a test and let us know if you have issues. @erikjohnston said:
|
how to test? |
I've repeatedly got this error in the log, and I'm running Synapse 1.94.0, so the issue seems not entirely fixed:
|
I also got the missing prev_events error after purge history API calls. Running Synapse 1.93.0. It seems the problem still exists. I posted a message in Synapse Admins, but I guess it got buried in other messages. The first error in the log after the purge history is:
Maybe it is time to reopen this issue. |
I remedied my problem by force-removing the affected room. There is a certain chance this particular event is a left-over from a time where the issue wasn't yet fixed. I'll let you know if it crops up again. |
It just cropped up again, after the usual database maintenance. |
Deleting events - whether you do it via the purge history admin API, the delete room admin API, or by enabling retention on your server - has the potential to corrupt the database.
In short, the delete operations can race against other operations (notably event persistence), leading the database to get into an inconsistent state. The main reason is that the delete operations do not invalidate the in-memory caches, but in general there are a bunch of race conditions.
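As a rough illustration of the race described above, the following toy model shows how a purge that skips cache invalidation lets a later write reference an event that no longer exists. `ToyEventStore` and everything in it are hypothetical and far simpler than Synapse's actual storage layer.

```python
class ToyEventStore:
    def __init__(self):
        self.db = {}             # event_id -> list of prev_event ids
        self.seen_cache = set()  # in-memory cache of ids believed to exist

    def have_event(self, event_id):
        # A cache hit short-circuits the database lookup.
        if event_id in self.seen_cache:
            return True
        if event_id in self.db:
            self.seen_cache.add(event_id)
            return True
        return False

    def persist(self, event_id, prev_events):
        # Refuse events whose parents we do not (think we) have.
        missing = [p for p in prev_events if not self.have_event(p)]
        if missing:
            raise KeyError(f"missing prev_events: {missing}")
        self.db[event_id] = list(prev_events)
        self.seen_cache.add(event_id)

    def purge(self, event_id, invalidate_cache):
        self.db.pop(event_id, None)
        if invalidate_cache:
            self.seen_cache.discard(event_id)

    def dangling_references(self):
        # (child, parent) pairs where the parent row is gone.
        return sorted(
            (child, parent)
            for child, parents in self.db.items()
            for parent in parents
            if parent not in self.db
        )


def run(invalidate_cache):
    store = ToyEventStore()
    store.persist("$A", [])
    store.persist("$B", ["$A"])
    store.purge("$B", invalidate_cache=invalidate_cache)
    try:
        store.persist("$C", ["$B"])  # arrives after the purge
    except KeyError:
        pass  # rejected because "$B" is known to be missing
    return store.dangling_references()
```

With `run(False)` the stale cache entry lets `$C` through and the store ends up with a reference to the deleted `$B`; with `run(True)` the write is rejected and nothing dangles. Invalidation alone would not close every race in a concurrent system, which matches the caveat in the paragraph above.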
Examples include:
No state group for unknown or outlier event
#12507 (comment)), or outlier forward extremities ("No state group for unknown or outlier event" error in federation #13026). Also ERROR - POST-758856 - Failed handle request via 'RoomMembershipRestServlet' – when trying to leave room #13524.