Event missing from event_to_state_groups, causing /sync to fail with "No state group for unknown or outlier event" #12549
Comments
Thanks for reporting! One question, just to clarify: on the last line of your log it shows the line Second, could you run this query on your homeserver and let me know if it returns any rows? select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null; |
select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null; For the record: what you are asking about is a separate issue, which Rich van der Horst appears to be working on: #12507 (comment) |
Yes sorry I forgot to mention that this
This query returns zero rows. |
Right, I am verifying that this issue is separate from that issue by making sure the query returned no rows. |
Error message appears to have been introduced in #12191 |
@Ezwen , @MparkG, thanks for reporting this. As with #12507, it suggests that there is some sort of corruption in your database, but something slightly different to that. I'd like to try to understand what has happened here. Please could you take one of the event ids from the error report (the XXXXX) and run this query: select outlier from events where event_id='$XXXX'; Please could you also share all the log lines relating to the problematic request. Taking the example of @Ezwen's logs in the first message, these will be identified by the request id grep GET-1273 homeserver.log |
I tried with both event ids that cause this error, and I get
It turns out I obtain this stack trace not only through To make sure you get everything, I ran the following query on my logs:
Then I redacted a few "private" homeserver hostnames from the logs (renamed hope this helps! |
Hrm. I think the only way that can possibly happen is if an event is deleted from the database. Apologies if I've asked this before, but:
|
I get no rows either. Grep finds nothing, but the log is only at warning level. In my case I had turned on retention for a while, thinking it would work (given that the option exists without even a warning of instability). I suspect something got deleted, too. |
I have never enabled retention
I did use the delete room API a very long time ago, to delete the |
Other news: following the tracks of @MparkG, I've updated from 1.54.0 to 1.56.0, and everything is going well. This confirms that the problem only appears with the 1.56.0→1.57.0 update. |
I noticed that on my homeserver #synapse:matrix.org hadn't received new messages for some days. When trying to leave this room from element, it failed with an "internal server error". In my homeserver's logs I found a lot of those "No state group for unknown or outlier event" messages. Deleting the room via the HTTP API didn't succeed either. |
Ok, then I'm mystified. All I can suggest is that you turn on DEBUG logging, and I can look at the logs to see if there are any clues. @Ezwen: if you're happy to share debug logs, could you contact me at |
This is unsurprising. Synapse 1.57 added the additional consistency check (#12191) which is now detecting an existing problem. |
Following further discussion with @Ezwen; it looks like the bad event is in his The following query will show affected events; it should return no rows for a healthy database. (Note that it has to scan the entire select e.* from events e left join rejections r using (event_id) left join event_to_state_groups esg using (event_id) where not e.outlier and esg.event_id is null and r.event_id is null; Unfortunately, as with #12507, I have no idea what could have caused it. Unlike #12507, my only real idea for a workaround is to delete the room (https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#version-2-new-version) and re-join. |
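To illustrate what the query above checks for, here is a minimal Python sketch that runs the same anti-join against a toy in-memory SQLite database. The three-table schema is a heavily reduced assumption (the real Synapse tables have many more columns and normally live in PostgreSQL), and the event IDs, room IDs and function names are hypothetical.

```python
import sqlite3

# Toy schema: only the columns the diagnostic query touches.
# This is an illustration, not the real Synapse schema.
SCHEMA = """
CREATE TABLE events (event_id TEXT PRIMARY KEY, room_id TEXT, outlier BOOLEAN);
CREATE TABLE rejections (event_id TEXT PRIMARY KEY);
CREATE TABLE event_to_state_groups (event_id TEXT PRIMARY KEY, state_group INTEGER);
"""

# Same anti-join as the query quoted above, selecting two columns for brevity:
# non-outlier, non-rejected events that have no state group row.
QUERY = """
SELECT e.event_id, e.room_id
FROM events e
LEFT JOIN rejections r USING (event_id)
LEFT JOIN event_to_state_groups esg USING (event_id)
WHERE NOT e.outlier AND esg.event_id IS NULL AND r.event_id IS NULL
ORDER BY e.event_id
"""


def find_events_missing_state_group(conn):
    """Return (event_id, room_id) rows the consistency check would flag."""
    return conn.execute(QUERY).fetchall()


def build_demo_db():
    """Create a toy database with one event of each kind (hypothetical IDs)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("$healthy", "!a:example.org", 0),   # has a state group
            ("$outlier", "!a:example.org", 1),   # outliers may lack state
            ("$rejected", "!b:example.org", 0),  # rejected events may lack state
            ("$broken", "!b:example.org", 0),    # the inconsistent case
        ],
    )
    conn.execute("INSERT INTO event_to_state_groups VALUES ('$healthy', 1)")
    conn.execute("INSERT INTO rejections VALUES ('$rejected')")
    return conn


if __name__ == "__main__":
    for event_id, room_id in find_events_missing_state_group(build_demo_db()):
        print(event_id, room_id)
```

Only the `$broken` row is reported: it is neither an outlier nor rejected, yet it has no entry in event_to_state_groups, which is the situation the check added in #12191 detects.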
ok @MparkG seems you have yet a different problem. Feel free to contact me at |
I'm also seeing the same stack trace as OP. I can provide direct access to my server and database if that helps. None of my clients are able to sync at all. The latest message my always-on client has cached is from Friday 2:46:54 pm, in case this helps track down the broken event elsewhere. For now I'm downgrading to 1.56.
|
I'm also getting this exception, and none of the debug queries above return any rows for me either. I think the issue is that I deleted the #matrix:matrix.org room on my server. I had previously attempted to leave it in my client, but that only returned internal server errors. |
With a lot of help from @richvdh, I managed to set up a fix/workaround for this problem, which is to delete all rooms that contain "faulty" events.
select e.room_id, count(*)
from events e left join rejections r using (event_id) left join event_to_state_groups esg using (event_id)
where not e.outlier and esg.event_id is null and r.event_id is null
group by e.room_id;
(If you have zero results here, then it probably means you are not affected by this specific issue.)
curl -H "Authorization: Bearer <access token>" 'https://<homeserver>/_synapse/admin/v2/rooms/<room_id>' -X DELETE -d '{"purge": true}'
Note that this means history will be lost if said room was not shared with other accounts over federation; make backups of affected rooms if required!
curl -H "Authorization: Bearer <access token>" 'https://<homeserver>/_synapse/admin/v2/rooms/delete_status/<token>' -X GET
If it says If it says If it says
curl -H "Authorization: Bearer <access token>" 'https://<homeserver>/_synapse/admin/v2/rooms/<room_id>' -X DELETE -d '{"force_purge": true}'
Note: after using
Side note: in my specific case, the overall problem was mixed with a database corruption problem. What helped with that was a dump and restore of the PostgreSQL instance. After that, I was able to proceed with the recipe above without major problems. |
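For anyone scripting the recipe above over several rooms, the two admin API calls could be wrapped roughly as follows. This is a sketch under assumptions, not a tested tool: the helper names and parameters are hypothetical, the endpoints and JSON bodies are copied from the curl commands above, and percent-encoding the room ID in the path is my own choice. The functions only build the requests; nothing is sent until you pass one to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request


def build_delete_room_request(base_url, access_token, room_id, force_purge=False):
    """Build (but do not send) the room DELETE request from the recipe.

    base_url is e.g. "https://<homeserver>"; all names here are hypothetical.
    """
    # Body mirrors the two curl variants: {"purge": true} or {"force_purge": true}.
    body = {"force_purge": True} if force_purge else {"purge": True}
    # Room IDs contain "!" and ":", so percent-encode them for the URL path.
    quoted_room = urllib.parse.quote(room_id, safe="")
    return urllib.request.Request(
        f"{base_url}/_synapse/admin/v2/rooms/{quoted_room}",
        data=json.dumps(body).encode("utf-8"),
        method="DELETE",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def build_delete_status_request(base_url, access_token, status_token):
    """Build the GET request that polls the status of a pending deletion."""
    quoted_token = urllib.parse.quote(status_token, safe="")
    return urllib.request.Request(
        f"{base_url}/_synapse/admin/v2/rooms/delete_status/{quoted_token}",
        method="GET",
        headers={"Authorization": f"Bearer {access_token}"},
    )
```

Keeping request construction separate from sending makes it easy to print and review each URL and body before deleting anything, which seems prudent given that purged history cannot be recovered.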
I am also getting bit by this bug (it seems), though none of the above queries have returned any rows from my DB. Further, I'm fairly certain I don't have corruption, as I just recently started a completely fresh database instance. Here is my sanitized stack trace:
This started out afflicting my bots (go-neb and now wordle), but the error persists when I try to sync with my regular user with |
I'm going to close this issue as I believe it was very specific to @Ezwen's situation (he had underlying postgres corruption leading to missing rows in Other people seeing the "No state group for unknown or outlier event":
Note that it's generally easier for us to handle duplicate reports than it is to unpick an issue which contains reports of multiple similar but slightly different problems. In other words: it's better to open a new issue than comment on an existing one unless you are sure that you have the same problem. |
Description
My homeserver encounters a case when upgrading from 1.54.0 to v1.57.0, where synchronization of messages stops working and where these kinds of errors appear in the logs:
It is possible that these errors started appearing after users started playing with thread messages in recent versions of Element, both while the server was in 1.54.0 and when it was upgraded to 1.57.0.
When rolling back to 1.54.0, the problem disappears and everything goes back to normal.
Steps to reproduce
One way to reproduce the error, with my setup:
The synchronization never completes, and instead errors such as those shown above appear in the logs.
Version information
Version: 1.57.0 (when the bug appeared)
Install method: docker