Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Event missing from event_to_state_groups, causing /sync to fail with "No state group for unknown or outlier event" #12549

Closed
Ezwen opened this issue Apr 26, 2022 · 23 comments
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Ezwen
Copy link

Ezwen commented Apr 26, 2022

Description

My homeserver encounters a case when upgrading from 1.54.0 to v1.57.0, where synchronization of messages stops working and where these kinds of errors appear in the logs:

2022-04-21 17:02:28,383 - synapse.http.server - 100 - ERROR - GET-1273- Failed handle request via 'SyncRestServlet': <XForwardedForRequest at 0x7fbb5103e100 method='GET' uri='/_matrix/client/r0/sync?fil
ter=5&timeout=0&_cacheBuster=1650553335653' clientproto='HTTP/1.1' site='8008'>                                                                                                                                                              
Traceback (most recent call last):                                                 
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks                                                                                                
    result = current_context.run(                 
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator                                                                                      
    return g.throw(self.type, self.value, self.tb)                          
  File "/usr/local/lib/python3.9/site-packages/synapse/util/caches/response_cache.py", line 246, in cb                                                                                                   
    return await callback(*args, **kwargs)                                                                                                                                                               
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 355, in _wait_for_sync_for_user                                                                                           
    result: SyncResult = await self.current_sync_for_user(                                                                                                                                                
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 406, in current_sync_for_user                                                                                             
    sync_result = await self.generate_sync_result(                                                                                                                                                        
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1114, in generate_sync_result                                                                                             
    res = await self._generate_sync_entry_for_rooms(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1625, in _generate_sync_entry_for_rooms                                                                                   
    await concurrently_execute(handle_room_entries, room_entries, 10)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 233, in concurrently_execute                                                                                         
    await yieldable_gather_results(               
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 274, in yieldable_gather_results                                                                                     
    raise dfe.subFailure.value from None  
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks                                                                                                
    result = current_context.run(                         
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator                                                                                      
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 225, in _concurrently_execute_inner                                                                                  
    await maybe_awaitable(func(value))              
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1614, in handle_room_entries                                                                                              
    await self._generate_room_entry(                                 
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 2060, in _generate_room_entry                                                                                             
    batch = await self._load_filtered_recents(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 592, in _load_filtered_recents                                                                                            
    loaded_recents = await filter_events_for_client(
  File "/usr/local/lib/python3.9/site-packages/synapse/visibility.py", line 85, in filter_events_for_client                                                                                              
    event_id_to_state = await storage.state.get_state_for_events(
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/state.py", line 671, in get_state_for_events                                                                                              
    event_to_groups = await self.stores.main._get_state_group_for_events(event_ids)
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks                                                                                                
    result = current_context.run(gen.send, result)
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/databases/main/state.py", line 331, in _get_state_group_for_events                                                                        
    raise RuntimeError("No state group for unknown or outlier event %s" % e)
RuntimeError: No state group for unknown or outlier event XXXXXXXX 

It is possible that these errors started appearing after users started playing with thread messages in recent versions of Element, both while the server was in 1.54.0 and when it was upgraded to 1.57.0.

When rolling back to 1.54.0, the problem disappears and everything goes back to normal.

Steps to reproduce

One way to reproduce the error, with my setup:

  • Upgrade from 1.54.0 to 1.57.0
  • Open a fresh Element, log in the homeserver, and wait for the initial synchronization.

The synchronization never completes, and instead errors such as those shown above appear in the logs.

Version information

  • Homeserver: my own private homeserver
  • Version: 1.57.0 (when the bug appeared)

  • Install method: docker

  • Platform: debian
@H-Shay
Copy link
Contributor

H-Shay commented Apr 26, 2022

Thanks for reporting! One question, just to clarify: on the last line of your log it shows the line RuntimeError: No state group for unknown or outlier event XXXXXXXX. Is the log actually returning exactly that line, or have you edited it and substituted XXXXXXXX for a specific event id(s) (an event id would look something like $1638282339138PTbRx:matrix.org)?

Second, could you run this query on your homeserver and let me know if it returns any rows?

select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null;

@H-Shay H-Shay added the X-Needs-Info This issue is blocked awaiting information from the reporter label Apr 26, 2022
@MparkG
Copy link

MparkG commented Apr 26, 2022

Thanks for reporting! One question, just to clarify: on the last line of your log it shows the line RuntimeError: No state group for unknown or outlier event XXXXXXXX. Is the log actually returning exactly that line, or have you edited it and substituted XXXXXXXX for a specific event id(s) (an event id would look something like $1638282339138PTbRx:matrix.org)?

Second, could you run this query on your homeserver and let me know if it returns any rows?

select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null;

select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null;
No rows found.

for the record; that what you are asking is another issue which rich van der horst appears to be working on: #12507 (comment)

@Ezwen
Copy link
Author

Ezwen commented Apr 26, 2022

Thanks for reporting! One question, just to clarify: on the last line of your log it shows the line RuntimeError: No state group for unknown or outlier event XXXXXXXX. Is the log actually returning exactly that line, or have you edited it and substituted XXXXXXXX for a specific event id(s) (an event id would look something like $1638282339138PTbRx:matrix.org)?

Yes sorry I forgot to mention that this XXXXXXXX is simply a redaction I made as I did not want to share the hostnames of the homeservers in question. It is indeed something like $1638282339138PTbRx:matrix.org, and after a quick lookup I can tell that two events are appearing in the logs: one whose id is from own homeserver, and one whose id is from another homeserver.

Second, could you run this query on your homeserver and let me know if it returns any rows?

select * from event_forward_extremities efe left join events e using (event_id) where e.event_id is null;

This query returns zero rows.

@H-Shay H-Shay added X-Needs-Discussion and removed X-Needs-Info This issue is blocked awaiting information from the reporter labels Apr 26, 2022
@H-Shay
Copy link
Contributor

H-Shay commented Apr 26, 2022

for the record; that what you are asking is another issue which rich van der horst appears to be working on

Right, I am verifying that this issue is separate from that issue by making sure the query returned no rows.

@H-Shay
Copy link
Contributor

H-Shay commented Apr 26, 2022

Error message appears to have been introduced in #12191

@richvdh
Copy link
Member

richvdh commented Apr 27, 2022

@Ezwen , @MparkG, thanks for reporting this. As with #12507, it suggests that there is some sort of corruption in your database, but something slightly different to that.

I'd like to try to understand what has happened here. Please could you take one of the event ids from the error report (the XXXXX) and run this query:

select outlier from events where event_id='$XXXX';

Please could you also share all the log lines relating to the problematic request. Taking the example of @Ezwen's logs in the first message, these will be identified by the request id GET-1273, so you could find them with:

grep GET-1273 homeserver.log

@richvdh richvdh added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter and removed X-Needs-Discussion labels Apr 27, 2022
@richvdh richvdh changed the title After upgrade to 1.57.0, duing synchronisations: "No state group for unknown or outlier event" /sync fails with "No state group for unknown or outlier event" Apr 27, 2022
@Ezwen
Copy link
Author

Ezwen commented Apr 27, 2022

@Ezwen , @MparkG, thanks for reporting this. As with #12507, it suggests that there is some sort of corruption in your database, but something slightly different to that.

I'd like to try to understand what has happened here. Please could you take one of the event ids from the error report (the XXXXX) and run this query:

select outlier from events where event_id='$XXXX';

I tried with both event ids that cause this error, and I get 0 rows each time.

Please could you also share all the log lines relating to the problematic request. Taking the example of @Ezwen's logs in the first message, these will be identified by the request id GET-1273, so you could find them with:

grep GET-1273 homeserver.log

It turns out I obtain this stack trace not only through GET-1273 errors! I had not noticed that last time.

To make sure you get everything, I ran the following query on my logs:

sudo journalctl -S "10 days ago" CONTAINER_NAME=synapse | grep -i -C 50 "ERROR - GET-" > ~/tmp/synapse-errors.txt

Then I redacted a few "private" homeserver hostnames from the logs (renamed HOMESERVER_A, HOMESERVER_B, etc), and here is the result: synapse-errors.txt

hope this helps!

@H-Shay H-Shay removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Apr 27, 2022
@richvdh
Copy link
Member

richvdh commented Apr 27, 2022

I tried with both event ids that cause this error, and I get 0 rows each time.

Hrm. I think the only way that can possibly happen is if an event is deleted from the database.

Apologies if I've asked this before, but:

@MparkG
Copy link

MparkG commented Apr 27, 2022

select outlier from events where event_id='$XXXX';

i get no rows either.

Grep finds nothing, but log is only at warning level...

in my case i had turned on retention for a while, thought it would work (given the option exists without even a warning of instability)
then turned it off again the first time i downdated to 1.56.

I suspect something got deleted,too.
but the weird thing for me is: how does synapse find that event id, if its no where to be found manually in the db?
could it be a remaining id from some redis cached data, incoherent with the data in the db which may be deleted?
given that redis stores data persistently (appearently...) it would persist version changes, synapse restarts, etc...

@Ezwen
Copy link
Author

Ezwen commented Apr 28, 2022

do you have retention enabled on your server?

I have never enabled retention

Have you used the delete room API or the purge history API on this room, either directly or via some control panel or "maintenance" script?

I did use the delete room API a very long time ago, to delete the #matrix:matrix.org room from my server (as it was using too much resources to remain in sync with it, at the time). But I would be surprised that this would be the cause, especially given that the two missing events are from private homeservers that I am familiar with, and not "random" events from diverse homeservers found in #matrix:matrix.org.

@Ezwen
Copy link
Author

Ezwen commented Apr 28, 2022

Other news: following the tracks of @MparkG, I've updated from 1.54.0 to 1.56.0, and everything is going well. This confirms that the problem only appears with the 1.56.0→1.57.0 update.

@ddt-consult
Copy link

I noticed that on my homeserver #synapse:matrix.org hadn't received new messages for some days. When trying to leave this room from element, it failed with an "internal server error". In my homeserver's logs I found a lot of those "No state group for unknown or outlier event" messages. Deleting the room via the HTTP API didn't succeed either.
After downgrading my homeserver from 1.57.0 to 1.56.0 I could leave #synapse:matrix.org and delete it.

@richvdh
Copy link
Member

richvdh commented Apr 28, 2022

Ok, then I'm mystified. All I can suggest is that you turn on DEBUG logging, and I can look at the logs to see if there are any clues.

@Ezwen: if you're happy to share debug logs, could you contact me at @richvdh:sw1v.org?

@richvdh
Copy link
Member

richvdh commented Apr 28, 2022

This confirms that the problem only appears with the 1.56.0→1.57.0 update.

This is unsurprising. Synapse 1.57 added the additional consistency check (#12191) which is now detecting an existing problem.

@richvdh
Copy link
Member

richvdh commented Apr 29, 2022

Following further discussion with @Ezwen; it looks like the bad event is in his events table. However, it is missing from event_to_state_groups. As I suspected, this reflects corruption in the database which may have been present for a long time, and is now being revealed by the new consistency check in Synapse 1.57.

The following query will show affected events; it should return no rows for a healthy database. (Note that it has to scan the entire events table, so may take a while to return.)

select e.* from events e left join rejections r using (event_id) left join event_to_state_groups esg using (event_id) where not e.outlier and esg.event_id is null and r.event_id is null;

Unfortunately, as with #12507, I have no idea what could have caused it. Unlike #12507, my only real idea for a workaround is to delete the room (https://matrix-org.github.io/synapse/latest/admin_api/rooms.html#version-2-new-version) and re-join.

@richvdh richvdh changed the title /sync fails with "No state group for unknown or outlier event" Event missing from event_to_state_groups, causing /sync to fail with "No state group for unknown or outlier event" Apr 29, 2022
@MparkG
Copy link

MparkG commented Apr 29, 2022

hello, it seems i dont have this particular situation.
image

ill look at the log on debug level as soon as i can..

@richvdh
Copy link
Member

richvdh commented Apr 29, 2022

ok @MparkG seems you have yet a different problem. Feel free to contact me at @richvdh:sw1v.org with debug logs.

@Twi1ightSparkle
Copy link

Twi1ightSparkle commented Apr 29, 2022

I'm also seeing the same stack trace as OP. I can provide direct access to my server and database if that helps.

None of my clients are able to sync at all.

The latest message my always-on client have cashed is Friday 2:46:54 pm in case this helps track down the broken event elsewhere.

For now I'm downgrading to 1.56
Ping me in an internal room on MonTuesday if you'd like me to upgrade again and enable debug logs or get direct access

synapse=> select e.* from events e left join rejections r using (event_id) left join event_to_state_groups esg using (event_id) where not e.outlier and esg.event_id is null and r.event_id is null;
 topological_ordering | event_id | type | room_id | content | unrecognized_keys | processed | outlier | depth | origin_server_ts | received_ts | sender | contains_url | instance_name | stream_ordering | state_key | rejection_reason 
----------------------+----------+------+---------+---------+-------------------+-----------+---------+-------+------------------+-------------+--------+--------------+---------------+-----------------+-----------+------------------
(0 rows)

@richvdh
Copy link
Member

richvdh commented May 4, 2022

Just writing it down here for the record: it turns out @MparkG had a different corruption issue (somehow an event had been modified in his database), which was detected by #12620.

@lutoma
Copy link

lutoma commented May 6, 2022

I'm also getting this exception, and none of the debug queries above return any rows for me either. I think the issue is that I deleted the #matrix:matrix.org room on my server. I had previously attempted to leave it in my client, but that only returned internal server errors.

@Ezwen
Copy link
Author

Ezwen commented May 6, 2022

With a lot of help from @richvdh , I managed to setup a fix/workaround for this problem, which is to delete all rooms that contain "faulty" events.

  1. Run the following query to know which rooms contain "faulty" events:
select e.room_id, count(*) 
  from events e left join rejections r using (event_id) left join event_to_state_groups esg using (event_id) 
  where not e.outlier and esg.event_id is null and r.event_id is null
  group by e.room_id;

(If you have zero results here, then it probably means you are not affected by this specific issue)

  1. For each room in the result, use the synapse admin API to delete it :
curl -H "Authorization: Bearer <access token>" 'https://<homeserver>/_synapse/admin/v2/rooms/<room_id>' -X DELETE -d '{"purge": true}'

Note that this means history will be lost if said room was not shared with other accounts over federation − make backups of affected rooms if required!

  1. It will give you a token as an answer, you can use it to track the progress of the task:
curl -H "Authorization: Bearer <access token>" 'http://<homeserver>/_synapse/admin/v2/rooms/delete_status/<token>' -X GET

If it says "status":"purging", wait a bit and run the last query again

If it says status":"complete", move to deleting the next room in the list

If it says "status":"failed", re-try the query with force_purge:

curl -H "Authorization: Bearer <access token>" 'https://<homeserver>/_synapse/admin/v2/rooms/<room_id>' -X DELETE -d '{"force_purge": true}'

Note: after using force_purge, tell the affected users to clean the cache in their Element sessions, otherwise these rooms will just "stay there" in the UI

  1. Run again the query from (1) ; if it's empty, then you are done!

Side note: in my specific case, the overall problem was mixed with a database corruption problem. What helped for this was to dump+restore of the postgre instance. Then I was able to proceed with the recipe above without major problems.

@leprasmurf
Copy link

leprasmurf commented May 8, 2022

I am also getting bit by this bug (it seems), though none of the above queries have returned any rows from my DB. Further, I'm fairly certain I don't have corruption, as I just recently started a completely fresh database instance.

Here is my sanitized stack trace:

2022-05-08 14:47:57,175 - INFO - synapse.access.http.8008 - 427 - GET-13038 - <redacted> - 8008 - {@<redacted>} Processed request: 3.004sec/0.001sec (0.003sec, 0.000sec) (0.000sec/0.001sec
/1) 213B 200 "GET /_matrix/client/r0/sync?access_token=<redacted>&since=s57873_4056068_76_136407_20186_69_1939_195034_12&timeout=3000 HTTP/1.1" "Python/3.8 aiohttp/3.8.1" [0 dbevts]
2022-05-08 14:47:58,751 - ERROR - synapse.http.server - 100 - GET-13048 - Failed handle request via 'SyncRestServlet': <XForwardedForRequest at 0x7fdecfe9b700 method='GET' uri='/_matrix/client/r0/sync?filter=0&timeout=30000' clientproto='HTTP/1.1' site='8008'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/caches/response_cache.py", line 246, in cb
    return await callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 355, in _wait_for_sync_for_user
    result: SyncResult = await self.current_sync_for_user(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 406, in current_sync_for_user
    sync_result = await self.generate_sync_result(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1113, in generate_sync_result
    res = await self._generate_sync_entry_for_rooms(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1624, in _generate_sync_entry_for_rooms
    await concurrently_execute(handle_room_entries, room_entries, 10)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 235, in concurrently_execute
    await yieldable_gather_results(
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 276, in yieldable_gather_results
    raise dfe.subFailure.value from None
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/usr/local/lib/python3.9/site-packages/twisted/python/failure.py", line 514, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.9/site-packages/synapse/util/async_helpers.py", line 227, in _concurrently_execute_inner
    await maybe_awaitable(func(value))
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 1613, in handle_room_entries
    await self._generate_room_entry(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 2059, in _generate_room_entry
    batch = await self._load_filtered_recents(
  File "/usr/local/lib/python3.9/site-packages/synapse/handlers/sync.py", line 592, in _load_filtered_recents
    loaded_recents = await filter_events_for_client(
  File "/usr/local/lib/python3.9/site-packages/synapse/visibility.py", line 85, in filter_events_for_client
    event_id_to_state = await storage.state.get_state_for_events(
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/state.py", line 676, in get_state_for_events
    event_to_groups = await self._get_state_group_for_events(event_ids)
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/state.py", line 808, in _get_state_group_for_events
    return await self.stores.main._get_state_group_for_events(event_ids)
  File "/usr/local/lib/python3.9/site-packages/twisted/internet/defer.py", line 1660, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/usr/local/lib/python3.9/site-packages/synapse/storage/databases/main/state.py", line 332, in _get_state_group_for_events
    raise RuntimeError("No state group for unknown or outlier event %s" % e)
RuntimeError: No state group for unknown or outlier event $<redacted>

This started out afflicting my bots (go-neb and now wordle), but the error persists when I try to sync with my regular user with curl -s --header "Authorization: Bearer ${access_token}" -XGET https://<redacted>/_matrix/client/r0/sync

@richvdh
Copy link
Member

richvdh commented May 11, 2022

I'm going to close this issue as I believe it was very specific to @Ezwen's situation (he had underlying postgres corruption leading to missing rows in event_to_state_groups).

Other people seeing the "No state group for unknown or outlier event":

  1. Please upgrade to 1.59.0rc1 (or wait another week for 1.59.0 to be released), and see if the symptoms change: Add a consistency check on events read from the database #12620 provides additional diagnostics.
  2. Please open a new issue describing the symptoms you see (including relevant logs), which diagnostic steps you have tried (ie, which queries you have tried, and their results).

Note that it's generally easier for us to handle duplicate reports than it is to unpick an issue which contains reports of multiple similar but slightly different problems. In other words: it's better to open a new issue than comment on an existing one unless you are sure that you have the same problem.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

8 participants