Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

In one room: "User not in room" when sending messages, "Not a known room" when trying to leave #9666

Closed
Ezwen opened this issue Mar 22, 2021 · 32 comments
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Ezwen
Copy link

Ezwen commented Mar 22, 2021

Description

Since an upgrade of Synapse to 1.30.0 (and a server restart), I encounter significant problems in a specific room (which before worked without any problem). Here are the symptoms:

  • If I try to send a message in the room using Element, I get the message User @XXX:YYY not in room FhnTYZhdgIipdYUdar:otherserver.org (None), where @XXX:YYY is my userid.
  • If I try to leave the room, I get the message Not a known room
  • I can still read the history, and if someone sends a message to the room, I can see it as well.

Steps to reproduce

  • install synapse 1.30.0
  • reboot synapse
  • try to send a message into one of my rooms (note: only this specific room triggers these symptoms)

What I expect: the message is sent successfully
What I get: the error message shown above

Version information

  • Homeserver: my own homeserver

If not matrix.org:

  • Version: 1.30.0

  • Install method: Docker

  • Platform: Debian 9 (stretch), bare metal
@Ezwen
Copy link
Author

Ezwen commented Mar 22, 2021

When I try sending a message in said room, I have this message in my synapse logs:

synapse.handlers.message - 945 - WARNING - PUT-3537- Denying new event <FrozenEvent event_id='$161644359715JklbV:myserver.fr', type='m.room.message', state_key=None> b
ecause 403: User @XXX:YYY not in room !FhnTYZhdgIipdYUdar:otherserver.org (None)

@Ezwen
Copy link
Author

Ezwen commented Mar 22, 2021

I have checked using the synapse admin API, and my user account is definitely in the room.

@erikjohnston
Copy link
Member

That is odd, seems like the server is confused about the state of the room. Can you check that you have rows for the room in the event_forward_extremities table please?

@erikjohnston erikjohnston added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. X-Needs-Info This issue is blocked awaiting information from the reporter labels Mar 23, 2021
@Ezwen
Copy link
Author

Ezwen commented Mar 23, 2021

Here is what I have:

$ docker-compose exec db psql --user matrix_synapse -c "SELECT * FROM event_forward_extremities WHERE room_id='!FhnTYZhdgIipdYUdar:otherserver.org'"
            event_id             |              room_id
---------------------------------+-----------------------------------
 $16164587839FTJkr:otherserver.org | !FhnTYZhdgIipdYUdar:otherserver.org
(1 row)

(as in all my other messages, I've replaced the real server name by otherserver.org)

@erikjohnston
Copy link
Member

Humm. Odd. If you do an API hit to /_matrix/client/r0/rooms/{roomId}/context/{eventId} with that room/event do you see your own membership in there?

The only thing I can think of is that somehow the servers view of the "current state" has diverged from the actual state of the room.

@Ezwen
Copy link
Author

Ezwen commented Mar 23, 2021

I obtain a rather big JSON file, with events and states, where should I look for my membership?

I can only find my user id (@XXX:YYY) in /state/0/content/users, where it seems all users with a certain power level are listed − this would be consistent with the fact that @XXX:YYY is an Administrator in this room.

@erikjohnston
Copy link
Member

You should see an event with type: m.room.member and state_key: @XXX:YY with a content.membership: join, if not then that would explain why the server thinks you're not in the room.

Does _matrix/client/r0/rooms/{roomId}/state return a membership event for you?

@Ezwen
Copy link
Author

Ezwen commented Mar 23, 2021

You should see an event with type: m.room.member and state_key: @XXX:YY with a content.membership: join, if not then that would explain why the server thinks you're not in the room.

I see! Well indeed I cannot find any event matching the description you make here, so that would explain.

Does _matrix/client/r0/rooms/{roomId}/state return a membership event for you?

It gives me this result:

{"errcode":"M_FORBIDDEN","error":"User @XXX:YYY not in room !FhnTYZhdgIipdYUdar:otherserver.org, and room previews are disabled"}

@erikjohnston
Copy link
Member

erikjohnston commented Mar 23, 2021

The plot thickens. Can you try doing:

SELECT * FROM current_state_events WHERE room_id = ?

and see if there are a) any rows and b) if the user ID appears and if so what the membership is

@Ezwen
Copy link
Author

Ezwen commented Mar 23, 2021

There are lots of rows, including all join events for all members of the room, and including a row with my own membership:

 $1572562716128RmfHF:myserver.fr    | !FhnTYZhdgIipdYUdar:otherserver.org | m.room.member                | @XXX:myserver.fr                 | join

(note: since the beginning I wrote both YYY and myserver.fr but both are in fact the same.)

@erikjohnston
Copy link
Member

Ok, so seems like:

  1. The current state of the room thinks you're in the room (which should be the same as the state at the forward extremities)
  2. The state at the forward extremity doesn't think you're in the room
  3. Other servers think you're in the room

I'm slightly perplexed how 1 and 2 can both be true. It's possible to get out of sync with other servers due to state resets, as I note that its a v1 room.

I'm going to have to have a think about how to unwedge this.

@Ezwen
Copy link
Author

Ezwen commented Mar 23, 2021

@erikjohnston Thank you very much for taking all this time to investigate :). If there is anything more I can do to help, please do not hesitate to ask.

I suppose there is no easy workaround to manually fix this in the database?

@erikjohnston
Copy link
Member

I suppose there is no easy workaround to manually fix this in the database?

Not off the top of my head, editing the current_state_events table would probably also require manually adding stuff to the current_state_deltas table. You could try having someone kick you from the room, which might work?

@Ezwen
Copy link
Author

Ezwen commented Mar 23, 2021

I suppose there is no easy workaround to manually fix this in the database?

Not off the top of my head, editing the current_state_events table would probably also require manually adding stuff to the current_state_deltas table. You could try having someone kick you from the room, which might work?

I've thought of doing that, but unfortunately I am an admin in this room, therefore un-kickable (at least AFAIK?)
:(

@Ezwen
Copy link
Author

Ezwen commented Mar 25, 2021

Out of curiosity, is there any plan to someday add a synapse admin command to "fix inconsistencies in the database", such as this one? It seems to me it could prove useful in many situations.

@Ezwen
Copy link
Author

Ezwen commented Mar 25, 2021

Quick update: I just discovered that I have this very same problem (with the same same error messages) in another matrix room. Since I am not an admin in this second room, I have asked the admin to kick me and to reinvite me, and this did solve the problem for this room as you expected!

Now in my case the problem only remain in this room where I am admin and therefore un-kickable.

@Ezwen
Copy link
Author

Ezwen commented Mar 27, 2021

Second quick update: I found a third matrix room with a probably (?) related problem.

It is a 1-1 room that I hadn't used in a while, and I tried sending a message and I got this error in Element desktop: No create event in auth events. The other person, who is registered on the same homeserver as I, has the same error when sending a message in this room.

On synapse's side, I have errors that look like this:

synapse.handlers.message - 945 - WARNING - PUT-177186- Denying new event <FrozenEvent event_id='$1616850716377rkoye:myserver.fr', type='m.room.message', state_key=None
> because 403: No create event in auth events

@erikjohnston
Copy link
Member

Thanks for the updates! Very odd that this has happened to multiple rooms now, though they're both v1 rooms so its possible that they're somehow being state reset (though it seems unlikely for a 1:1 room).

Out of curiosity, is there any plan to someday add a synapse admin command to "fix inconsistencies in the database", such as this one? It seems to me it could prove useful in many situations.

No, not really. In the general case it'd be hard/impossible to do, and take a very long time, but it would be plausible in this case.

I've thought of doing that, but unfortunately I am an admin in this room, therefore un-kickable (at least AFAIK?)

Has anyone else joined/left or otherwise changed state? I think that any state change might be enough to fix it, though possibly not.

@Ezwen
Copy link
Author

Ezwen commented Mar 31, 2021

Has anyone else joined/left or otherwise changed state? I think that any state change might be enough to fix it, though possibly not.

I've tried leaving/joining the room with a completely separate account, but unfortunately that did not change the situation for the first account.

What other kind of state change should I test?

@Ezwen
Copy link
Author

Ezwen commented Jun 17, 2021

Quick update about strange things that happened recently:

  • In one room where the problem was occurring, the problem solved itself mysteriously. I can now read and write messages once more. I have no idea what was the cause of this change.
  • However, in another room, the problem recently started happening (cannot write messages, "Not a known room" error when trying to leave the room, etc.), with something new: I also cannot receive new messages in that room anymore. Therefore, I cannot read nor write any event, while still being somehow "in" the room.

Additional information:

  • I am currently running synapse 1.36.0 and element 1.7.29.
  • The error message in the synapse logs when trying to write the the aforementioned room:
synapse_1         | 2021-06-17 09:49:01,637 - synapse.handlers.message - 972 - WARNING - PUT-164- Denying new event <FrozenEvent event_id='$16239161411cRWAz:server1.org', type='m.room.message', state_key=None> because 403: User @xxx:server1.org not in room !JiNJVSfIzCRjsKrudF:server2.org (None)

@richvdh richvdh removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Jul 22, 2021
@erikjohnston erikjohnston removed the T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. label Jul 26, 2021
@erikjohnston erikjohnston added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. S-Minor Blocks non-critical functionality, workarounds exist. and removed S-Major Major functionality / product severely impaired, no satisfactory workaround. labels Aug 23, 2021
@erikjohnston
Copy link
Member

Sorry for the delay in getting back with an update. This is an odd situation which is possibly due to state resets in room version 1(?). Unfortunately, we haven't had the time to do a deeper investigation into this. While the issue is concerning, we're not seeing reports from anyone else and so it hasn't made its way to the top of our priority list.

Next steps here would be to spend time looking at the state changes that are happening and generally what the room DAG is doing (on both servers) to try and see what is going on.

@piegamesde
Copy link

we're not seeing reports from anyone else

I do have the same issue 👋 cannot send messages, cannot leave. Room version is 6, and it was probably caused by some desync in the Gitter bridge (the broken room is a Gitter DM).

@Ezwen
Copy link
Author

Ezwen commented Aug 23, 2021

Sorry for the delay in getting back with an update. This is an odd situation which is possibly due to state resets in room version 1(?). Unfortunately, we haven't had the time to do a deeper investigation into this. While the issue is concerning, we're not seeing reports from anyone else and so it hasn't made its way to the top of our priority list.

No worries, I perfectly understand, I just wanted to keep you updated :)

Next steps here would be to spend time looking at the state changes that are happening and generally what the room DAG is doing (on both servers) to try and see what is going on.

Unfortunately I am personally unable to do that, but I'll remain subscribed to the issue thread in case there are updates.

@onekopaka
Copy link

I've seen the issue in a handful of rooms I administer that are on version 6, which have Half-Shot's discord bridge in them, as well as one room has the matrix-org IRC bridge.

My personal homeserver does not see all the users while the community one does see all users. Interestingly, after I try inviting users I cannot see from my personal homeserver, I will see their typing notifications but not their actual messages. I get similar entries in my logs as this comment both before and after attempting to invite them.

To try to resolve the issue, the two rooms I was having the most noticeable problems on, I tried leaving and rejoining and this did not fix the problem of not seeing those users. Looks like maybe I need to use the Delete Room Admin API to fully clear out my local view that is busted.

Additionally, some users on other homeservers have become unable to interact with the room, until I kick them and they rejoin. One user on another homeserver was completely unable to accept any invitation to one of these rooms, and they ended up creating an account on our community homeserver instead. After looking in the database, that homeserver seemed to have properly received the invitation event.

erikjohnston mentioned looking at state changes, what exactly should I be looking for?

@richvdh
Copy link
Member

richvdh commented Nov 11, 2021

Thanks for reporting. Unfortunately diagnosing this is going to require detailed familiarity with the state storage schema. Given it's only affecting a small number of servers, I'm afraid it's unlikely we'll be able to find time to investigate in the near future.

@svierne
Copy link

svierne commented Feb 8, 2022

I think I'm experiencing the same thing right now. :(

The Arch Linux channel got spammed by hundreds of bots a few days ago and since then I cannot leave the room nor send any messages. The behaviour is exactly as described in OP's first post. I can provide more details if needed.

Is there really no workaround I can try other than asking the Arch Linux staff to kick me? O_o

Thanks in advance for any update on this! (:

Edit: Actually, I cannot see any new messages posted by others.

@Joecstarr
Copy link

I was having similar behavior. I'm not positive exactly what solved it, but I'll try to document what I did.

The issue first appeared in a room ("clipboard") where I was the only user. I wasn't able to send, receive, leave, etc. Then for some reason it "spread" to the primary room my wife and I use. Neither of us could do anything in the room.

Same error messages/behavior mentioned above:

  • SynapseError: 403 - No create event in auth events
  • SynapseError: 403 - User not in room
  • SynapseError: 400 - Server not in room
  • SynapseError: 404 - No known servers

I tried to use the "dev tools" in element to force a room update. No luck.
I tried to use an appservice to force a state change. No luck.
I used the admin API to try to force a state change. No luck.

I gave up and was restoring an old dump of the database and noticed an error when importing the dump.

pg_restore: error: could not execute query: ERROR:  could not create unique index "room_account_data_uniqueness"
DETAIL:  Key (user_id, room_id, account_data_type)=(<userID>, <clipboard_roomID>, m.fully_read) is duplicated.
Command was: ALTER TABLE ONLY public.room_account_data
    ADD CONSTRAINT room_account_data_uniqueness UNIQUE (user_id, room_id, account_data_type);

So I went back to the "broken" database and deleted the extra row. Everything seems to work now, both the "main" room and my "clipboard" room.

The errors only started when I added a database backup script. I think that might be the root cause for me. I lost access to the "main" room around the time the backup script should run. Maybe there's a race condition somewhere.

@Ezwen
Copy link
Author

Ezwen commented May 6, 2022

While dealing with another issue #12549, I ended up having to use the synapse admin API to delete the problematic rooms with force_purge : true, like this:

curl -H "Authorization: Bearer <access token>" 'https://<homeserver>/_synapse/admin/v2/rooms/<room_id>' -X DELETE -d '{"force_purge": true}'

If the room was shared over federation, you can try joining it again, if it is allowed in the room's parameters.

@richvdh
Copy link
Member

richvdh commented May 11, 2022

I think we can close this now. It turned out that @Ezwen had been suffering corruption in the underlying database (with errors from postgres such as missing chunk number 0 for toast value 143975 in pg_toast_16631, which will explain the erratic behaviour reported here.

@richvdh richvdh closed this as completed May 11, 2022
@ZhenyaPav
Copy link

@richvdh Could you advise on what steps should be taken to fix the db corruption in such case? I'm having an identical issue right now.

@skibbipl
Copy link

Same here, any help would be appreciated.

@ZhenyaPav
Copy link

Same here, any help would be appreciated.

I think I solved the issue myself by removing my user from the room in the database, then purging it

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

9 participants