Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Faster joins: incoming federation requests during resync may be incorrectly rejected #13288

Closed
richvdh opened this issue Jul 15, 2022 · 8 comments · Fixed by #13823
Closed

Faster joins: incoming federation requests during resync may be incorrectly rejected #13288

richvdh opened this issue Jul 15, 2022 · 8 comments · Fixed by #13823
Assignees
Labels
A-Federated-Join joins over federation generally suck A-Federation T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@richvdh
Copy link
Member

richvdh commented Jul 15, 2022

If a remote server sends us a federation request (such as get_missing_events, or requesting a space summary) during a state resync, we may well reject that request due to not thinking the remote server is in the room.

Possibly we need to define some return code to say "sorry, can't auth you right now".

@richvdh richvdh added this to the Faster joins (further work) milestone Jul 15, 2022
@squahtx squahtx added A-Federation A-Federated-Join joins over federation generally suck T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jul 15, 2022
@richvdh
Copy link
Member Author

richvdh commented Jul 15, 2022

Possibly we need to define some return code to say "sorry, can't auth you right now".

(or just rely on the list of servers we get during the join dance)

@erikjohnston
Copy link
Member

erikjohnston commented Jul 29, 2022

Note that if we send out an event (as per #12997) we must respond to get_missing_events correctly or we risk having our event rejected (due to the remote server not being able to get the prev events)

@reivilibre
Copy link
Contributor

reivilibre commented Sep 6, 2022

Note that if we send out an event (as per #12997) we must respond to get_missing_events correctly or we risk having our event rejected (due to the remote server not being able to get the prev events)

This is a very valid point. (I also think this is an interoperability landmine; it would be good for the spec to be very clear about what exactly a homeserver needs to be able to answer when sending an event! — I don't see anywhere where it says the sending homeserver has to be able to produce the prev_events of an event it is sending..).

We could track which events are prev_events of the events that we have sent out during the partial join and allow these to be returned to get_missing_events. This supposes that this is sufficient for getting the other server to accept our event, but it would be nice if this was rigorously defined in specification!

Possibly we need to define some return code to say "sorry, can't auth you right now".

This sounds like it might be nasty, because we can't retroactively create an error code that non-MSC2706-implementing homeservers will understand. Maybe this is OK anyway though, because other homeservers have to treat us as potentially malicious and they need to be equipped to handle strange error responses from us by falling back to other servers as appropriate.

(or just rely on the list of servers we get during the join dance)

Note that this can fall out of sync with reality. This is probably still OK

If a new server joins the room and we receive the event, we can handle that.
But we can't handle a server leaving the room, because we may see a leave event for one of the server's users, but we need to know whether it's the last user from that server (which by implication means that we need to know all of that server's users, at least in the current model).

@reivilibre
Copy link
Contributor

Given that we can't conclusively track the set of servers in the room (because we will always have to have this indeterminate state arising from a server's user leaving the room and us not knowing if it was the last user for their server), it seems like it would be simplest to just not bother (in cases where that's acceptable — answering queries for event sending auth purposes is a different matter and may need special casing so that we accept relevant queries from the servers that we send our events to, such as /get_missing_events).

This makes it tempting to indeed just define an error code that says 'can't auth you right now'. Since our server must still be syncing with another server to be successful, then it follows that there must be another server available to answer the remote homeserver's query. Either that, or the room is scuppered since we'll never be able to complete the state sync.

If we want to, we can later enhance this by answering queries where we can and where the membership of the requesting remote homeserver is determinate. However, requesting homeservers need to be able to cope with uncooperative homeservers anyway, so they should already have logic to try against other homeservers. I suggest that we just let them re-use that, for simplicity.

#12997 will then unfortunately have to gain some special casing, probably along the lines of 'we will answer relevant /get_missing_events queries for servers that we send our events out to'.

@reivilibre
Copy link
Contributor

reivilibre commented Sep 13, 2022

Here's a thought: can we modify partial-state joins so that we can keep track of which servers are in the room?

How about changing, in the MSC3706 spec,

servers_in_room: A new field of type [string], listing the servers active in the room (ie, those with joined members) before the join.

to

servers_in_room: A new field of type {string: integer}, listing the servers active in the room (ie, those with joined members) before the join and giving the number of joined members from that server.

Both of these are abstract interpretations of the set of users in the room — in the sense that they tell you something useful about the set of users in the room, but only the latter is updatable (or at least it seems that way to me):
Assuming that we, as a partial-stated homeserver, can correctly witness join→{leave, kick, ban} and {invite, kick, leave, ø}→join transitions, we can keep updated with the list of servers in the room, whilst withstanding membership changes, just by keeping a tally of the number of users per server. If the server drops down to zero, they're out. New users joining from unknown servers start the count as zero.

@reivilibre
Copy link
Contributor

reivilibre commented Sep 15, 2022

Here's a thought: can we modify partial-state joins so that we can keep track of which servers are in the room?

This was somewhat shot down: when we complete our join after doing a partial join, we may find that unrejected events should become rejected and some rejected events may become unrejected. In other words, we may accept some events which it turns out we should have not.
This erodes my confidence that we can track servers in the room correctly, so for now I would not be in favour of doing that (or at least, not using the results of that for authorisation) ­— instead keeping things simple, we'll just require servers to have fallback behaviour (which they should already have anyway) to find fully-joined servers to answer their queries.
We'll have to make an exception for the checks that other servers want to make on the events that we ourselves send into the room.

I wonder if it's actually OK to send our own events into the current list of hypothesised joined servers — it seems to me that the most important aspect is that the user is aware, but will this actually be the case? If we're basing the server memberships on a list received out-of-band from the room's state, it doesn't hold that the user will see joined members for all the servers in the room. (is this right?)
Do we need to narrow down event sending to public rooms where this isn't a concern?

@richvdh
Copy link
Member Author

richvdh commented Sep 21, 2022

Note that if we send out an event (as per #12997) we must respond to get_missing_events correctly or we risk having our event rejected (due to the remote server not being able to get the prev events)

but... in that particular case, we know the list of servers we sent the event to, so can't we just use the same list to decide whether they are allowed to get_missing_events?

@richvdh
Copy link
Member Author

richvdh commented Sep 21, 2022

Note that if we send out an event (as per #12997) we must respond to get_missing_events correctly or we risk having our event rejected (due to the remote server not being able to get the prev events)

but... in that particular case, we know the list of servers we sent the event to, so can't we just use the same list to decide whether they are allowed to get_missing_events?

@reivilibre points out that this runs the risk of us divulging room content to a server that has since left the room, but also makes a good suggestion for how to avoid this: ensure that all events that we create during the room resync are maintained in a fork of the DAG starting at our join.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federated-Join joins over federation generally suck A-Federation T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
4 participants