Initial stab at documenting soft fail #1641
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -570,6 +570,123 @@ transaction request to be responded to with an error response. | |
result in the user being considered joined. | ||
|
||
|
||
Soft failure | ||
++++++++++++ | ||
|
||
.. admonition:: Rationale | ||
|
||
It is important that we prevent users from evading bans (or other power | ||
restrictions) by creating events which reference old parts of the DAG. For | ||
example, a banned user could continue to send messages to a room by having | ||
their server send events which reference the event before they were banned. | ||
Note that such events are entirely valid, and we cannot simply reject them, as | ||
it is impossible to distinguish such an event from a legitimate one which has | ||
been delayed. We must therefore accept such events and let them participate in | ||
state resolution and the federation protocol as normal. However, servers may | ||
choose not to send such events on to their clients, so that end users won't | ||
actually see the events. | ||
|
||
When this happens it is often fairly obvious to servers, as they can see that | ||
the new event doesn't actually pass auth based on the "current state" (i.e. | ||
the resolved state across all forward extremities). While the event is | ||
technically valid, the server can choose to not notify clients about the new | ||
event. | ||
|
||
This discourages servers from sending events that evade bans etc. in this way, | ||
as end users won't actually see the events. | ||
|
||
|
||
When the homeserver receives a new event over federation it should also check | ||
whether the event passes auth checks based on the current state of the room (as | ||
well as based on the state at the event). If the event does not pass the auth | ||
checks based on the *current state* of the room (but does pass the auth checks | ||
based on the state at that event) it should be "soft failed". | ||
I think the root of much of the confusion on this PR is that the fact that soft-failed events participate in state res as normal is buried relatively deep. How about appending: "Otherwise it participates in state resolution as normal (and we rely on the state resolution algorithm to avoid malicious events influencing the state of the room)"
||
|
||
When an event is "soft failed" it should not be relayed to clients nor be | ||
referenced by new events created by the homeserver (i.e. it should not be | ||
added to the server's list of forward extremities of the room). Soft failed | ||
events are otherwise handled as usual. | ||
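As a rough illustration of the two checks described above, here is a minimal Python sketch. Everything in it is hypothetical: `classify_event`, `Outcome`, the `StateMap` shape and the toy `sender_not_banned` rule are illustrative names, not part of the specification or of any homeserver implementation, and real auth rules are far more involved. The sketch also assumes that an event which fails auth against the state at the event is rejected outright, which this section does not itself cover.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple

# Hypothetical state shape: (event type, state key) -> membership string.
StateMap = Dict[Tuple[str, str], str]


class Outcome(Enum):
    REJECTED = "rejected"        # fails auth at the state at the event (assumed)
    SOFT_FAILED = "soft_failed"  # valid at the event, fails against current state
    ACCEPTED = "accepted"        # passes both checks


@dataclass(frozen=True)
class Event:
    event_id: str
    sender: str


AuthCheck = Callable[[Event, StateMap], bool]


def sender_not_banned(event: Event, state: StateMap) -> bool:
    """Toy stand-in for the real auth rules: only checks for a ban."""
    return state.get(("m.room.member", event.sender)) != "ban"


def classify_event(
    event: Event,
    state_at_event: StateMap,
    current_state: StateMap,
    passes_auth: AuthCheck = sender_not_banned,
) -> Outcome:
    # First check: auth against the state at the event itself.
    if not passes_auth(event, state_at_event):
        return Outcome.REJECTED
    # Second check: auth against the current (resolved) state of the room.
    if not passes_auth(event, current_state):
        return Outcome.SOFT_FAILED
    return Outcome.ACCEPTED


# User X was joined at the point the event references, but is banned now.
before_ban: StateMap = {("m.room.member", "@x:example.org"): "join"}
after_ban: StateMap = {("m.room.member", "@x:example.org"): "ban"}
event_c = Event(event_id="$C", sender="@x:example.org")
result = classify_event(event_c, before_ban, after_ban)  # Outcome.SOFT_FAILED
```

A soft failed result here only tells the server to withhold the event from clients and from its forward extremities; the event is still stored and can take part in state resolution later.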
|
||
|
||
.. NOTE:: | ||
|
||
Soft failed events participate in state resolution as normal if further events | ||
are received which reference them. It is the job of the state resolution | ||
algorithm to ensure that malicious events cannot be injected into the room | ||
state via this mechanism. | ||
|
||
I've deliberately not read the other reviews on this, so apologies if I'm re-treading the same ground, but reading this pretty naively makes me think: "Why should the arrival of a new event which references the soft-failed event cause the soft-failed event to suddenly be unfailed? If an attacker can just send another event immediately after the previous one which references it in order to unfail it, what's the point of the soft-fail in the first place?" I assume the answer here is "it doesn't unfail the first event", but this really isn't clear to a casual observer - especially as it says the soft-failed event will be sent to clients & participate in state res, which makes it sound pretty unfailed.

I've tried to clarify further, but it's worth noting a few things:
What is the intended legitimate use-case for first ignoring an event, excluding it from state, and then accepting it for inclusion?

I don't think it's really "excluded" from state per se. State is defined with respect to an event, so if no event references the soft-failed event, then there isn't really any place where the soft-failed event would show up in state anyways (other than in the state with respect to the soft-failed event, but if we aren't doing anything with it yet, then we can just ignore it for now).

One example of a situation where this occurs is: Alice is on server A, Bob is on server B. Alice and Bob are in some room, along with some other servers. Alice is admin, and Bob is a mod. Bob gets taken over by mind-controlling aliens, starts spewing random garbage into the room, and sets the room topic to nonsense. Alice mutes Bob (which also deops him in the room). The mute happens at roughly the same time as Bob setting the room topic, such that some servers might see the mute before the topic change, and some servers might see the topic change before the mute. If every server sees the same order of events, then everything is fine. However, say that Carol's server receives Bob's first topic change before it receives the mute event, and they send a "WTF?" message, which references that topic change as a prev_event. What is the state at that message event? All servers should see the same state for that event, so the topic change has to be taken into consideration when resolving the state at that event.
(This would also have the side effect of letting Alice know that some users in the room are seeing a new topic, as she would otherwise not know unless she was a server admin and trawled through the DAG.) Now if Alice sends a message in reply to Carol's message, which references (possibly via other events) both Carol's event and the mute event, then ideally state resolution would resolve the topic back to the pre-alien state. But dealing with this situation at this point is up to the state resolution algorithm.

But that isn't really an example of a soft-fail. The state of the room at the point of every event in your example was always valid and nothing was ever ignored until being overridden by having the v2 algorithm determine a winning branch at the event which joined Alice's to Bob+Carol. That's fine, but there's more being described here. First, some notion of soft-fail already exists in synapse and is owed some documentation (considering the title of this issue). The behavior apparent in synapse persists events which are invalid against some "soft" logic test. This is in contrast with a "hard" logic test: e.g. an invalid signature causes an unambiguous rejection of an event. The "soft" failure instead can be triggered by violating room logic: e.g. sending an event requiring a power_level not granted to the sender. Soft-failure causes the event to be persisted and maintained, but inert without affecting the room. Soft-failed events are not immediately included in federation state and backfill responses, but are presented to servers by the make_join endpoint as a so-called "forward extremity" to be referenced in the next join event.
Thus a soft-failed event has the potential to be integrated in the DAG by other servers who don't know or care to agree to its failure -- those cases founding various incoherencies and the so-called "state reset" phenomenon. What is being described here by @erikjohnston is logic which immediately considers an event in violation of some rule which renders it inert (ignored), but at the same time it must also be persisted to maintain a traceable graph. Even if a choice is made to not persist it, other servers may persist it and include it as a sole reference in the graph. I'll quote the text to be specific:
@uhoreg This is describing a failure condition at the point of the event. Your example, in contrast, had no such failure condition at the point the event (the topic change by Bob was valid when Bob issued it, and when Carol witnessed it).
This is describing behavior which renders the event inert at the point of the event. @uhoreg your example in contrast has already relayed the event to the client and has considered it a valid referable for further events issued by Bob and Carol.
Here's where it gets hairy. This is where the server shifts its philosophy from being an independent thinker, having ignored the effects of the event and refusing to gossip it, to being something more of a bovine crowd-follower. While @uhoreg describes very desirable eventually consistent logic blessed by the v2 algorithm I don't think it quite completes the documentation of this soft-failure behavior as written nor even the purpose of this design. There are further complications here: foremost an ambiguity of what gets sent to the client, when and why (an action which cannot really be rolled back in the current protocol, though I wish it were possible) thus leading to a lot of bad things.
I'll actually concede that in the current system (even v2) there is no limitation to further state resolutions unfailing a soft-failed event perhaps even recursing all the way back to the room's create event. I still maintain that without any realistic limitation this is trouble on many levels :(
That may very well be acceptable, but it's a degradation of the robustness gained from soft-failure; in a system structured as a [nearly] linked-list it's important that bugs, regressions, spec pitfalls and implementation ambiguities don't split the room. This can be accidental or it can be exploited for denial of service. So before the rules become more rigid this should be given some thought. Realistically though, especially in the current system, it's very plausible that a majority faction of servers will accept a history containing a bad event with a single reference and "move on" for any number of reasons and your server is either left to deal with that or sit in a deep freeze at the point right before that bad event. Since this isn't bitcoin, and specifically we're dealing with communication, perhaps robustness-oriented solutions aren't such a bad philosophy.
Let's consider just this case then for a moment; so we're just considering events which started as valid but transitioned to invalid due to a revelation brought by a later event:
Such an event transitioned from valid to invalid because the light-cone was enlarged to invalidate whatever had been validating it (i.e. its closest power_levels event). Now we're holding it in limbo in case a further revelation enlarges the light-cone invalidating the event which invalidated our failed event, transitioning it back to valid. The crux of the problem here is with the last part: "transitioning it back to valid" (or as you said "pulls them in"). What is the formality that determines the event ought to become valid again after having transitioned to invalid from initially being valid? This cherry-picking seems arbitrary at this point. Most eventually consistent systems invalidate entire branches unless very specific granular recombination behavior is specified.
These statements appear in conflict to me. The soft-failed event can be part of the DAG and reachable from some later event which revoked whatever initially auth'ed it...
FTR, I looked up what the spec says about what to do with events that don't pass auth based on the state at the event itself, which looks like it persists the events but basically turns them into no-ops (if they're state events, they never affect state). So, it seems to be closer to what @jevolk suggests than what I had said.
The proposal is written from the point of view of a specific server. In the situation described in this proposal, an event (e.g. Bob's topic change) only gets soft-failed (or whatever terminology we're using now) if the event that revoked privileges (e.g. Alice muting Bob) arrives at that server before the event that gets soft-failed. So from the point of view of that server, the event does not really transition from valid to invalid. Bob's topic change arrives at a server that thinks that Bob should currently be muted in the room, and the server needs to decide what to do with that event. So at this point, the server does not know if the event is valid or not. The proposal here is that in that situation, the server should hold onto it but avoid referencing it or passing it on for now. If it is a state event, and the server receives new information indicating that the event was not a moderation evasion event (where the "new information" is in the form of some other server sending a valid event that references it, suggesting that the server had received the topic change event before the muting event), then it will consider the soft-failed event in state resolution. |
||
|
||
.. NOTE:: | ||
|
||
Because soft failed state events participate in state resolution as normal, it | ||
is possible for such events to appear in the current state of the room. In | ||
that case the client should be told about the soft failed event in the usual | ||
way (e.g. by sending it down in the ``state`` section of a sync response). | ||
|
||
|
||
.. NOTE:: | ||
|
||
A soft failed event should be returned in response to federation requests | ||
where appropriate (e.g. in ``/event/<event_id>``). Note that soft failed | ||
events are returned in ``/backfill`` and ``/get_missing_events`` responses | ||
only if the requests include events referencing the soft failed events. | ||
|
||
|
||
.. admonition:: Example | ||
|
||
As an example consider the event graph:: | ||
|
||
A | ||
/ | ||
B | ||
|
||
where ``B`` is a ban of a user ``X``. If the user ``X`` tries to set the topic | ||
by sending an event ``C`` while evading the ban:: | ||
|
||
A | ||
/ \ | ||
B C | ||
|
||
servers that receive ``C`` after ``B`` should soft fail event ``C``, and so | ||
will neither relay ``C`` to their clients nor send any events referencing ``C``. | ||
|
||
If later another server sends an event ``D`` that references both ``B`` and | ||
``C`` (this can happen if it received ``C`` before ``B``):: | ||
|
||
A | ||
/ \ | ||
B C | ||
\ / | ||
D | ||
|
||
then servers will handle ``D`` as normal. ``D`` is sent to the servers' | ||
clients (assuming ``D`` passes auth checks). The state at ``D`` may resolve to | ||
a state that includes ``C``, in which case clients should also be told that | ||
the state has changed to include ``C``. (*Note*: This depends on the exact | ||
state resolution algorithm used. In the original version of the algorithm | ||
``C`` would be in the resolved state, whereas in later versions this may not | ||
whilst this is more accurate than the proposed change, it comes across as very cryptic. can we ground it in concrete examples by saying "in later versions the algorithm will try to prioritise the ban over the topic change" or something?

I was really trying to avoid talking about all the different state res algorithms and what they do, and just saying "hey, multiple versions exist"
I sort of see what you mean, but I don't think that is any less cryptic for those who don't know about state resolution algorithms tbh
||
be the case.) | ||
|
||
Note that this is essentially equivalent to the situation where one server | ||
doesn't receive ``C`` at all, and so asks another server for the state of the | ||
``C`` branch. | ||
|
||
Let's go back to the graph before ``D`` was sent:: | ||
|
||
A | ||
/ \ | ||
B C | ||
|
||
If all the servers in the room saw ``B`` before ``C`` and so soft fail ``C``, | ||
then any new event ``D'`` will not reference ``C``:: | ||
|
||
A | ||
/ \ | ||
B C | ||
| | ||
D' | ||
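The effect on forward extremities in the example above can be sketched as follows. This is a hypothetical model: `RoomExtremities` and its methods are illustrative names rather than an API from the spec or any implementation, and it assumes that a soft failed event leaves the extremity set untouched.

```python
from typing import Iterable, List, Set


class RoomExtremities:
    """Toy tracker for one room's forward extremities (hypothetical)."""

    def __init__(self) -> None:
        self._extremities: Set[str] = set()

    def add_event(
        self,
        event_id: str,
        prev_events: Iterable[str],
        soft_failed: bool = False,
    ) -> None:
        if soft_failed:
            # Assumption: a soft failed event is stored elsewhere but never
            # becomes an extremity and does not alter the existing ones.
            return
        # Events referenced by the new event stop being extremities ...
        self._extremities.difference_update(prev_events)
        # ... and the new event becomes one.
        self._extremities.add(event_id)

    def prev_events_for_new_event(self) -> List[str]:
        """Events that a locally created event would reference."""
        return sorted(self._extremities)


room = RoomExtremities()
room.add_event("A", [])
room.add_event("B", ["A"])                    # the ban of user X
room.add_event("C", ["A"], soft_failed=True)  # X's topic change, soft failed

# A new local event D' references only B, never C.
prev_for_d_prime = room.prev_events_for_new_event()

# If another server later sends D referencing both B and C, D is handled
# as normal and becomes the only forward extremity.
room.add_event("D", ["B", "C"])
prev_after_d = room.prev_events_for_new_event()
```

In this model `C` is only ever pulled into the DAG seen by local events through an event such as `D` sent by another server.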
|
||
|
||
Retrieving event authorization information | ||
++++++++++++++++++++++++++++++++++++++++++ | ||
|
||
|
I do think that giving this a name other than "soft failure" might help. Maybe "event quarantine"?
Yeah, quite possibly. Though event quarantine in my mind feels more severe than "failure" tbh
I personally like "quarantine" better than "soft failure"