Initial stab at documenting soft fail #1641

Merged
merged 13 commits into from
Oct 26, 2018
Changes from 12 commits
5 changes: 5 additions & 0 deletions scripts/css/nature.css
@@ -284,3 +284,8 @@ div.admonition-rationale {
border: 1px solid #ccc;
}

div.admonition-example {
background-color: #eef;
border: 1px solid #ccc;
}

117 changes: 117 additions & 0 deletions specification/server_server_api.rst
@@ -570,6 +570,123 @@ transaction request to be responded to with an error response.
result in the user being considered joined.


Soft failure
Member

I do think that giving this a name other than "soft failure" might help. Maybe "event quarantine"?

Member Author

Yeah, quite possibly. Though event quarantine in my mind feels more severe than "failure" tbh

Member

I personally like "quarantine" better than "soft failure"

++++++++++++

.. admonition:: Rationale

It is important that we prevent users from evading bans (or other power
restrictions) by creating events which reference old parts of the DAG. For
example, a banned user could continue to send messages to a room by having
their server send events which reference the event before they were banned.
Note that such events are entirely valid, and we cannot simply reject them, as
it is impossible to distinguish such an event from a legitimate one which has
been delayed. We must therefore accept such events and let them participate in
state resolution and the federation protocol as normal. However, servers may
choose not to send such events on to their clients, so that end users won't
actually see the events.

When this happens it is often fairly obvious to servers, as they can see that
the new event doesn't actually pass auth based on the "current state" (i.e.
the resolved state across all forward extremities). While the event is
technically valid, the server can choose to not notify clients about the new
event.

This discourages servers from sending events that evade bans etc. in this way,
as end users won't actually see the events.


When the homeserver receives a new event over federation it should also check
whether the event passes auth checks based on the current state of the room (as
well as based on the state at the event). If the event does not pass the auth
checks based on the *current state* of the room (but does pass the auth checks
based on the state at that event) it should be "soft failed".
Member

I think the root of much of the confusion on this PR is the fact that soft-failed events participate in state res as normal is buried relatively deep. How about appending:

"Otherwise it participates in state resolution as normal (and we rely on the state resolution algorithm to avoid malicious events influencing the state of the room)"


When an event is "soft failed" it should not be relayed to the client nor be
referenced by new events created by the homeserver (i.e. it should not be
added to the server's list of forward extremities of the room). Soft failed
events are otherwise handled as usual.
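
A minimal sketch of the two checks described above, written in Python for illustration only. The names ``Outcome``, ``classify``, ``relay_to_clients`` and ``may_be_referenced`` are hypothetical and do not come from the specification or from any homeserver implementation; the two boolean inputs stand in for the results of the real auth checks.

```python
from enum import Enum


class Outcome(Enum):
    REJECTED = "rejected"        # fails auth against the state at the event
    SOFT_FAILED = "soft_failed"  # passes at the event, fails against current state
    ACCEPTED = "accepted"        # passes both checks


def classify(passes_at_event: bool, passes_at_current_state: bool) -> Outcome:
    """Combine the two auth results into one outcome (hypothetical helper)."""
    if not passes_at_event:
        return Outcome.REJECTED
    if not passes_at_current_state:
        return Outcome.SOFT_FAILED
    return Outcome.ACCEPTED


def relay_to_clients(outcome: Outcome) -> bool:
    """Only fully accepted events are relayed to clients on arrival."""
    return outcome is Outcome.ACCEPTED


def may_be_referenced(outcome: Outcome) -> bool:
    """Only fully accepted events become forward extremities of the room."""
    return outcome is Outcome.ACCEPTED


# An event from a user who was banned after the event's prev_events:
outcome = classify(passes_at_event=True, passes_at_current_state=False)
```

Here ``outcome`` is ``Outcome.SOFT_FAILED``, so the event is kept but is neither relayed to clients nor referenced by the server's own new events.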


.. NOTE::

Soft failed events participate in state resolution as normal if further events
are received which reference them. It is the job of the state resolution
algorithm to ensure that malicious events cannot be injected into the room
state via this mechanism.

Member

I've deliberately not read the other reviews on this, so apologies if i'm re-treading the same ground, but reading this pretty naively makes me think:

"Why should the arrival of a new event which references the soft-failed event cause the soft-failed event to suddenly be unfailed? If an attacker can just send another event immediately after the previous one which references it in order to unfail it, what's the point of the soft-fail in the first place?"

I assume the answer here is "it doesn't unfail the first event", but this really isn't clear to a casual observer - especially as it says the soft-failed event will be sent to clients & participate in state res, which makes it sound pretty unfailed.

Member Author

I've tried to clarify further, but it's worth noting a few things:

  1. These events aren't being rejected, just ignored. If a legitimate event comes in that references them then they will start becoming part of the state of the room, and we should probably not keep ignoring them, e.g. if the soft failed events become part of state. State resolution v2 will make it a lot harder for this to allow abuse, fwiw

  2. If an attacker sends a second event then that'll almost certainly be soft failed and ignored too

Contributor

What is the intended legitimate use-case for first ignoring an event, excluding it from state, and then accepting it for inclusion?

Member

I don't think it's really "excluded" from state per se. State is defined with respect to an event, so if no event references the soft-failed event, then there isn't really any place where the soft-failed event would show up in state anyways (other than in the state with respect to the soft-failed event, but if we aren't doing anything with it yet, then we can just ignore it for now).

One example of a situation where this occurs is: Alice is on server A, Bob is on server B. Alice and Bob are in some room, along with some other servers. Alice is admin, and Bob is a mod. Bob gets taken over by mind-controlling aliens, starts spewing random garbage into the room, and sets the room topic to nonsense. Alice mutes Bob (which also deops him in the room). The mute happens at roughly the same time as Bob setting the room topic, such that some servers might see the mute before the topic change, and some servers might see the topic change before the mute. If every server sees the same order of events, then everything is fine.

However, say that Carol's server receives Bob's first topic change before it receives the mute event, and they send a "WTF?" message, which references that topic change as a prev_event. What is the state at that message event? All servers should see the same state for that event, so the topic change has to be taken into consideration when resolving the state at that event. (This would also have the side effect of letting Alice know that some users in the room are seeing a new topic, as she would otherwise not know unless she was a server admin and trawled through the DAG.)

Now if Alice sends a message in reply to Carol's message, which references (possibly via other events) both Carol's event and the mute event, then ideally state resolution would resolve the topic back to the pre-alien state. But dealing with this situation at this point is up to the state resolution algorithm.

Contributor

But that isn't really an example of a soft-fail. The state of the room at the point of every event in your example was always valid and nothing was ever ignored until being overridden by having the v2 algorithm determine a winning branch at the event which joined Alice's to Bob+Carol. That's fine, but there's more being described here.

First, some notion of soft-fail already exists in synapse and is owed some documentation (considering the title of this issue). The behavior apparent in synapse persists events which are invalid against some "soft" logic test. This is in contrast with a "hard" logic test: e.g. an invalid signature causes an unambiguous rejection of an event. The "soft" failure instead can be triggered by violating room logic: e.g. sending an event requiring a power_level not granted to the sender.

Soft-failure causes the event to be persisted and maintained, but inert without affecting the room. Soft-failed events are not immediately included in federation state and backfill responses, but are presented to servers by the make_join endpoint as a so-called "forward extremity" to be referenced in the next join event. Thus a soft-failed event has the potential to be integrated in the DAG by other servers who don't know or care to agree to its failure -- those cases founding various incoherencies and the so-called "state reset" phenomenon.

What is being described here by @erikjohnston is logic which immediately considers an event in violation of some rule which renders it inert (ignored), but at the same time it must also be persisted to maintain a traceable graph. Even if a choice is made to not persist it, other servers may persist it and include it as a sole reference in the graph. I'll quote the text to be specific:

> If the event does not pass the auth checks it should be "soft failed".

@uhoreg This is describing a failure condition at the point of the event. Your example, in contrast, had no such failure condition at the point of the event (the topic change by Bob was valid when Bob issued it, and when Carol witnessed it).

> When an event is "soft failed" it should not be relayed to the client nor be referenced by new events created by the homeserver.

This is describing behavior which renders the event inert at the point of the event. @uhoreg your example in contrast has already relayed the event to the client and has considered it a valid referable for further events issued by Bob and Carol.

> If an event is received that references the soft failed event then the new event should be handled as usual. Soft failed state events participate in state resolution, and so can appear in the state of events that reference the soft failed state event.

Here's where it gets hairy. This is where the server shifts its philosophy from being an independent thinker, having ignored the effects of the event and refusing to gossip it, to being something more of a bovine crowd-follower. While @uhoreg describes very desirable eventually consistent logic blessed by the v2 algorithm I don't think it quite completes the documentation of this soft-failure behavior as written nor even the purpose of this design. There are further complications here: foremost an ambiguity of what gets sent to the client, when and why (an action which cannot really be rolled back in the current protocol, though I wish it were possible) thus leading to a lot of bad things.

Contributor

> some reason to be unfailed due to some later revelation and further state resolution ... violates the fundamental light-cone property of the DAG.

I'll actually concede that in the current system (even v2) there is no limitation to further state resolutions unfailing a soft-failed event perhaps even recursing all the way back to the room's create event. I still maintain that without any realistic limitation this is trouble on many levels :(

Contributor

> So if no servers accept it, then no valid events should reference it. So I think that any events that reference it should be rejected as well.

That may very well be acceptable, but it's a degradation of the robustness gained from soft-failure; in a system structured as a [nearly] linked-list it's important that bugs, regressions, spec pitfalls and implementation ambiguities don't split the room. This can be accidental or it can be exploited for denial of service. So before the rules become more rigid this should be given some thought.

Realistically though, especially in the current system, it's very plausible that a majority faction of servers will accept a history containing a bad event with a single reference and "move on" for any number of reasons and your server is either left to deal with that or sit in a deep freeze at the point right before that bad event. Since this isn't bitcoin, and specifically we're dealing with communication, perhaps robustness-oriented solutions aren't such a bad philosophy.

Contributor (@jevolk, Oct 17, 2018)

> This section is specifically dealing with events that pass auth based on the state at the event itself.

Let's consider just this case then for a moment; so we're just considering events which started as valid but transitioned to invalid due to a revelation brought by a later event:

> events that are soft-failed are held off to the side ... and don't actually get attached to the DAG until some other (valid) event pulls them in.

Such an event transitioned from valid to invalid because the light-cone was enlarged to invalidate whatever had been validating it (i.e. its closest power_levels event). Now we're holding it in limbo in case a further revelation enlarges the light-cone invalidating the event which invalidated our failed event, transitioning it back to valid.

The crux of the problem here is with the last part: "transitioning it back to valid" (or as you said "pulls them in"). What is the formality that determines the event ought to become valid again after having transitioned to invalid from initially being valid. This cherry-picking seems arbitrary at this point. Most eventually consistent systems invalidate entire branches unless very specific granular recombination behavior is specified.

Contributor

> events that are soft-failed are held off to the side ... and don't actually get attached to the DAG

> This section is specifically dealing with events that pass auth based on the state at the event itself.

> This proposal is to deal with users who used to have some privileges in a room, but had their privileges revoked.

These statements appear in conflict to me. The soft-failed event can be part of the DAG and reachable from some later event which revoked whatever initially auth'ed it...

Member

> That may very well be acceptable, but it's a degradation of the robustness gained from soft-failure; in a system structured as a [nearly] linked-list it's important that bugs, regressions, spec pitfalls and implementation ambiguities don't split the room. This can be accidental or it can be exploited for denial of service. So before the rules become more rigid this should be given some thought.

FTR, I looked up what the spec says about what to do with events that don't pass auth based on the state at the event itself, which looks like it persists the events but basically turns them into no-ops (if they're state events, they never affect state). So, it seems to be closer to what @jevolk suggests than what I had said.

> Such an event transitioned from valid to invalid because the light-cone was enlarged to invalidate whatever had been validating it (i.e. its closest power_levels event). Now we're holding it in limbo in case a further revelation enlarges the light-cone invalidating the event which invalidated our failed event, transitioning it back to valid.

The proposal is written from the point of view of a specific server. In the situation described in this proposal, an event (e.g. Bob's topic change) only gets soft-failed (or whatever terminology we're using now) if the event that revoked privileges (e.g. Alice muting Bob) arrives at that server before the event that gets soft-failed. So from the point of view of that server, the event does not really transition from valid to invalid. Bob's topic change arrives at a server that thinks that Bob should currently be muted in the room, and the server needs to decide what to do with that event. So at this point, the server does not know if the event is valid or not. The proposal here is that in that situation, the server should hold onto it but avoid referencing it or passing it on for now. If it is a state event, and the server receives new information indicating that the event was not a moderation evasion event (where the "new information" is in the form of some other server sending a valid event that references it, suggesting that the server had received the topic change event before the muting event), then it will consider the soft-failed event in state resolution.


.. NOTE::

Because soft failed state events participate in state resolution as normal, it
is possible for such events to appear in the current state of the room. In
that case the client should be told about the soft failed event in the usual
way (e.g. by sending it down in the ``state`` section of a sync response).


.. NOTE::

A soft failed event should be returned in response to federation requests
where appropriate (e.g. in ``/event/<event_id>``). Note that soft failed
events are returned in ``/backfill`` and ``/get_missing_events`` responses
only if the requests include events referencing the soft failed events.
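
One way to read the note above is that a plain backwards walk over ``prev_events`` already gives this behaviour: a soft failed event is reached only when the walk starts from an event that references it. The sketch below is an illustration under that assumption, not the ``/backfill`` algorithm itself. The function ``backfill`` and the ``prev_events`` mapping are hypothetical, and details such as depth ordering are not modelled.

```python
from collections import deque


def backfill(prev_events, start_ids, limit):
    """Breadth-first walk back through prev_events from start_ids.

    prev_events maps an event ID to the IDs it references (hypothetical
    in-memory stand-in for the server's event store).
    """
    seen, order = set(), []
    queue = deque(start_ids)
    while queue and len(order) < limit:
        event_id = queue.popleft()
        if event_id in seen:
            continue
        seen.add(event_id)
        order.append(event_id)
        queue.extend(prev_events.get(event_id, []))
    return order


# C is soft failed; only D references it.
prev_events = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

from_b = backfill(prev_events, ["B"], limit=10)  # walk never reaches C
from_d = backfill(prev_events, ["D"], limit=10)  # walk reaches C via D
```

Starting from ``B`` the walk returns only ``B`` and ``A``; starting from ``D`` it also returns ``C``, because ``D`` references it.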


.. admonition:: Example

As an example consider the event graph::

      A
     /
    B

where ``B`` is a ban of a user ``X``. If the user ``X`` tries to set the topic
by sending an event ``C`` while evading the ban::

      A
     / \
    B   C

servers that receive ``C`` after ``B`` should soft fail event ``C``, and so
will neither relay ``C`` to their clients nor send any events referencing ``C``.

If later another server sends an event ``D`` that references both ``B`` and
``C`` (this can happen if it received ``C`` before ``B``)::

      A
     / \
    B   C
     \ /
      D

then servers will handle ``D`` as normal. ``D`` is sent to the servers'
clients (assuming ``D`` passes auth checks). The state at ``D`` may resolve to
a state that includes ``C``, in which case clients should also be told that
the state has changed to include ``C``. (*Note*: This depends on the exact
state resolution algorithm used. In the original version of the algorithm
``C`` would be in the resolved state, whereas in later versions this may not
Member

whilst this is more accurate than the proposed change, it comes across as very cryptic. can we ground it in concrete examples by saying "in later versions the algorithm will try to prioritise the ban over the topic change" or something?

Member Author

I was really trying to avoid talking about all the different state res algorithms and what they do, and just saying "hey, multiple versions exist"

> "in later versions the algorithm will try to prioritise the ban over the topic change"

I sort of see what you mean, but I don't think that is any less cryptic for those who don't know about state resolution algorithms tbh

be the case.)

Note that this is essentially equivalent to the situation where one server
doesn't receive ``C`` at all, and so asks another server for the state of the
``C`` branch.

Let's go back to the graph before ``D`` was sent::

      A
     / \
    B   C

If all the servers in the room saw ``B`` before ``C`` and so soft fail ``C``,
then any new event ``D'`` will not reference ``C``::

      A
     / \
    B   C
    |
    D'
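
The example can be traced with a small Python sketch that tracks one server's forward extremities. This is an illustration only: ``receive`` and ``new_prev_events`` are hypothetical helpers, and the ``soft_failed`` flag stands in for the outcome of the auth checks against the current state.

```python
def receive(extremities, event_id, prev_events, soft_failed):
    """Return the forward extremities after handling one incoming event."""
    if soft_failed:
        # A soft failed event is kept but never becomes an extremity.
        return set(extremities)
    # An accepted event replaces the extremities it references.
    return (set(extremities) - set(prev_events)) | {event_id}


def new_prev_events(extremities):
    """prev_events a server would use for an event it creates itself."""
    return sorted(extremities)


ext = set()
ext = receive(ext, "A", [], soft_failed=False)
ext = receive(ext, "B", ["A"], soft_failed=False)  # ban of user X
ext = receive(ext, "C", ["A"], soft_failed=True)   # X's topic change

# A new local event D' references only B, never C.
d_prime_prev = new_prev_events(ext)

# Alternatively, another server's event D pulls in both B and C.
ext_after_d = receive(ext, "D", ["B", "C"], soft_failed=False)
```

With ``C`` soft failed, ``d_prime_prev`` contains only ``B``. If ``D`` arrives instead, it is handled as normal and becomes the sole forward extremity; whether ``C`` ends up in the resolved state at ``D`` is decided by the state resolution algorithm, which this sketch does not model.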


Retrieving event authorization information
++++++++++++++++++++++++++++++++++++++++++
