MSC2162: Signaling Errors at Bridges #2162
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,312 @@ | ||
# Signaling Errors at Bridges | ||
|
||
Sometimes bridges just silently swallow messages and other events. This proposal | ||
enables bridges to communicate that something went wrong and gives clients the | ||
option to give feedback to their users. Clients are given the possibility to | ||
retry a failed event and bridges can signal the success of the retry. | ||
|
||
## Proposal | ||
The proposal doesn't explain who sends the error event. Is it the
The thing we want here is that “the bridge” does send the message, which is not a concept that maps straight to Matrix afaik. Instead we always need a proxy user for the bridge. There are two parts to get this right: Who is eligible to represent the bridge and how to make sure this info came from the bridge? This maps to the problems of authorization and authentication. @Half-Shot mentioned he had a proposal for this via the room state, so it might be a good idea to piggyback on that. (If it is a proposal I assume there is nothing else usable for us out there.)
I'm failing to see the correlation between this and authorization for bridges (I also don't know what proposal that is). Bridges have a namespace of users and a
One user representing the bridge does send the message. It depends on who is in the room, so the answer is both of them. We can't simply say the bridge bot user as it is sometimes not joined e.g. in 1:1 conversations. Then the virtual user of your communication partner does represent the bridge and it should send the bridge error.
This causes problems (as does the regex later on) because clients won't be able to do sanity checking on errors. They don't have a concept of bridges or appservices, and would be unable to see that
The proposal should cover how much we care about random users impersonating bridges or bridges lying about their namespaces, and how we protect against that if we do care (we should). |
||
|
||
Bridges might come into a situation where there is nothing more they can do to | ||
successfully deliver an event to the foreign network they are connected to. Then | ||
they should be able to inform the originating room of the event about this | ||
delivery error. The user in turn should be able to instruct the bridge to retry | ||
sending the message that was presented to them as failed; the bridge should have the | ||
ability to mark an error as revoked. | ||
|
||
If [MSC 1410: Rich | ||
Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) is utilized for | ||
this proposal it would additionally give the benefits of | ||
|
||
- trimming the number of properties required in each bridge error event by | ||
separately providing these general infos about the bridge in the room state instead. | ||
- not requiring users representing the bridge to have admin power levels | ||
(see [Rights management](#rights-management)). | ||
|
||
### Bridge error event | ||
|
||
This document proposes the addition of a new room event with type | ||
`m.bridge_error`. It is sent by the bridge and references an event previously | ||
|
||
sent in the same room, thereby marking the original event as “failed to deliver” | ||
for all users of a bridge. The new event type utilizes reference aggregations | ||
([MSC | ||
1849](https://github.com/matrix-org/matrix-doc/blob/matthew/msc1849/proposals/1849-aggregations.md#relation-types)) | ||
to establish the relation to the event whose delivery it marks as failed. | ||
There is no need for a new endpoint as the existing `/send` endpoint will be | ||
utilized. | ||
|
||
Additional information contained in the event includes the name of the bridged | ||
We can actually drop this now, when #2346 gets merged :) |
||
network (e.g. “Discord” or “Telegram”) and a regex array¹ describing the | ||
affected users (e.g. `@discord_.*:example.org`). This regex array should be | ||
similar to the one any Application Service uses for marking its reserved user | ||
namespace. By providing this information clients can inform their users who in | ||
the room was affected by the error and for which network the error occurred. | ||
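A client could use the regex array roughly as follows to work out which room members were affected. This is only a sketch: the function name `affected_members` and the sample member list are made up for illustration, and it assumes the patterns are full regular expressions that must match the whole user ID (see the Security considerations section for why a stricter pattern language may be preferable).

```python
import re

def affected_members(patterns, members):
    """Return the room members whose user ID fully matches any pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [m for m in members if any(c.fullmatch(m) for c in compiled)]

# Hypothetical room membership, for illustration only.
members = [
    "@alice:example.org",
    "@discord_bob:example.org",
    "@discord_eve:other.org",
]
print(affected_members(["@discord_.*:example.org"], members))
```

Using `fullmatch` rather than `search` keeps a pattern from accidentally matching a substring of an unrelated user ID.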
|
||
*Those two fields will not be required if the variant with [MSC 1410: Rich | ||
Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) is adopted. In | ||
this case the same information is stored alongside other bridge metadata in the | ||
room state* | ||
|
||
There are some common reasons why an error occurred. These are encoded in the | ||
`reason` attribute and can contain the following types: | ||
|
||
* `m.event_not_handled` Generic error type for when an event cannot be handled | ||
|
||
by the bridge. It is used as a fallback when there is no other more specific | ||
reason. | ||
|
||
* `m.event_too_old` A message will – with enough time passed – fall out of its | ||
original context. In this case the bridge might decide that the event is too | ||
old and emit this error. | ||
|
||
* `m.foreign_network_error` The bridge was doing its job fine, but the foreign | ||
network permanently refused to handle the event. | ||
|
||
* `m.unknown_event` The bridge is not able to handle events of this type. It is | ||
totally legitimate to “handle” an event by doing nothing and not throwing this | ||
error. It is at the discretion of the bridge author to find a good balance | ||
between informing the user and preventing unnecessary spam. Throwing this | ||
error only for some subtypes of an event is fine. | ||
|
||
|
||
* `m.bridge_unavailable` The homeserver couldn't reach the bridge. | ||
as a subclass to this which has shown to be problematic in recent weeks: When the homeserver is also dead, the users on other homeservers will see the message as delivered when in fact it is not. I don't know if it makes total sense here given the traffic concern, but maybe flipping this proposal around for positive reactions to messages when they are delivered? Maybe a new kind of or maybe we train the general public that the bridge sending a read receipt is fine? Presumably these ideas have already been covered, so I'm curious as to what the decisions were that led to it not being used. |
||
|
||
* `m.no_permission` The bridge wanted to handle an event, but didn't have the | ||
permission to do so. | ||
|
||
The bridge error can provide a `time_to_permanent` field. If this field is | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. or |
||
present it gives the time in milliseconds one has to wait before declaring the | ||
bridge error as permanent. As long as an error is younger than this time, the | ||
client can expect the possibility of the error being revoked. If a bridge error | ||
is permanent, it should not be revoked anymore. In case this field is missing, | ||
the error will never be considered permanent. | ||
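The rule for `time_to_permanent` can be sketched as a small client-side check. The function name and the idea of comparing against the error event's timestamp (for example its `origin_server_ts`) are assumptions for illustration; the proposal itself does not say which clock the client should use.

```python
def is_permanent(error_ts_ms, time_to_permanent_ms, now_ms):
    """Decide whether a bridge error should be treated as permanent.

    error_ts_ms: timestamp of the m.bridge_error event (assumed: origin_server_ts)
    time_to_permanent_ms: value of the time_to_permanent field, or None if absent
    now_ms: current time in milliseconds
    """
    if time_to_permanent_ms is None:
        # Field missing: the error is never considered permanent.
        return False
    # Younger than the timeout: a revocation may still arrive.
    return now_ms - error_ts_ms >= time_to_permanent_ms
```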
|
||
Notes: | ||
|
||
- Nothing prevents multiple bridge error events from relating to the same event. | ||
This should be pretty common as a room can be bridged to more than one network | ||
at a time. | ||
|
||
- A bridge might choose to handle bridge error events, but this should never | ||
result in emitting a new bridge error as this could lead to an endless | ||
recursion. | ||
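The recursion guard from the second note could look like this on the bridge side. Everything here is a hypothetical sketch: `DeliveryError`, `process` and the `deliver` callback are invented names, not part of any bridge library.

```python
class DeliveryError(Exception):
    """Hypothetical exception a bridge raises when delivery fails for good."""

    def __init__(self, reason="m.event_not_handled"):
        super().__init__(reason)
        self.reason = reason


def process(event, deliver, network, affected_users):
    """Try to deliver an event; return a bridge error event dict or None."""
    try:
        deliver(event)
        return None
    except DeliveryError as err:
        if event["type"] == "m.bridge_error":
            # Never answer a bridge error with another bridge error,
            # otherwise two bridges could ping-pong errors forever.
            return None
        return {
            "type": "m.bridge_error",
            "content": {
                "network": network,
                "affected_users": affected_users,
                "reason": err.reason,
                "m.relationship": {
                    "rel_type": "m.reference",
                    "event_id": event["event_id"],
                },
            },
        }
```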
|
||
The need for this proposal arises from a gap between the Matrix network and | ||
other foreign networks it bridges to. Matrix with its eventual consistency is | ||
unique in having a message delivery guarantee. Because of this property there is | ||
no need in the Matrix network itself to model the failure of message delivery. | ||
Arguably there is for some obscure cases. For instance, if you set up a bridge in a read only room, the bridge might not be able to post messages. However, the foreign network likely doesn't support indicating failure so the room is the next best option to flag a potential problem. How it does that starts stepping into very questionable territory (extensible profiles, auth rules which support a limited set of per-user events, etc).
I'd probably de-scope this for the sake of this MSC landing any time soon.
But if we are de-scoping, the MSC text needs to be updated to reflect that and the edge cases that, like @turt2live mentioned, can still be encountered.
I am still not quite sure if I properly get what you are saying, but I think it is now discussed in the Rights management section. |
||
This need only arises for interactions with foreign networks where message | ||
delivery might fail. This proposal extends Matrix to be aware of these error | ||
cases. | ||
|
||
Additionally there might be some operational restrictions of bridges which might | ||
make it necessary for them to refrain from handling an event, e.g. when hitting | ||
memory limits. In this case the new event type can be used as well. | ||
|
||
This is an example of how the new bridge error might look: | ||
|
||
```
{
  "type": "m.bridge_error",
  "content": {
    "network": "Discord",
    "affected_users": ["@discord_.*:example.org"],
    "reason": "m.bridge_unavailable",
    "time_to_permanent": 900,
    "m.relationship": {
      "rel_type": "m.reference",
      "event_id": "$some:event.id"
    }
  }
}
```
|
||
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\ | ||
¹ Or similar – see [Security Considerations](#security-considerations) | ||
|
||
### Retries and error revocation | ||
Retries might be better suited for a dedicated MSC given the complexity. |
||
|
||
Providing a way to retry a failed message delivery gives the sender control over | ||
the importance of her message. An extra procedure for a retry is necessary as | ||
the message might have been delivered to some users (those not on the bridge) | ||
and this would produce duplicate messages for them. | ||
|
||
A retry request is posted by the client to the room for all bridges to see, | ||
referencing the original event. By inspecting the senders of all related | ||
`m.bridge_error` events, each bridge can determine whether it is the one | ||
responsible. The responsible bridge re-fetches the original event and tries | ||
again to deliver it. | ||
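A bridge might decide whether a retry request is addressed to it along these lines. This is a sketch under assumptions: `is_responsible` and the `is_own_user` predicate (for example a check against the bridge's own user namespace) are hypothetical, and `related_events` stands for whatever list of events referencing the original event the bridge has fetched.

```python
def is_responsible(retry_event, related_events, is_own_user):
    """True if this bridge sent an m.bridge_error for the retried event."""
    target = retry_event["content"]["m.relationship"]["event_id"]
    return any(
        e["type"] == "m.bridge_error"
        and e["content"]["m.relationship"]["event_id"] == target
        and is_own_user(e["sender"])
        for e in related_events
    )
```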
|
||
A successful retry should be communicated by revoking (not redacting) the | ||
I assume the reason we redact is because we still want the original error information to be given to clients (i.e. which bridge was it, when did the error happen etc. However that wasn't clear until I read a few paragraphs down. If this is the reason, could you briefly mention it here? |
||
original error that made the retry necessary. Revocation is done by an event | ||
with the type `m.bridge_error_revoke` which references the original event. The | ||
error(s) having a sender of the same bridge as the revocation event are | ||
considered revoked. Clients can show a revocation message, e.g. “Delivered to | ||
Discord at 14:52.”, beside the original event. | ||
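On the client side, working out which errors are still in effect could be sketched like this. It assumes the related events are given oldest first and, as a simplification, treats "same bridge" as "same sender user ID"; a real client would need a better notion of bridge identity. The name `active_errors` is made up for this example.

```python
def active_errors(related_events):
    """Return the bridge errors that have not been revoked.

    related_events: events referencing the original event, oldest first.
    Assumption: errors and revocations of one bridge share a sender.
    """
    pending = {}
    for e in related_events:
        if e["type"] == "m.bridge_error":
            pending.setdefault(e["sender"], []).append(e)
        elif e["type"] == "m.bridge_error_revoke":
            # A revocation clears all earlier errors from the same sender.
            pending.pop(e["sender"], None)
    return [err for errs in pending.values() for err in errs]
```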
|
||
On an unsuccessful retry the bridge may edit the error's content to reflect the | ||
new state, e.g. because the type of error changed or to communicate the new | ||
time. | ||
|
||
Example of the new retry events: | ||
|
||
```
{
  "type": "m.bridge_retry",
  "content": {
    "m.relationship": {
      "rel_type": "m.reference",
      "event_id": "$original:event.id"
    }
  }
}
```
|
||
```
{
  "type": "m.bridge_error_revoke",
  "content": {
    "m.relationship": {
      "rel_type": "m.reference",
      "event_id": "$original:event.id"
    }
  }
}
```
|
||
Overview of the relations between the different event types: | ||
|
||
```
              m.references
 ________________     _____________________
|                |   |                     |
| Original Event |-+-| Bridge Error        |
|________________| | |_____________________|
                   |  _____________________
                   | |                     |
                   +-| Retry Request       |
                   | |_____________________|
                   |  _____________________
                   | |                     |
                   +-| Bridge Error Revoke |
                     |_____________________|
```
|
||
A retry might not make much sense for every kind of error, e.g. retrying | ||
`m.unknown_event` will probably result in the same error again. Clients may | ||
choose to disable retry options for those cases, but it is not restricted | ||
otherwise. | ||
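A client that wants to hide the retry option for such cases might do something like the following. Which reasons end up in the set is a client decision; this sketch only lists `m.unknown_event` because that is the one example given above, and the names are hypothetical.

```python
# Assumption: only m.unknown_event is treated as pointless to retry.
RETRY_UNLIKELY_TO_HELP = {"m.unknown_event"}

def offer_retry(error_content):
    """True if the client should show a retry option for this error."""
    return error_content.get("reason") not in RETRY_UNLIKELY_TO_HELP
```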
|
||
### Special case: Unavailable bridge | ||
|
||
In the case the bridge is down or otherwise disconnected from the homeserver, it | ||
naturally has no way to inform its users about the unavailability. In this case | ||
the homeserver can stand in as an agent for the bridge and answer requests in | ||
its absence. | ||
|
||
For this to happen, the homeserver will send out a bridge error event at the | ||
moment a transaction delivery to the bridge fails. The clients at this point | ||
will start showing an error. When the bridge comes back online it will encounter | ||
a higher-than-normal load as all events accumulated over the downtime are | ||
flooding in. To handle this scenario well, the bridge will want to simply | ||
discard all messages older than a given threshold and not bother with sending | ||
any answer back. | ||
|
||
By including a timeout in the `time_to_permanent` field of the event, the client | ||
will know without further feedback from the homeserver or bridge when the | ||
message won't be delivered anymore. | ||
|
||
For those events still accepted by the bridge, the error must be revoked by an | ||
`m.bridge_error_revoke` as described in the previous section. | ||
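The catch-up behaviour of a bridge coming back online can be sketched as below. The function `catch_up`, the `deliver` callback and the `max_age_ms` threshold are assumptions for illustration; the sketch also assumes each queued event carries `origin_server_ts` and `event_id`.

```python
def catch_up(queued_events, now_ms, max_age_ms, deliver):
    """Process the backlog after downtime; return the revocations to send."""
    revocations = []
    for event in queued_events:
        if now_ms - event["origin_server_ts"] > max_age_ms:
            # Too old: drop silently. Clients rely on time_to_permanent
            # to learn that the message will not be delivered anymore.
            continue
        deliver(event)
        revocations.append({
            "type": "m.bridge_error_revoke",
            "content": {
                "m.relationship": {
                    "rel_type": "m.reference",
                    "event_id": event["event_id"],
                },
            },
        })
    return revocations
```

For this to line up with what clients display, the bridge's threshold should not be shorter than the `time_to_permanent` value the homeserver put into its error events.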
|
||
**Note:** For this to work, the homeserver is required to impersonate a user of | ||
the bridge as it has no agent of its own. The impersonated user would be the | ||
bridge bot user or one of the virtual users in the bridge's namespace. | ||
|
||
### Rights management | ||
|
||
Only bridges should be allowed to send bridge errors and revocations. | ||
|
||
Utilizing the rights system of the room provides a good approximation to this | ||
behavior. It is fine to use it under the assumptions that | ||
|
||
- `m.bridge_error` and `m.bridge_error_revoke` require admin power levels. | ||
by default they don't, unless you are expecting this to go into a whole new room version (which is a much harder sell) |
||
- there is always the bridge bot user or a virtual user in the bridge's | ||
clients cannot check this |
||
namespace present in the room. | ||
- at least one of those users possesses admin power level. | ||
- all users with admin power levels are trusted. | ||
|
||
In short, this requires giving bridges admin power levels in a room and trusting | ||
them to restrict their actions to their own business. It is enough to have one | ||
privileged bridge user in the room. In public rooms this is most commonly the | ||
bridge bot user with admin power level available and in 1:1 conversations it is | ||
More commonly the bridge does not have any kind of power in the room. When bridges are admins, they are often added through Scalar which makes this decision for them - the bridges themselves do not acquire power to operate. There's also plenty of bridges which are not represented in Scalar, which has led to a majority of rooms not having appropriate permissions for all bridges. |
||
the puppeted conversation partner which does generally have admin power levels | ||
as well. | ||
|
||
As long as the above assumptions are met, it is fine to not explicitly denote | ||
bridges and bridge users as such and simply rely on the power levels for access | ||
control to the new events. | ||
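Under the power-level variant, the access check amounts to the usual comparison against the content of the room's `m.room.power_levels` event. A minimal sketch, assuming the new event types are sent as ordinary (non-state) events and that the room has listed them under `events` with an admin level; the function name `can_send` is made up:

```python
def can_send(power_levels, user_id, event_type):
    """Check a user's right to send a non-state event of the given type.

    power_levels: content of the room's m.room.power_levels event.
    """
    required = power_levels.get("events", {}).get(
        event_type, power_levels.get("events_default", 0)
    )
    level = power_levels.get("users", {}).get(
        user_id, power_levels.get("users_default", 0)
    )
    return level >= required


# Hypothetical room configuration for illustration.
levels = {
    "events": {"m.bridge_error": 100, "m.bridge_error_revoke": 100},
    "users": {"@discord_bot:example.org": 100},
}
```

Note that if the room does not list the new event types under `events`, the check falls back to `events_default`, so any member allowed to send messages could also send bridge errors.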
|
||
An alternative for the above solution is the adoption of [MSC 1410: Rich | ||
Bridging](https://github.com/matrix-org/matrix-doc/issues/1410). It stores | ||
information about users' affiliation to a bridge in the room state. Instead of | ||
checking power levels of users, rich bridging can be utilized by checking the | ||
room state and only allowing valid representatives of the bridge to send bridge | ||
errors and their revocations. This alternative has the advantage of not | ||
requiring agents of the bridge to be powerful. They would be verifiable and | ||
could be trusted without any restrictions regarding their power levels. | ||
|
||
## Tradeoffs | ||
|
||
Without this proposal, bridges could still inform users in a room that a | ||
delivery failed by simply sending a plain message event from a bot account. This | ||
possibility carries the disadvantage of conveying no special semantic meaning | ||
with the consequence of clients not being able to adapt their presentation. | ||
|
||
A fixed set of error types might be too restrictive to express every possible | ||
condition. An alternative would be a free-form text for an error message. This | ||
brings the problems of less semantic meaning and a requirement for | ||
internationalization with it. In this proposal a generic error type is provided | ||
for error cases not considered in this MSC. | ||
|
||
The nature of a retry request from a client to the bridge lends itself more to an | ||
ephemeral type of transport than to something permanent like a PDU, but this was | ||
advised against because The Spec doesn't make implementations of new EDU types | ||
easy. Application Services in general don't allow listening to EDUs, so further | ||
changes to The Spec would be necessary before following the probably more | ||
appropriate route here. | ||
|
||
A new event type `m.bridge_error_revoke` is introduced for revoking a bridge | ||
error. Alternatively it could be considered to redact the bridge error event, | ||
which would eliminate the need for the revocation event and would make this | ||
proposal a little simpler. The disadvantage of this approach is the missing | ||
transparency and context of who had which information at which point in time. | ||
This additional information should make for a better user experience. | ||
|
||
## Potential issues | ||
|
||
When the foreign network is not the cause of the error signaled but the bridge | ||
itself (maybe under load), there might be an argument that responding to failed | ||
messages increases the pressure. | ||
|
||
Another potential issue is that this doesn't convey any error information if messages failed to send due to the bridge being down completely (as the bridge is unable to send the error messages).
When the bridge comes back online, it will receive the missed events from the HS, so they might be handled after all. This would be only temporary and by that explicitly not covered by this proposal. The big thing to tackle here would be a mechanism to signal delivery delays which would add to the core Matrix network as well.
I still think this is something that should be addressed or at least mentioned in the proposal. If the homeserver cannot send the event to the bridge, it should send an error event on its behalf (which the bridge can later redact).
Seems like it is mentioned here? https://github.com/matrix-org/matrix-doc/pull/2162/files#diff-3fc0af60441d4268c3ff475f9e03fb4cR192 |
||
## Security considerations | ||
|
||
Sending a custom regex with an event might open the doors for attacking a | ||
homeserver and/or a client by exposing a direct pathway to the complex code of a | ||
regex parser. Additionally sending arbitrary complex regexes might make Matrix | ||
more vulnerable to DoS attacks. To mitigate these risks it might be sensible to | ||
only allow a more restricted subset of regular expressions by e.g. requiring a | ||
maximal length or falling back to simple globbing. | ||
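The globbing fallback with a length limit could be sketched as follows. The limit of 128 characters and the function name are arbitrary assumptions for illustration; the proposal does not fix either. Python's `fnmatchcase` is used here only as a convenient stand-in for a simple glob matcher.

```python
from fnmatch import fnmatchcase

MAX_PATTERN_LENGTH = 128  # assumption: not specified by the proposal

def matches_affected_user(pattern, user_id):
    """Match a user ID against a glob pattern, rejecting overlong patterns."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        return False
    # Case-sensitive glob match: `*` matches any run of characters.
    return fnmatchcase(user_id, pattern)
```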
This should be worked out in this MSC, as we'll state in the spec whether a field supports full regex or only simple globbing. @Half-Shot would a bridge ever need more than globbing for calling out affected users? Currently application service registration allows for full regex parsing (https://matrix.org/docs/spec/application_service/unstable#registration). But this is on the bridge side, and thus if it kills the homeserver, it was the homeserver operator that was at fault for using a bad registration file. Things are entirely different from the C-S API side.
If using globbing, there would be a conversion needed from the AS regex, which it should be based on. As the regex language is more powerful than globbing, some simplifications/hacks/heuristics are required there. Or having the bridge user add it in two different forms manually… |
||
|
||
When utilizing power levels instead of building on [MSC 1410: Rich | ||
Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) a malicious user | ||
who has enough power to send `m.bridge_error` or `m.bridge_error_revoke` is able | ||
to impersonate a bridge. She will be able to wrongly mark messages as failed to | ||
deliver or revoke errors when they were not successfully retried. | ||
|
||
## Conclusion | ||
|
||
In this document an event is proposed for bridges to signal errors and a way to | ||
retry and revoke those errors. The event informs the affected room about which | ||
message errored for which reason; it gives information about the affected users | ||
and the bridged network. By implementing the proposal Matrix users will get more | ||
insight into the state of their (un)delivered messages and thus they will become | ||
less frustrated. |
The proposal does not cover how bridges de-flag errors (eventual success in sending a message). I am assuming they redact their original error event.
I believe this proposal is only handling the case of when the bridge gives up trying to send.
that sounds sub-par tbh. We'd need a retry mechanism so that users aren't left stranded, or at the very least support redaction as a way to indicate clearing of the error.
If there is a good chance that a message will eventually be delivered, I don't think it belongs in this proposal. We should try to unify that case with a general “delivery delay notification” solution for the whole Matrix universe so the work has to be done only once. I am currently writing a bit about what I have in mind about those “delivery delay notifications” and there can be discussion about that as well. (Also not quite sure where to have it then.) In the case of a message not being delivered with a high probability and just backing off in rare circumstances, redacting a permanent error might be adequate.
Until now I assumed the error is final and there is no retry, just a manual resend of the message. Could a bridge get the redaction and refetch the original event? Or might it be possible to simulate a resend with a no-op edit? If there is no satisfying way already, one could of course add another event type which is ignored by everyone but the bridge.
Homeservers are expected to keep retrying the appservice until it comes back alive, but that can easily be hours or even days before the service responds. Most bridges nowadays have a condition for received messages where it just drops messages which are too old, but between the time the bridge went down and the time the message was ignored the user's message was not delivered without notification.
Limiting the scope of the proposal to just fatal errors doesn't really help with communicating the bridge's status because there's many more temporary failures that people expect to hear about due to the nature of realtime communications. It's bad enough we already get complaints when it takes more than 10 seconds to send a message through 4 different points of failure to the remote network.
I am coming round to the idea of sending a temporary failure PDU for things which have failed to send and are in a retry queue of some kind. Redacting that would imply it's been sent.
Separately there is a question of if this proposal should cover how the user can indicate they want to retry a message.
Hopefully coming around enough to give the OK 😇
I'd be uncomfortable with this going into the spec if it only communicated permanent failures, because permanent failures are rare.