Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSC2162: Signaling Errors at Bridges #2162

Open
wants to merge 17 commits into
base: old_master
Choose a base branch
from

Conversation

V02460
Copy link

@V02460 V02460 commented Jul 9, 2019

Add a new room event for signaling permanent errors occurring at bridges. References MSC1849.

Rendered

Signed-off-by: Kai A. Hiller <[email protected]>
@turt2live turt2live added proposal A matrix spec change proposal proposal-in-review labels Jul 9, 2019
@turt2live turt2live self-requested a review July 9, 2019 18:44
@Ralith
Copy link
Contributor

Ralith commented Jul 9, 2019

Is the somewhat dangerous affected_users field really necessary? It seems like the main objective here is to allow clients to flag messages that were not delivered by a bridge, and that can be done without regular expressions.

@V02460
Copy link
Author

V02460 commented Jul 10, 2019

It depends on what is deemed necessary. You are right that it is the main objective to flag a message. That gives the user the information that something went wrong and adding the network name, the reason or the affected users adds more information to that.

A key question is what helps a user and what does not. Adding this general event type helps definitively imo as it shows a previously invisible problem. The question now is how a user can act upon that and this should determine what is included and what not. Here is what came to my mind of what might be a user's reaction:

  • Trying to resend the message No additional info required.
  • Reaching users via a different channel affected_users required.
  • Blaming Bug report to the bridge/the network network, reason, required

(Maybe an informed user is a goal in itself as well.)

By leaving out the affected_users attribute the second option wouldn't be possible anymore¹. These benefits must be weighted against the problems it may cause security- or otherwise. I don't find myself in the position to judge how severe of a problem a regex/regex-like addition to the protocol really is. Maybe its not so bad after all, maybe it's a really Bad Idea™. I would like to hear from more people about those points, so we can come to a conclusion.

¹ On a semantic level anyway. A user might be able to guess from the network name and the peoples nick name endings e.g. (Discord) that these were the ones affected.

@Ralith
Copy link
Contributor

Ralith commented Jul 10, 2019

I agree that it's useful to be able to tell which users in a room are there via a bridge. It might make more sense for that to be handled separately from error reporting, e.g. as done by the widely deployed IRC flair.

`m.bridge_error`. It is sent by the bridge and references an event previously
sent in a room, by that marking it as “failed to deliver” for all users of a
bridge. The new event type utilizes reference aggregations ([MSC
1849](https://github.com/matrix-org/matrix-doc/blob/matthew/msc1849/proposals/1849-aggregations.md))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads up that @ara4n is rewriting this proposal, but I don't think it will affect this one https://github.com/matrix-org/matrix-doc/pull/2154/files

@V02460
Copy link
Author

V02460 commented Jul 10, 2019

I agree that it's useful to be able to tell which users in a room are there via a bridge. It might make more sense for that to be handled separately from error reporting, e.g. as done by the widely deployed IRC flair.

That might be enough. Just wanted to note that this is slightly ambiguous as there might be more than one IRC bridge in a room.

Copy link
Member

@uhoreg uhoreg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable so far. Before it can be accepted, details about the regex format would need to be finalized.


Additional information contained in the event are the name of the bridged
network (e.g. “Discord” or “Telegram”) and a regex¹ describing the affected
users (e.g. `@discord_*:example.org`). This regex should be similar to the one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's a regex, then it should be @discord_.*:example.org.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to jump in and say my preference is to keep it as a simple glob expression, rather than having to define the rules around regexes. I doubt we plan to do more than simple matching?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to define what kind of glob it is then, as we use two types in Matrix. Also if it's not a regex, don't call it a regex.

Being able to reuse the namespace from the registration is a strong argument for supporting proper regex here (which is better defined than globs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(which is better defined than globs).

FWIW, my impression is the opposite. There are several different syntaxes for regexes. But personally, I don't really care which one is picked, as long as it's well defined.

When the foreign network is not the cause of the error signaled but the bridge
itself (maybe under load), there might be an argument that responding to failed
messages increases the pressure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential issue is that this doesn't convey any error information if messages failed to send due to the bridge being down completely (as the bridge is unable to send the error messages).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the bridge comes back online, it will receive the missed events from the HS, so they might be handled after all. This would be only temporary and by that explicitly not covered by this proposal. The big thing to tackle here would be a mechanism to signal delivery delays which would add to the core Matrix network as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this is something that should be addressed or at least mentioned in the proposal. If the homeserver cannot send the event to the bridge, it should send an error event on its behalf (which the bridge can later redact).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proposals/2162-signaling-errors-at-bridges.md Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Show resolved Hide resolved
@Ralith
Copy link
Contributor

Ralith commented Jul 10, 2019

Just wanted to note that this is slightly ambiguous as there might be more than one IRC bridge in a room.

A more formal/structured scheme for mapping users to bridges might be nice, yeah, but I really think it would be best done independently of error reporting, since it is otherwise useful and avoids a complicated and potentially hazardous feature here.

Another problem that comes to mind is spoofing. How can we guarantee that reported errors are genuine? The most obvious answer is "configure power levels such that only the bridge bots or an admin can send the event," but that's not a strong guarantee, and risks misconfiguration. One solution might be to always identify bridges by their matrix-side bridge bot. Then errors sent by the bridge bot are inherently associated with the bridge in question, and things like human-readable network identifiers and the set of affected users can be determined robustly based on that information.

@Half-Shot
Copy link
Contributor

One solution might be to always identify bridges by their matrix-side bridge bot. Then errors sent by the bridge bot are inherently associated with the bridge in question, and things like human-readable network identifiers and the set of affected users can be determined robustly based on that information.

This was my assumption about how it was supposed to work, and we should codify this in the proposal unless @V02460 has other ideas.

I had a proposal a long time ago which mapped bot mxids <=> bridge metadata in the room state.

@Ralith
Copy link
Contributor

Ralith commented Jul 10, 2019

In particular, in the presence of such a mechanism I think both the "network" and "affected_users" keys would be best replaced by indirect lookup through that same mechanism, because it provides a single necessarily consistent source of truth.

Copy link
Member

@turt2live turt2live left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very much in the right direction! It'd be good to also cover why EDUs, state events, or toDevice messages weren't picked as alternatives in the Tradeoffs section.

proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved

Additional information contained in the event are the name of the bridged
network (e.g. “Discord” or “Telegram”) and a regex¹ describing the affected
users (e.g. `@discord_*:example.org`). This regex should be similar to the one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to define what kind of glob it is then, as we use two types in Matrix. Also if it's not a regex, don't call it a regex.

Being able to reuse the namespace from the registration is a strong argument for supporting proper regex here (which is better defined than globs).

by the bridge. It is used as a fallback when there is no other more specific
reason.

* `m.event_too_old` When the foreign network does not support timestamp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh as a bridge operator I'd rather drop the message on the floor than bother with signalling this error. I personally already run my bridges such that anything older than 15 minutes doesn't even get sent to the bridges to reduce traffic, useless processing, and context in the foreign network. Although there's the occasional question about why some messages don't get sent, I don't think it's worth spending the homeserver's time to advertise the failure.

Given these errors are associated with individual events, and time differences between events usually only happen after a major homeserver problem, this is just asking for a self-inflicted denial of service when the bridge is supposed to send thousands of errors into rooms. It's not very often that one or two events are too old - most of the time it's batches of hours worth of conversation.

tldr: I don't think we need this error code as it leads to self-inflicted denial of service.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mixed on this, I absolutely agree with @turt2live that this definitely going to explode things though I do also see the benefit in signalling why a bunch of messages were never sent. The best thing I can come up with is a special event that defines the boundaries between a bridge stopping the processing of events, and beginning.

However, I don't think you can just specify a start and end point in terms of the dag and you'd need to list every missed event :/

proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
When the foreign network is not the cause of the error signaled but the bridge
itself (maybe under load), there might be an argument that responding to failed
messages increases the pressure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think this is something that should be addressed or at least mentioned in the proposal. If the homeserver cannot send the event to the bridge, it should send an error event on its behalf (which the bridge can later redact).

enables bridges to communicate that something went wrong and gives clients the
option to give feedback to their users.

## Proposal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal does not cover how bridges de-flag errors (eventual success in sending a message). I am assuming they redact their original error event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this proposal is only handling the case of when the bridge gives up trying to send.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sounds sub-par tbh. We'd need a retry mechanism so that users aren't left stranded, or at the very least support redaction as a way to indicate clearing of the error.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a good chance that a message will eventually be delivered, I don't think it belongs in this proposal. We should try to unify that case with a general “delivery delay notification” solution for the whole Matrix universe so the work has to be done only once. I am currently writing a bit about what I have in mind about those “delivery delay notifications” and there can discussion about that as well. (Also not quite sure where to have it then.) In the case of a message not being delivered with a high probability and just backing off in rare circumstances, redacting a permanent error might be adequate.

Until now I assumed the error is final and there is no retry, just a manual resend of the message. Could a bridge get the redaction and refetch the original event? Or might it be possible to simulate a resend with a no-op edit? If there is no satisfying way already, one could of course add another event type which is ignored by everyone but the bridge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Homeservers are expected to keep retrying the appservice until it comes back alive, but that can easily be hours or even days before the service responds. Most bridges nowadays have a condition for received messages where ti just drops messages which are too old, but between the time the bridge went down and the time the message was ignored the user's message was not delivered without notification.

Limiting the scope of the proposal to just fatal errors doesn't really help with communicating the bridge's status because there's many more temporary failures that people expect to hear about due to the nature of realtime communications. It's bad enough we already get complaints when it takes more than 10 seconds to send a message through 4 different points of failure to the remote network.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am coming round to the idea of sending a temporary failure PDU for things which have failed to send and are in a retry queue of some kind. Redacting that would imply it's been sent.

Separately there is a question of if this proposal should cover how the user can indicate they want to retry a message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully coming around enough to give the OK 😇

I'd be uncomfortable with this going into the spec if it only communicated permanent failures, because permanent failures are rare.

enables bridges to communicate that something went wrong and gives clients the
option to give feedback to their users.

## Proposal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal doesn't explain who sends the error event. Is it the sender_localpart? a user in the namespace? Both of these options have consequences which make them non-ideal: the sender_localpart might not be in the room in the case of a puppet bridge and a user in the namespace is an undefined thing: anyone can claim they are in a namespace.

Copy link
Author

@V02460 V02460 Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing we want here is that “the bridge” does send the message, which is not a concept that maps straight to Matrix afaik. Instead we always need a proxy user for the bridge. There are two parts to get this right: Who is eligible to represent the bridge and how to make sure this info came from the bridge? This maps to the problems of authorization and authentication. @Half-Shot mentioned he had a proposal for this via the room state, so it might be a good idea to piggyback on that. (If it is a proposal I assume there is nothing else usable for us out there.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm failing to see the correlation between this and authorization for bridges (I also don't know what proposal that is). Bridges have a namespace of users and a sender_localpart: which user is supposed to send it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One user representing the bridge does send the message. It depends on who is in the room, so the answer is both of them.

We can't simply say the bridge bot user as it is sometimes not joined e.g. in 1:1 conversations. Then the virtual user of your communication partner does represent the bridge and it should send the bridge error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes problems (as does the regex later on) because clients won't be able to do sanity checking on errors. They don't have a concept of bridges or appservices, and would be unable to see that @alice:example.org doesn't have appropriate permissions over the Freenode IRC bridge.

enables bridges to communicate that something went wrong and gives clients the
option to give feedback to their users.

## Proposal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposal should cover how much we care about random users impersonating bridges or bridges lying about their namespaces, and how we protect against that if we do care (we should).

proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
enables bridges to communicate that something went wrong and gives clients the
option to give feedback to their users.

## Proposal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am coming round to the idea of sending a temporary failure PDU for things which have failed to send and are in a retry queue of some kind. Redacting that would imply it's been sent.

Separately there is a question of if this proposal should cover how the user can indicate they want to retry a message.

by the bridge. It is used as a fallback when there is no other more specific
reason.

* `m.event_too_old` When the foreign network does not support timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mixed on this, I absolutely agree with @turt2live that this definitely going to explode things though I do also see the benefit in signalling why a bunch of messages were never sent. The best thing I can come up with is a special event that defines the boundaries between a bridge stopping the processing of events, and beginning.

However, I don't think you can just specify a start and end point in terms of the dag and you'd need to list every missed event :/

proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Show resolved Hide resolved
@turt2live turt2live self-requested a review August 3, 2019 17:07
proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
The need for this proposal arises from a gap between the Matrix network and
other foreign networks it bridges to. Matrix with its eventual consistency is
unique in having a message delivery guarantee. Because of this property there is
no need in the Matrix network itself to model the failure of message delivery.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably de-scope this for the sake of this MSC landing any time soon.

The need for this proposal arises from a gap between the Matrix network and
other foreign networks it bridges to. Matrix with its eventual consistency is
unique in having a message delivery guarantee. Because of this property there is
no need in the Matrix network itself to model the failure of message delivery.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we are de-scoping, the MSC text needs to be updated to reflect that and the edge cases that, like @turt2live mentioned, can still be encountered.

proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Show resolved Hide resolved
proposals/2162-signaling-errors-at-bridges.md Outdated Show resolved Hide resolved
When the foreign network is not the cause of the error signaled but the bridge
itself (maybe under load), there might be an argument that responding to failed
messages increases the pressure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex parser. Additionally sending arbitrary complex regexes might make Matrix
more vulnerable to DoS attacks. To mitigate these risks it might be sensible to
only allow a more restricted subset of regular expressions by e.g. requiring a
maximal length or falling back to simple globbing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be worked out in this MSC, as we'll state in the spec whether a field supports full regex or only simple globbing.

@Half-Shot would a bridge ever need more than globbing for calling out affected users? Currently application service registration allows for full regex parsing (https://matrix.org/docs/spec/application_service/unstable#registration). But this is on the bridge side, and thus if it kills the homeserver, it was the homeserver operator that was at fault for using a bad registration file. Things are entirely different from the C-S API side.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If using globbing, there would be a conversion needed from the AS regex, which it should be based on. As the regex language is more powerful than globbing, some simplifications/hacks/heuristics are required there. Or having the bridge user add it in two different forms manually…

@Ralith
Copy link
Contributor

Ralith commented Aug 6, 2019

Was there any interest in a more robust and rigorous approach to determining affected users, as previously discussed?

Kai A. Hiller added 2 commits August 6, 2019 10:22
@V02460
Copy link
Author

V02460 commented Aug 6, 2019

Was there any interest in a more robust and rigorous approach to determining affected users, as previously discussed?

I added notes about MSC 1410: Rich Bridging now, which provides exactly that. If it were already in The Spec, it would definitively be what should be used and this is the nicer way to approach it imo. The obvious hurdle here is that the MSC is not accepted yet and will probably need some more work.

Currently I am ranking the two benefits we would get from it as moderately important, but want to hear if others agree with my assessment.

A way forward could be to implement the current behavior and swap out the parts which rely on MSC 1410 when it is ready. I think adding the new changes would be backwards compatible, so they could be just tucked on later.

between informing the user and preventing unnecessary spam. Throwing this
error only for some subtypes of an event is fine.

* `m.bridge_unavailable` The homeserver couldn't reach the bridge.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a subclass to this which has shown to be problematic in recent weeks: When the homeserver is also dead, the users on other homeservers will see the message as delivered when in fact it is not.

I don't know if it makes total sense here given the traffic concern, but maybe flipping this proposal around for positive reactions to messages when they are delivered? Maybe a new kind of m.receipt for this specific purpose.

or maybe we train the general public that the bridge sending a read receipt is fine?

Presumably these ideas have already been covered, so I'm curious as to what the decisions were that led to it not being used.

* `m.no_permission` The bridge wanted to handle an event, but didn't have the
permission to do so.

The bridge error can provide a `time_to_permanent` field. If this field is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or permanent: true instead?

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\
¹ Or similar – see [Security Considerations](#security-considerations)

### Retries and error revocation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retries might be better suited for a dedicated MSC given the complexity.

Utilizing the rights system of the room provides a good approximation to this
behavior. It is fine to use it under the assumptions that

- `m.bridge_error` and `m.bridge_error_revoke` require admin power levels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by default they don't, unless you are expecting this to go into a whole new room version (which is a much harder sell)

behavior. It is fine to use it under the assumptions that

- `m.bridge_error` and `m.bridge_error_revoke` require admin power levels.
- there is always the bridge bot user or a virtual user in the bridge's
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clients cannot check this

In short, this requires giving bridges admin power levels in a room and trusting
them to restrict their actions to their own business. It is enough to have one
privileged bridge user in the room. In public rooms this is most commonly the
bridge bot user with admin power level available and in 1:1 conversations it is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More commonly the bridge does not have any kind of power in the room. When bridges are admins, they are often added through Scalar which makes this decision for them - the bridges themselves do not acquire power to operate. There's also plenty of bridges which are not represented in Scalar, which has lead to a majority of rooms not having appropriate permissions for all bridges.

There is no need for a new endpoint as the existing `/send` endpoint will be
utilized.

Additional information contained in the event are the name of the bridged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can actually drop this now, when #2346 gets merged :)

@turt2live turt2live added the kind:feature MSC for not-core and not-maintenance stuff label Apr 20, 2020
@turt2live turt2live added the needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. label Jun 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:feature MSC for not-core and not-maintenance stuff needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. proposal A matrix spec change proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants