9cbe448
draft
toger5 May 10, 2024
b2b4e5e
lowercase filename
toger5 May 13, 2024
8c1340a
add note about this relying on MSC3757
toger5 May 14, 2024
8c800a6
remove the m.prefix from fields.
toger5 May 17, 2024
1dcbfce
update
toger5 Jun 5, 2024
813a21a
add foci_preferred well known section
toger5 Jun 6, 2024
d65e42e
update to reference to msc [MSC4158](https://github.com/matrix-org/ma…
toger5 Jun 20, 2024
4ab679a
add a "Reliability requirements for the room state" section
toger5 Jul 2, 2024
6d84256
use current state key format
toger5 Sep 11, 2024
d68e942
add `expires_after`
toger5 Sep 11, 2024
3b19b49
add section about send order (delayed leave event -> join event)
toger5 Sep 11, 2024
a278ff8
language fixes
ara4n Nov 25, 2024
f97e4e6
clarify that non-empty delayed membership events are invalid
ara4n Nov 25, 2024
b626d51
Latest
hughns Dec 17, 2024
ec9fa8b
json5 for legibility
hughns Dec 17, 2024
50db42a
be more specific on `member.id`
toger5 Jul 30, 2025
9bc444c
add id to session
toger5 Jul 31, 2025
33231dc
calrify that the state key cannot be used as the source of truth for …
toger5 Jul 31, 2025
cea4cdd
Update proposals/4143-matrix-rtc.md
toger5 Aug 8, 2025
d61969a
clarify what is part of this MSC and what can be found in other MSC's
toger5 Aug 26, 2025
4ad3961
major rewrite addressing a lot of feedback from offline discussions.
fkwp Oct 14, 2025
1fbd843
json5 code block formats
toger5 Oct 14, 2025
60e23d9
add unstable m.rtc.slot prefix.
toger5 Oct 14, 2025
680ef7d
fix grammar & link the MSC
ara4n Oct 14, 2025
278 changes: 278 additions & 0 deletions proposals/4143-matrix-rtc.md
@@ -0,0 +1,278 @@
# MSC4143: MatrixRTC

This MSC defines the modules with which the MatrixRTC (Matrix Real Time Communication) signalling system is built.

The MatrixRTC specification is separated into different modules.

- The MatrixRTC room state that defines the state of the real time application.\
  It is the source of truth for:
  - Who is part of a session
  - Who is connected via what technology/backend
  - Metadata per device used by other participants to decide whether the streams
    from this source are of interest / need to be subscribed.
- The RTC backend.
  - It defines how to connect the participating peers.
  - LiveKit is the standard for this as of writing.
  - Defines how to connect to a server/other peers, how to update the connection,
    how to subscribe to different streams...
  - Another planned backend is a full mesh implementation based on MSC3401.
- The RTCSession types (application) have their own per application spec.
  - Calls can be done with an application of type `m.call` (TODO: link call MSC).
  - The application defines all the details of the RTC experience:
    - How to interpret the metadata of the member events.
    - What streams to connect to.
    - What data in which format to send over the RTC channels.

This MSC will focus on the Matrix room state, which can be seen as the
highest-level signalling of a call.

## Proposal

Each RTC session is made up of a collection of `m.rtc.member` state events.
Each `m.rtc.member` event defines the application type (`application`)
and a `call_id`.
The first element of the state key is the `userId` and the second the `deviceId`.
(See [this proposal for state keys](https://github.com/matrix-org/matrix-spec-proposals/pull/3757#issuecomment-2099010555)
for context about the first/second state key element.)

### The MatrixRTC room state

Everything required for a working MatrixRTC setup
(current session, session history, join/leave events, ...)
requires only one event type.

A complete `m.rtc.member` state event looks like this:

```json5
// event type: "m.rtc.member"
// event key: "@user:matrix.domain_DEVICEID"
{
  "application": "m.my_session_type",
  "call_id": "",
  "device_id": "DEVICEID",
  "created_ts": Time | undefined,
  "expires_after": Duration,
  "focus_active": {...FOCUS_A},
  "foci_preferred": [
    {...FOCUS_1},
    {...FOCUS_2}
  ]
}
```

> [!NOTE]
> This relies on [MSC3757](https://github.com/matrix-org/matrix-spec-proposals/pull/3757).
> We need to have one state event per device, hence multiple "non-overwritable" state
> events per user.

This gives us the information that the user `@user:matrix.domain` with device `DEVICEID`
is part of an RTCSession of type `m.my_session_type` in the scope/sub-session `""` (empty
string as call id), connected over `FOCUS_A`. This is all the information another
room member needs to detect the running session and join it.

We include the `device_id` in the member content so as not to rely on the exact format of the state key.
In case [MSC3757](https://github.com/matrix-org/matrix-spec-proposals/pull/3757) is used, it would not
be the second element of the state key array.

`created_ts` is an optional property that caches the time of creation. It is not required
for an event that has not yet been updated; in that case the `origin_server_ts` is used.

> [!NOTE]
> We introduce `created_ts()` as the notation for `created_ts ?? origin_server_ts`

Once the event gets updated, the `origin_server_ts` needs to be copied into the `created_ts` field.
An existing `created_ts` field implies that this is a state event updating the current session
and a missing `created_ts` field implies that it is a join state event.
All membership events that belong to one member session can be grouped with the index
`created_ts()`+`device_id`. This is why the `m.rtc.member` events deliberately do NOT include a `membership_id`.
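
As a rough illustration of the `created_ts()` notation and the grouping index described above, the following Python sketch assumes events are plain dictionaries in the usual client-server shape (`content`, `origin_server_ts`). The helper names `effective_created_ts` and `group_member_sessions` are hypothetical and not part of this proposal.

```python
from collections import defaultdict


def effective_created_ts(event):
    """Return `created_ts ?? origin_server_ts` for an m.rtc.member event."""
    created = (event.get("content") or {}).get("created_ts")
    return created if created is not None else event["origin_server_ts"]


def group_member_sessions(events):
    """Group membership events by the index (created_ts(), device_id).

    Empty (leave) events carry no device_id in their content, so this
    sketch skips them; that choice is an assumption of the example.
    """
    groups = defaultdict(list)
    for event in events:
        content = event.get("content") or {}
        if "device_id" not in content:
            continue
        groups[(effective_created_ts(event), content["device_id"])].append(event)
    return dict(groups)
```

A join event (no `created_ts`) and a later update that copied the join's `origin_server_ts` into `created_ts` end up under the same index.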

Other than the membership sessions, there is **no event** to represent an RTC session (containing all members).
Such an event would include shared information, and deciding who has authority over that is not trivial.
Member

I continue to trip over whether it is wise to force clients to read all possible m.rtc.member events to figure out if a call is happening, and who created it.

I /think/ that a better reason for not having an m.rtc state event describing the existence of an RTC session is that you'd have to handle disconnection semantics on it similar to delayed-events for m.rtc.member... at which point, why not leverage the membership events?

However, it still feels REALLY weird to not have something in state telling you whether semantically a call is intended to be happening now (and what that sort of call is, when it began, and who initiated it) - versus having to infer it.

Member

Alternatively, if the thought experiment is "what if two users both create an m.rtc state event on different forks of the DAG at the same time?" ... is that really so bad? and does aggregating m.rtc.member state actually make it better? if so, how?

In other words, we actually need to justify the lack of m.rtc state event much better here, imo. In particular, having somewhere to store the metadata about the call at the point of creation (its name, its ID, whether it's intended to be a voice/video room or a group call or a conference, etc)

Author

There is a large list of reasons and I 100% support the idea to go into detail in the MSC to justify this approach.

As for this comment I will just give a couple of short examples/arguments:

The security and "trolling" surface is huge. If we have one state event we either limit who can create a call or we allow everyone to mess with the event. This can go from smaller issues like ending the call for fun to larger issues like changing where the call is happening without the members noticing it. (in a LK world at least)

But even if everyone plays fair, a call can be stopped by a delayed event because the creator failed to send the refresh event. This now disconnects everyone from the call.
Independent how we make things behave, if we have a public/shared event controlling the experience for everyone we switch from:

  • If there is a client with issues (or user that actively introduces issues) that user has a degraded experience
  • If there is a client with issues (or user that actively introduces issues) everyone experience can be broken.

In the context of matrix where there is no central entity controlling the clients the seconds seems to be the only valid option.

Instead, the session is a computed value based on `m.rtc.member` events.
The list of events with the same `application` and `call_id` represents one session.
From this list, fields such as participant count, start time, etc. can be computed.
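
A minimal sketch of that computation, assuming the current room state is available as a list of event dictionaries with `state_key`, `origin_server_ts` and `content`. The function name `compute_sessions` and the returned structure are hypothetical. Note that the start time derived this way is only the earliest join among the members still present, which is an approximation of the session start.

```python
def compute_sessions(member_events):
    """Group non-empty m.rtc.member state events into sessions.

    Sessions are keyed by (application, call_id). Events with empty or
    incomplete content are treated as "not joined" and skipped.
    """
    sessions = {}
    for event in member_events:
        content = event.get("content") or {}
        if "application" not in content or "call_id" not in content:
            continue
        key = (content["application"], content["call_id"])
        created = content.get("created_ts")
        joined_at = created if created is not None else event["origin_server_ts"]
        session = sessions.setdefault(key, {"participants": [], "start_ts": joined_at})
        session["participants"].append(event["state_key"])
        session["start_ts"] = min(session["start_ts"], joined_at)
    return sessions
```

The participant count of a session is then `len(session["participants"])`.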

Sending an empty `m.rtc.member` event represents a leave action.
Sending a well formatted `m.rtc.member` represents a join action.

Based on the value of `application`, the event might include additional fields
that carry further session parameters.

> A [thirdroom](https://thirdroom.io)-like experience could include the information of an approximate position
> on the map, so that clients can omit connecting to participants that are not in their
> area of interest.

#### Reliability requirements for the room state

Room state is well suited to storing the data for a MatrixRTC session, as
it allows:

- The client to determine current ongoing sessions without loading history for every room,
  or doing additional work other than the sync loop that needs to run anyway.
- The client to compute/access data of past sessions without any additional redundant data.
- Sessions (start/end/participant count) to be federated without redundant data storage that
  could result in conflicts or get out of sync. The room state events are part of the DAG, and this
  is solved like for any other PDU in Matrix.

A challenge with using the room state to represent a session is disconnection behaviour.
If the client disconnects from a call because of a network issue,
an application crash, or a user forcefully quitting the client - then the room state cannot be updated any more.
The client is required to leave by sending a new empty state which cannot happen once connection is lost.

If the state is not updated correctly we end up with incorrect session end timestamps, and a room state that is not
correctly representing the current RTC session state. Historic and current MatrixRTC session data would be broken.

For an acceptable solution, the following requirements need to be taken into consideration:

- Room state is set to empty if the client loses connection. (A heartbeat-like system is desired.)
- The best source of truth for call participation is a working connection to the SFU.
  It is desired that a disconnect from the SFU is reflected in the room state.
- It should be possible to update the room state without the client being online.
- All of this should still work when Matrix uses cryptographic identities (e.g.
  [MSC4080](https://github.com/matrix-org/matrix-spec-proposals/pull/4080)).

[MSC4140](https://github.com/matrix-org/matrix-spec-proposals/pull/4140) proposes a concept to
delay the leave events until one of the leave conditions (heartbeat or SFU disconnect) occurs,
and fulfils all of these requirements.

A MatrixRTC client has to first send/schedule the following delayed leave event:

```json5
// event type: "m.rtc.member"
// event key: "@user:matrix.domain_DEVICEID"
{
  "leave_reason": "CONNECTION_LOST"
}
```

Member

this should be namespaced. what other leave reasons are there?

Subsequently, the actual state event can be sent, so that we guarantee that the state will be empty eventually.
The `leave_reason` is added so clients can be more explicit about why a user disconnected from a call.

Receiving clients can detect whether the delayed event request was recognised by the presence of the
`has_delayed_overwrite: true` unsigned property. If the property is missing, the event is invalid.

This also invalidates delayed leave events that are sent with valid membership content: they do not contain the
`has_delayed_overwrite: true` unsigned property.
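
The validity rule above could be checked on the receiving side roughly as follows. This is a sketch under assumptions: events are dictionaries with `content` and `unsigned`, the helper name `is_valid_join` is hypothetical, and treating empty content as "not a join" (rather than as invalid) is a choice made for this example.

```python
def is_valid_join(event):
    """Return True if a non-empty m.rtc.member event is backed by a delayed leave.

    The check relies on the `has_delayed_overwrite: true` unsigned property
    described in the text. Empty content is a leave, so it is never a valid join.
    """
    content = event.get("content") or {}
    if not content:
        return False
    unsigned = event.get("unsigned") or {}
    return unsigned.get("has_delayed_overwrite") is True
```

A client would ignore any membership for which this returns `False` when computing the current session.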

#### Historic sessions

Since there is no single entry for a historic session (because of the ownership ambiguity),
historic sessions need to be computed on the client.

Each state event can either mark a join or leave:

- join: `prev_state.application != current_state.application` &&
  `prev_state.call_id != current_state.call_id` &&
  `current_state.application != undefined`
  (where an empty `m.rtc.member` event would imply `state.application == undefined`)
- leave: `prev_state.application != current_state.application` &&
  `prev_state.call_id != current_state.call_id` &&
  `current_state.application == undefined`

Based on this one can find user sessions. The range between a join and a leave
event gives the specific times and duration of the session.
The collection of all overlapping user sessions with the same `call_id` and
`application` defines one MatrixRTC history event.
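
The join/leave rules can be sketched in Python as below. The example reads the call id from the `call_id` content field, as in the example member event, and assumes the input is the time-ordered list of `m.rtc.member` state events for a single state key. The names `classify` and `user_sessions` are hypothetical.

```python
def classify(prev_content, cur_content):
    """Classify a state transition as "join", "leave" or None (an update)."""
    prev_app, cur_app = prev_content.get("application"), cur_content.get("application")
    prev_call, cur_call = prev_content.get("call_id"), cur_content.get("call_id")
    if prev_app != cur_app and prev_call != cur_call:
        return "join" if cur_app is not None else "leave"
    return None


def user_sessions(events):
    """Return (join_ts, leave_ts) pairs for one state key's ordered events.

    A join without a matching leave (a session still running) is not returned.
    """
    sessions, prev, joined_at = [], {}, None
    for event in events:
        cur = event.get("content") or {}
        kind = classify(prev, cur)
        if kind == "join":
            joined_at = event["origin_server_ts"]
        elif kind == "leave" and joined_at is not None:
            sessions.append((joined_at, event["origin_server_ts"]))
            joined_at = None
        prev = cur
    return sessions
```

The duration of each user session is the difference between the two timestamps of a pair.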

### The RTC backend

`focus_active` and `foci_preferred` are used to communicate:

- how a user is connected to the session (`focus_active`)
- which connection methods this user knows about and would like to connect with (`foci_preferred`).

The only enforced parameter of a `foci_preferred` entry or `focus_active` is `type`.
Based on the focus type, a different number of parameters might be needed to
communicate how to connect to other users.
`foci_preferred` and `focus_active` can have different parameters, so that it is
possible to use a combination of the two to figure out that everyone is connected
with each other.

Only users with the same type can connect in one session. If a frontend does
not support the used type they cannot connect.
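
As a small illustration of the type constraint, the following sketch checks whether a client can join a session given the focus types it supports. It assumes the active focus is found under `focus_active` in the member content, as in the example member event. The function name `can_join` is hypothetical.

```python
def can_join(supported_focus_types, member_contents):
    """Return True if every focus type in use is supported by this client.

    A session without members (or without any `focus_active`) imposes no
    constraint in this sketch, so the result is True in that case.
    """
    used_types = {
        content["focus_active"]["type"]
        for content in member_contents
        if "focus_active" in content
    }
    return used_types <= set(supported_focus_types)
```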

Each focus type will get its own MSC, describing how to get from the foci
information to establishing WebRTC connections for all participants.

- [`livekit`](www.example.com) TODO: create `livekit` focus MSC and add link here.
- [`full_mesh`](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
TODO: create `full-mesh` focus MSC based on [MSC3401](https://github.com/matrix-org/matrix-spec-proposals/pull/3401)
and add link here.

#### Sourcing `foci_preferred`

At some point participants have to decide/propose which focus they use.
How to choose the `foci_preferred` can differ depending on the focus type and use case.
If possible these guidelines should be obeyed:

- If there is a relation between the `focus_active` and a preferred focus (`type: livekit` is an example for this),
  it is recommended to copy _the preferred focus that relates to the current `focus_active`_ of other participants to
  the start of the `foci_preferred` array of the member event.
  (The exact definition of _the preferred focus that relates to the current `focus_active`_ is part of the
  specification for each focus type. For `full_mesh`, for example, there is no such thing as _the preferred focus that
  relates to the current `focus_active`_.)
- Homeservers can propose `foci_preferred` via the well-known. An array of preferred foci is provided behind the
  well-known key `m.rtc_foci`. This is defined in [MSC4158](https://github.com/matrix-org/matrix-spec-proposals/pull/4158).
  The two proposals are related, and it is recommended to also read
  [MSC4158](https://github.com/matrix-org/matrix-spec-proposals/pull/4158) with this MSC.
  Those proposals from **your own** homeserver should come next in the `foci_preferred` list of the member event.
- Clients also have the option to configure preferred foci, even though this is not recommended (see below).
  Those come last in the list.
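
The ordering described by these guidelines could be assembled as in the sketch below. The function name `build_foci_preferred` is hypothetical, the `url` field on the example foci is only illustrative (this MSC enforces nothing but `type`), and removing duplicates is an assumption of the example rather than a requirement of the proposal.

```python
def build_foci_preferred(related_to_active, homeserver_foci, client_foci):
    """Order preferred foci: in-use first, then homeserver, then client config.

    Each argument is a list of focus dictionaries. Duplicates keep the
    position of their first (highest priority) occurrence.
    """
    ordered = []
    for focus in [*related_to_active, *homeserver_foci, *client_foci]:
        if focus not in ordered:
            ordered.append(focus)
    return ordered
```

For example, a focus that is both in use by other participants and proposed by the homeserver appears once, at the front of the list.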

The rationale for these guidelines is:

- It is always desired to have as few focus switches as possible.
  That is why the highest priority is to prefer the focus that is already in use.
- MatrixRTC is designed around the same architecture as the rest of Matrix, where
  conversations are powered by many homeservers from across the network.
  To achieve a stable and healthy ecosystem,
  RTC infrastructure should be thought of as a part of a homeserver. It is very similar
  to a TURN server: mostly traffic and little CPU load.
  To end up in a world where the traffic is split over multiple SFUs, rather than one where
  every user uses one central SFU, it is important that we leverage the SFU distribution similarly to the
  distribution of homeservers.
  For this reason the second guideline is to look up the preferred foci from the homeserver's well-known.
- Sourcing the preferred foci from the client is toxic to a federated system. If the majority of users
  decide to use the same client, all of those users will use one focus. This destroys the passive security mechanism that
  each instance is not an interesting attack vector since it is only a fraction of the network.
  Additionally, it will result in poor performance if every user on Matrix uses the same focus.
  There are cases where this is acceptable:
  - Transitioning to MatrixRTC. Here it might be beneficial to have a client that has a fallback focus,
    so calls also work with homeservers not supporting it.
  - For testing purposes, where a different focus should be tested but one does not want to touch the well-known.
  - For custom deployments that benefit from having the focus configuration on a per-client basis instead of per homeserver.

### The RTC Session types (application)

Each session type can have its own specification in how the different streams
are interpreted and even what focus type to use. This makes this proposal extremely
flexible. For instance, a Jitsi conference could be added by introducing a new `application`
and a new focus type and would be MatrixRTC compatible. It would not be compatible
with applications that do not use the Jitsi focus but clients would know that there
is an ongoing session of unknown type and unknown focus and could display/represent
this in the user interface.

To make it easy for clients to support different RTC session types, the recommended
approach is to provide a Matrix widget for each session type, so that client developers
can use the widget as the first implementation if they want to support this RTC
session type.

Each application should get its own MSC in which all the additional
fields are explained and the communication with the possible foci is
defined:

- [`m.call`](www.example.com) TODO: create `m.call` MSC and add link here.
Member

@ara4n ara4n Nov 25, 2024

Something which is very unclear right now is the lifecycle of a call (and the draft of MSC4196 doesn't really help).

Specifically, what has happened to m.call.invite? m.call.answer? m.call.select_answer? m.call.hangup? The intention of MSC3401 was to keep using these (as per https://github.com/matrix-org/matrix-spec-proposals/blob/matthew/group-voip/proposals/3401-group-voip.md#basic-call) even when an SFU is involved, in order to preserve signalling for:

  • Call was placed but isn't ringing yet (not that we have in legacy VoIP)
  • Call is ringing
  • Caller gave up and stopped ringing
  • Call has been answered on a device
  • Call was rejected locally
  • Call was rejected on all devices
  • Call was answered
  • Call was hung up

i.e. all the state machine transitions that go into the lifecycle of a call.

I'm very worried that this seems to have been collapsed to a single m.call.notify event from MSC4075 + m.rtc.member events here, which seems to be missing most of the above. For instance, I can't see how the caller tells the callee that it's no longer ringing them - not to mention stuff like early media (MSC3635) or ringback tones.

Legacy VoIP was already a dumbed down version of SIP, and I'm worried that the lack of signalling semantics here is going to make mimicing SIP or PSTN semantics really hard (let alone bridging to it), and would be a backwards step from legacy voip.

Author

@toger5 toger5 Nov 25, 2024

The reason that information about this is so thin here is that this proposal tries to define the line between matrixRTC and the special case matrix rtc use case of calls.

The focus should be to find the minimum requirements for a rtc session so clients can implement everything from this msc and they will be able to have a rough idea what is happening in a room.

Everything that is not applicable to all kinds of sessions should be part of their dedicated msc's.
Calling in particular has a very rich setup/notification cycle that is not necessarily part of every rtc session.

The idea very much is however to fully support this list:

  • Call was placed but isn't ringing yet (not that we have in legacy VoIP) !! This one is TBD
  • Call is ringing
  • Caller gave up and stopped ringing
  • Call has been answered on a device
  • Call was rejected locally
  • Call was rejected on all devices
  • Call was answered
  • Call was hung up

This proposal tries to not enforce this for MatrixRTC in general however.

Your comment makes me wonder if there is any benefit in reusing the already existing legacy calling events for m.call.invite m.call.answer m.call.select_answer m.call.hangup ...

Member

> this proposal tries to define the line between matrixRTC and the special case matrix rtc use case of calls.

are we sure this is not an unnecessary abstraction right now? we could add it later, if/when we have a use case for MatrixRTC which doesn't involve calling? and meanwhile keep the MSCs easier to follow and less fragmented?

Member

> Your comment makes me wonder if there is any benefit in reusing the already existing legacy calling events for m.call.invite m.call.answer m.call.select_answer m.call.hangup ...

There may not be any advantage to reusing those events, but it's very hard to tell when I can't figure out how those semantics map onto a MatrixRTC call life cycle today. My spidey sense is that having to infer the call state machine out of m.rtc.member and m.rtc.notify events is going to be fragile, hard to reason about, and hard to extend (e.g. how would you do early media, for compatibility with MSC3635?) - rather than throwing explicit events around when things happen. But am very happy to be corrected... especially if the rationale ends up in an MSC :)

Author

> are we sure this is not an unnecessary abstraction right now? we could add it later, if/when we have a use case for MatrixRTC which doesn't involve calling? and meanwhile keep the MSCs easier to follow and less fragmented?

Since the ring procedure is entirely detangled from the rtc signalling and the session participation I am not sure it really would make this MSC easier to understand. We would need to state very explicitly, that the lifecycle is entirely detangled and implementations cannot rely on it in any way to actually create the session.


## Potential issues
Member

The fact we don't reference how to tell users that a call is happening (i.e. m.call.notify) is very disorienting here.


## Alternatives

## Security considerations

## Unstable prefix

The state events and the well_known key introduced in this MSC use the unstable prefix
`org.matrix.msc4143.` instead of `m.` as used in the text.
Member

empirically we seem to be using org.matrix.msc3401.call.member rather than org.matrix.msc4143.rtc.member in Element?

Author

The current implementation is still using the msc3401 prefix. That is wrong and will be addressed. There is still the open topic of how exactly we want to do the state keys and the event ownership and on top of that we have plans for how to index rtc member events in a better way.
The reason we changed it to (...call... ->) ...rtc... is, that we need the call namespace for the particular video call matrixRTC application (session type) of calling over MatrixRTC. Using that word for both. Matrix rtc sessions in general and calls will be confusing in the long run.


Possible values inside the `m.rtc.member` event (like `m.call`) will use a prefix defined in the
related PR (TODO create and link `m.call` application type PR)
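
Read mechanically, the prefix rule for the state event type and the well-known key amounts to a string substitution. The sketch below illustrates only that reading; the helper name `to_unstable` is hypothetical, and existing implementations may use different identifiers.

```python
STABLE_PREFIX = "m."
UNSTABLE_PREFIX = "org.matrix.msc4143."


def to_unstable(identifier):
    """Map a stable identifier such as "m.rtc.member" to its unstable form."""
    if not identifier.startswith(STABLE_PREFIX):
        raise ValueError(f"not a stable identifier: {identifier!r}")
    return UNSTABLE_PREFIX + identifier[len(STABLE_PREFIX):]
```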