Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC3401: Native Group VoIP Signalling #3401
base: main
Are you sure you want to change the base?
MSC3401: Native Group VoIP Signalling #3401
Changes from 3 commits
05fd5af
7f5ee49
083fd9a
5ee96fb
b90b85e
ed37a0d
33a64f2
7fd1ba6
669d471
48526ad
dfd4ffe
3c306cc
4d43aae
856ddc7
d109b54
07f9547
7a06ed7
32f566a
3fde32b
05b5db2
43dc42f
5635cee
b8ebe27
6b98d66
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bikeshedding warning: I'm relatively new to the WebRTC/VoIP industry, but I have never heard the term focus used in place of SFU. Is this a commonly known term? Should we be using SFU in this spec instead? Including renaming
m.foci
->m.sfus
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason i originally went with
foci
is because the field originally described the (mxid, deviceid) tuples where a given mxid could be contacted - which might either be a local device (for full mesh) or an SFU.However, in the current simpler draft, the only time you include this field is if you are using a conferencing focus of some kind.
But, this proposal is not meant to just be for SFUs - the device you use to focus together your view of the conference could (in future) equally be an MCU as much as an SFU. Hence using the correct more generic term of 'focus' rather than making it specific to SFU technology. For instance, the server could advertise a stream which composites together a mosaic of different feeds for a non-E2EE call... at which point it's acting as a (hybrid) MCU.
The term 'focus' comes from SIP (e.g. https://datatracker.ietf.org/doc/html/rfc3840#section-10.18) and is the standard term there for "an endpoint you connect to which mixes together other endpoints". I'm slightly inclined to keep it, to keep thing flexible for future more sophisticated foci tech.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call it
call_focus
orstream_focus
or something a bit more descriptive than a not-well-known dictionary word?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
focus is a pretty well-known word, and foci is its plural. i don't particularly want to call it 'focuses', given that's a different word (the 3rd person present form of 'to focus'). not sure this is a showstopper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely isn't a showstopper but I would like to come up with a better name if we can. It is also a bit of a red-flag that just about everything else in the MSC is calling it a SFU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While focus is a well-known word, outside of Britain its plural is 'focuses', so I would expect that a lot of people are going to be similarly confused over its meaning. Even the Cambridge Dictionary lists 'focuses' as the plural, while listing 'foci' as the formal plural in the UK.
Might it be possible to at least mention in the spec that it's used in this sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm coming around to using "foci" as the word and there are references out there in the wild for "foci" being used in SIP terminology
https://books.google.com/books?id=CyYAEAAAQBAJ&pg=PT66&lpg=PT66&dq=SIP+%22foci%22&source=bl&ots=zMu58i8hrj&sig=ACfU3U2VD7ts63JbE1HXkjWNoVRXi_3prA&hl=en&sa=X&ved=2ahUKEwjg9cT_j7r2AhUyGDQIHTtpCFoQ6AF6BAgCEAM#v=onepage&q=SIP%20%22foci%22&f=false
https://datatracker.ietf.org/doc/html/rfc4575#section-3.8
I think we should keep foci.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if these two are basically the same thing with different push rules? Is this influenced by push rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like there is a sort of intersection. I can see a use case where in the same room we may have "weekly sync" where we should buzz everyone and "debugging session" where people may drop in. Of course there are some rooms where I may not care about
m.ring
.Maybe it is better to rephrase this as "priority"? Intent is very vague. Intent for what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess you could put this distinction into push rules, but it seems a bit simpler (especially given what a mess push rules are) to make it explicit here. After all, the difference between ringing and conferencing is not just the type of push notification you receive, but the whole UX (e.g. CallKit on iOS, or whether you display a dedicated ringing UX etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While that is true, I feel like we shouldn't use something that are not push rules for influencing notifications
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ara4n do you see any resolution to this? I agree, it probably should be separate to the push rules. Implementations should use
m.ring
as the first clue as to whether or not to ring a device and push rules should apply on top of it. Them.ring
type also defines what UI to render in a client.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make sense to use intentional mentions for this. If none are included, it's like a conference call. Otherwise, in conjunction with the fact that it's a call event, the client would know to start ringing and not just pinging. To ring everyone in a room, you'd simply mention
@room
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is worth considering though, the UX nightmare might not be that bad (some clients might even work entirely with this possibility), and personally i think that putting the conf ID in a sub-field is just asking for problems (if the previous call information gets overridden by a person sending another state event for a "new" call while the last one is still in-progress.)
Why not move conf_id into the state_key, currently declare multiple calls UB and unsupported, while noting that speccing it and properly seating it would be a case for a future MSC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-opening this one because we've just had a glare-like bug on Element Call where multiple people entered the call at the same time (as you do) and multiple conferences got created in the same room. In general, we're going to want some way to handle glare of several people hitting the 'start conference call' button at the same time. Allowing multiple calls in a room means we need to handle this somehow. It's not impossible (eg. we could define some common ID for 'the' call in a room allowing you to use other IDs for other calls?) but I'd just like to check that we really want to deal with this complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also very much in favour of having the
state_key
be just""
because having multiple group calls in one room often leads to more problems rather than benefitsWith MSC3985 we now also have a separate method to create break-out rooms, so it feels like multiple calls in one room are no longer necessary
I also think we should be able to use the
m.termintated
to calculate the call lengthThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is still an issue with relying on
m.terminated
to determine the call length: If a client wants to display a timeline tile with the duration at the point where the call was ended, then it works, but if clients want to display the tile at the point that the call was started (like Element Web does), and we're reusing the same state key for all calls, it's difficult to get the duration from that event. In fact, if there's a call ongoing in the room, there's no way to tell whether a given call event is part of the current call or not, short of crawling the timeline, so clients won't know whether to label it with "call ended".With separate state keys, this is a lot easier, because it gives you a way to efficiently look up the current state of any call, current or historical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to worry about clients setting a conf state event and then falling off a cliff, leaving the user stuck as if they're in a conference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think https://github.com/matrix-org/matrix-doc/pull/3401/files#diff-a6d4763a98532010d751fbec7c898cb0fe27a757323befd0a1eb4ca443bbca4dR177 answers that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that line handles clients setting an m.call.member event and falling off a cliff. if a client sets an m.call event and falls off a cliff, however, anyone with permission to overwrite the state event in the room can go and remove the 'stuck' call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertlong pointed out earlier that this doesn't help us if you want to know who's in a call before you join it (e.g. for showing a facepile on the roomlist or whatever). One solution could be for each client interested in checking to ping the participants via to-device keepalives, but this could get very busy (e.g. if someone starts a call in a room with 20,000 users, every online user will promptly send to-device msgs to the participants in a full mesh call to check whether they're there or not. this is also a privacy problem).
An alternative could be to not support this for full mesh calls, but instead if there's a SFU keeping track of the users participating in a call, the SFU could publish this to all interested users via to-device message. This could still be very busy though (and leaks to the SFU who's online).
Another alternative might be for the SFU to publish into the room timeline updates to the
m.call
event as users join and part (assuming the SFU has permission to write to the room). This same trick could be then be used in full-mesh by call participants to (roughly) track who else is in the call.Or we could have a dedicated per-room presence API to try to track who's online, and assume that if they're online and they claim to be participating in the call, then they're at least attempting to be present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The biggest issue I see would be spam and processing times, a lot of users joining and leaving the call at once, being propagated as dozens of state events in a room, could be a vector for abuse, by loading participant servers with unnecessary stateres and such.
I also have concerns wrt privacy, if users leave behind information after the call for when they've joined or left.
The best solution i'd see is to-device between SFUs, and to-device between SFUs and corresponding users, and have those share the load of "who's on this call right now". Have SFUs exchange information instantly, while when sharing information with users, having a small delay of aggregation (a second or such) before pushing changes in membership in the call. This may have a degraded UX experience if someone joins and starts talking in the call within that second, or within the delay of to-device, but I think that at least call membership shouldn't be stored permanently.
A SFU might even then, for example, opt to delay pushing membership information to its corresponding clients if the server is under heavy load (detected or inferred through one way or another).
All of this (using to-device) reduces the effect such an abuse vector might have (compared to state or normal events).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the idea here is that clients should identify the sender (user_id and device_id) of the to_device through the
sender_key
of the olm-encrypted message to known which peer the message is from in a full-mesh group call? Perhaps it could be valuable to spell that out.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the spec to include this now, we send the sender's
device_id
along in them.call.invite
event. But I think this is a better method? I haven't seensender_key
yet, could you point me to the docs on that? Also we're not using olm-encryption just yet so it might still be better to include thedevice_id
field in the content? Not sure.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can read about
sender_key
here and here. The idea is that you query the keys for the given user (if not fetched already) and verify that the keys match the sender_key of the olm message you received. The advantage of doing it this way is that the user id and device id can almost not be spoofed, assuming you have marked the other device/user as trusted, either manually or through cross-signing.Perhaps the current impl doesn't encrypt with olm yet, but does it make sense to spec that? Is there a good reason to offer a non-encrypted version of the signalling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Managing these streams via start/stop events seems a little prone to failure. Would it be easier to send the entire list of streams you wish to receive? This should be sufficiently small that the payload would never get too big and I think both the client/SFU logic would be simpler to manage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reason for not doing this is that switching streams can happen very rapidly (e.g. the client could request different streams as they receive different active speaker notifications), and the idea of sending the whole list of all streams you care about every time just feels like a big waste of bandwidth. If you're in a big cascading conference with thousands of users (which this architecture could support!) do you really want to list out all the stream IDs when you want to switch from one speaker to the next?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I don't like about this proposal, is that it uses quite a few unencrypted state events. If you join a conference, you are leaking metadata about
Normal calls are not affected by that, because they don't use state event. State events of course make it easier to track, that a room is a conference room or similar, but they currently can't be encrypted and calls are imo somewhat more sensitive metadata. Verification gets around that by using relations instead of state events.
I currently can't think of a good alternative to state events and maybe one day we will get magic encrypted state events, that none figured out so far. But maybe someone has an idea or we could at least call out this issue in the encryption, potential issues or security sections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair point. i'm assuming we will have magic encrypted state events sooner or later, however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. o O ( make
m.call
andm.call.member
state events with no body, but a state_key which contains an event_id for a timeline E2EE event. clients then callGET /event
on the event_id in the state_key of the state event in order to grab the encrypted contents of the event in question :P )There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or actually, keep the same state_keys as before, but just have the contents be
{ "encrypted": "$event_id" }
. (shamelessly stolen from @turt2live)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...which has now turned into #3414
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mark this resolved? Should we add MSC3414 to the spec?