Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC3401: Native Group VoIP Signalling #3401
base: main
Are you sure you want to change the base?
MSC3401: Native Group VoIP Signalling #3401
Changes from all commits
05fd5af
7f5ee49
083fd9a
5ee96fb
b90b85e
ed37a0d
33a64f2
7fd1ba6
669d471
48526ad
dfd4ffe
3c306cc
4d43aae
856ddc7
d109b54
07f9547
7a06ed7
32f566a
3fde32b
05b5db2
43dc42f
5635cee
b8ebe27
6b98d66
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bikeshedding warning: I'm relatively new to the WebRTC/VoIP industry, but I have never heard the term focus used in place of SFU. Is this a commonly known term? Should we be using SFU in this spec instead? Including renaming
m.foci
->m.sfus
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason i originally went with
foci
is because the field originally described the (mxid, deviceid) tuples where a given mxid could be contacted - which might either be a local device (for full mesh) or an SFU.However, in the current simpler draft, the only time you include this field is if you are using a conferencing focus of some kind.
But, this proposal is not meant to just be for SFUs - the device you use to focus together your view of the conference could (in future) equally be an MCU as much as an SFU. Hence using the correct more generic term of 'focus' rather than making it specific to SFU technology. For instance, the server could advertise a stream which composites together a mosaic of different feeds for a non-E2EE call... at which point it's acting as a (hybrid) MCU.
The term 'focus' comes from SIP (e.g. https://datatracker.ietf.org/doc/html/rfc3840#section-10.18) and is the standard term there for "an endpoint you connect to which mixes together other endpoints". I'm slightly inclined to keep it, to keep thing flexible for future more sophisticated foci tech.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call it
call_focus
orstream_focus
or something a bit more descriptive than a not-well-known dictionary word?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
focus is a pretty well-known word, and foci is its plural. i don't particularly want to call it 'focuses', given that's a different word (the 3rd person present form of 'to focus'). not sure this is a showstopper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It definitely isn't a showstopper but I would like to come up with a better name if we can. It is also a bit of a red-flag that just about everything else in the MSC is calling it a SFU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While focus is a well-known word, outside of Britain its plural is 'focuses', so I would expect that a lot of people are going to be similarly confused over its meaning. Even the Cambridge Dictionary lists 'focuses' as the plural, while listing 'foci' as the formal plural in the UK.
Might it be possible to at least mention in the spec that it's used in this sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm coming around to using "foci" as the word and there are references out there in the wild for "foci" being used in SIP terminology
https://books.google.com/books?id=CyYAEAAAQBAJ&pg=PT66&lpg=PT66&dq=SIP+%22foci%22&source=bl&ots=zMu58i8hrj&sig=ACfU3U2VD7ts63JbE1HXkjWNoVRXi_3prA&hl=en&sa=X&ved=2ahUKEwjg9cT_j7r2AhUyGDQIHTtpCFoQ6AF6BAgCEAM#v=onepage&q=SIP%20%22foci%22&f=false
https://datatracker.ietf.org/doc/html/rfc4575#section-3.8
I think we should keep foci.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should glare be handled at the group call level in the case where multiple parties actually didn't meant to set up separate group calls in a room but just meant to call each other? For example, we could dictate that calls that have the same purpose and name should be able to replace each other in case of glare?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good question. Any idea @ara4n?
I think because the
m.call.invite
event includes theconf_id
this is a non issue? But we've also only defined them.call.invite
for group calls under to-device messages. I guess for them.ring
intent you also need to be able to send them.call.invite
with aconf_id
set as a regular message event?In any case, I think glare is a non issue for the
m.room
orm.prompt
intent types. You both created group calls and one of you needs to join the other in the UI. However, form.ring
that involves sending invite and if you both invite each other at the same time I think we should use the same glare resolution we have for regular calls in that we compareconf_id
values.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glare can happen with any call type though if two clients decide to set
m.call
at the same time, though. I suspect we should a) add an index somewhere to futureproof for more than one call per room, b) for two calls with the same index, tiebreak between them by prioritising the m.call event with the lexicographically lowest call ID.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
m.room.power_levels
state event specifies that posting state events requires a power level of 50 by default. From a user experience standpoint, I would think it is reasonable for normal users in a room to be able to start calls in that room by default, but with the current power_levels policy it would need them.call
power level set lower. It may be desirable for room creation UX in clients to present the option to set this level upfront.Perhaps there should be a way to specify a different power level requirement for different intents as well. A Discord user would expect to be able to start a room's call freely without disturbing other members of the room ala
m.room
intent. On the other hand, anm.ring
is a much more disruptive intent that would be reserved for smaller group chats and should not normally be allowed in other kinds of rooms.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imho outside of DMs (where both users have PL100 anyway usually) calls should not be allowed for normal users. It is still a vector of spam. Just imagine having calls being started in Matrix HQ. It would just cause issues imho.
Imho it is a sane default to restrict this and need active changes to allow it in a room.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I mean about the different call intents causing different levels of disruption. You're right, obviously
m.ring
has very different impact fromm.prompt
orm.room
and the default should be to disallow that. But a room's administration may want users to be able to start calls with one intent and not the other.Unless I'm misunderstanding the purpose of
m.room
? Is the idea form.room
intent that a room would always have a call "active", even if it has no participants, ala a "voice channel" in Discord, such that a level-0 user would typically not be able to end that call ergo not need to be able to publish state events for it other thanm.call.member
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is worth considering though, the UX nightmare might not be that bad (some clients might even work entirely with this possibility), and personally i think that putting the conf ID in a sub-field is just asking for problems (if the previous call information gets overridden by a person sending another state event for a "new" call while the last one is still in-progress.)
Why not move conf_id into the state_key, currently declare multiple calls UB and unsupported, while noting that speccing it and properly seating it would be a case for a future MSC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-opening this one because we've just had a glare-like bug on Element Call where multiple people entered the call at the same time (as you do) and multiple conferences got created in the same room. In general, we're going to want some way to handle glare of several people hitting the 'start conference call' button at the same time. Allowing multiple calls in a room means we need to handle this somehow. It's not impossible (eg. we could define some common ID for 'the' call in a room allowing you to use other IDs for other calls?) but I'd just like to check that we really want to deal with this complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also very much in favour of having the
state_key
be just""
because having multiple group calls in one room often leads to more problems rather than benefitsWith MSC3985 we now also have a separate method to create break-out rooms, so it feels like multiple calls in one room are no longer necessary
I also think we should be able to use the
m.termintated
to calculate the call lengthThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is still an issue with relying on
m.terminated
to determine the call length: If a client wants to display a timeline tile with the duration at the point where the call was ended, then it works, but if clients want to display the tile at the point that the call was started (like Element Web does), and we're reusing the same state key for all calls, it's difficult to get the duration from that event. In fact, if there's a call ongoing in the room, there's no way to tell whether a given call event is part of the current call or not, short of crawling the timeline, so clients won't know whether to label it with "call ended".With separate state keys, this is a lot easier, because it gives you a way to efficiently look up the current state of any call, current or historical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the user does not publish an(nevermind, saw the consideration at the bottom) Additionally, does a user convey they are leaving a call by publishing withm.call.member
event to leave a call?m.calls
as an empty array?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: actually track here whether the participant is joined to the call or not(!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we still have an issue with tracking participants for a given group call for displaying in the UI. How are we going to check who is in a call and scale it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These IDs are only accurate if announcing to an SFU - in full mesh, each separate call will have its own IDs.
Therefore we should probably scope these to a given 1:1 call ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, perhaps this is a bit of a privacy violation - why do other people in a conference need to know what my devices are called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, right. Should we remove it? (I also don't seem to find a case where we would want to let others know what are devices are called when publishing 🤔)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can't really see a good reason we'd send this. The
purpose
ought to be enough information.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these
settings
required for some matrix-specific logic? (maybe bridges that may need this data about a stream or a track?)I'm just wondering if they are useful in a general case (like why would we need the information about a
facingMode
of a camera for intsance?). I.e. when we're talking about the SFU, the SFU does not really need to know the facing mode of a user I guess and I'm also not sure if the other call participants would benefit from this information.The only case where I assume the information about camera mode etc might be useful is when there is a specific app that runs over Matrix and needs to advertise the properties of the video/audio streams in order to implement a specific logic. But in this case, we're talking about application-specific data, i.e. something that must be the logic of the app rather than part of a [generic] Matrix protocol.
I think generally we only need the stream and track IDs, a purpose (for the use case of conference / using WebRTC for calls), and, perhaps basic information about certain tracks like the width and height of the video (theoretically it's not required, because we'll be able to access it when the track is received, but practically we would need it for the simulcast implementation on the SFU side, so such information would be useful for the conference use cases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we specify here whether this counter is scoped to the
dest_session_id
or thecall_id
? E.g. should it be reset or not when a webrtc connection between two peers is being retried with a new peer connection? Or only be reset after refreshing the client? Hydrogen currently does not reset when retrying a peer connection, nor does Element Call I think, but I think resetting in this case, e.g. scoping it to the call_id, might actually make more sense as signalling between call_id's should be independent (you always only have one call_id per member in a group call) and there is no point in forcing clients to keep the counter in memory longer than needed, e.g. if they are not joined to the call.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the
m.call.select_answer
event makes sense for MSC 3401 calls.AIUI, the purpose of
m.call.select_answer
is to make your other devices stop ringing once you pick up on one of your devices. To_device messages only seem to be sent to devices already in the call (e.g. are present in them.call.member
state event for a given user), so they would not be sent to devices that haven't picked up yet.Perhaps to stop ringing, the other devices should just monitor the
m.call.member
event for their own user and see if it exists and a device has been added already.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
plus that it happened in the past and who there-and-then participated in it, by correlating it corresponding
m.call.member
state events.