MSC3635: Early Media for VoIP #3635
Open: dbkr wants to merge 6 commits into main from dbkr/voip-early-media
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,106 @@ | ||
| # MSC3635: Early Media for VoIP | ||
|
|
||
| In PSTN and SIP calls, media can be sent between callee and caller before the callee has accepted | ||
| the call. This allows for things like ringback tones and announcements. | ||
|
|
||
| ## Context | ||
| Early Media is already a well-established concept in SIP. Traditionally, it relies on the | ||
| decoupling of the offer and answer from the INVITE and OK messages, instead allowing the | ||
| answer to be sent in other responses to the INVITE | ||
| (https://datatracker.ietf.org/doc/html/rfc3261#page-80). This simply allows the same media | ||
| channel to be established earlier in the lifetime of the call. | ||
|
|
||
| This method of exchanging early media is known as the Gateway Model. However, | ||
| [RFC3960](https://datatracker.ietf.org/doc/html/rfc3960) explains that this model is "seriously | ||
| limited in the presence of forking", leading to media clipping. Since Matrix is fundamentally | ||
| multi-device and multi-user, these issues may be even more prevalent. | ||
|
|
||
| Furthermore, [RFC3959](https://datatracker.ietf.org/doc/html/rfc3959) explains that application | ||
| servers may not be able to produce an answer for the UAS due to end-to-end encryption. Since all | ||
| WebRTC calls use DTLS, we would expect this problem to occur in Matrix calls. | ||
|
|
||
| Moreover, the gateway model assumes that if, having started to receive early media from one | ||
| endpoint, another endpoint then answers, the UAC can simply switch streams and play the media | ||
| stream from the endpoint that answered. In WebRTC, this would mean supplying a different answer | ||
| from the one originally supplied and switching to a new offer from a different peer with a | ||
| different DTLS fingerprint. This may be viable using the 'pranswer' Session Description Type, | ||
| although it may be considered somewhat of an edge case. | ||
|
|
||
| [RFC3960](https://datatracker.ietf.org/doc/html/rfc3960) proposes the Application Server model | ||
| for SIP early media to address these problems, and strongly recommends it for most situations. | ||
| This essentially establishes separate media sessions for each early media session and the main | ||
| media session by using multipart bodies in SIP messages to send multiple session descriptions | ||
| per SIP message. This allows the media sessions to be distinct, solving the above problems. | ||
| However, it adds significant complexity and the gateway model is still widely used in practice. | ||
|
|
||
| ## Proposal | ||
|
|
||
| This MSC proposes to allow early media in a manner similar to the gateway model above. We do this | ||
| by allowing an `m.call.negotiate` event to be sent by the callee before `m.call.answer`. The `type` | ||
| field MUST be set to `pranswer`. The caller should ignore `m.call.negotiate` events of any other | ||
| type before the `m.call.answer`. Clients using WebRTC compatible APIs should simply be able to | ||
| pass this SDP object into `setRemoteDescription` as-is. In fact, if clients do not explicitly | ||
| discard `m.call.negotiate` before an `m.call.answer`, they may already inadvertently support this | ||
| MSC. | ||
|
|
||
| If the same call is later answered with an `m.call.answer` event, the caller's client passes the | ||
| answer SDP to the WebRTC API just as before: it may do so since the previous SDP was of type | ||
| `pranswer` (https://datatracker.ietf.org/doc/html/rfc8829#section-5.6). | ||
|
|
||
| If the call is not successfully set up, the caller destroys the early media stream. The process of | ||
| tearing down the PeerConnection will do this anyway. | ||
|
|
||
| If a different device answers, the caller's client still passes the answer SDP to the WebRTC API as | ||
| before: this will cause the connection to the device that sent the pranswer to be aborted and | ||
| the connection restarted with the new device. | ||
|
|
||
| If the caller's client receives `pranswer` negotiate events from multiple callee devices, it selects | ||
| one arbitrarily (i.e. most likely the first) and ignores the others. | ||
|
|
||
| Callee clients cannot assume that caller clients support this MSC and therefore must not assume | ||
| that the `pranswer` SDP has been processed (however if they see the ICE connection state change to | ||
| `connected`, they will know that it has). | ||
|
|
||
| It is suggested that the `pranswer` SDP be essentially the same as the `answer` SDP, therefore | ||
| for a normal, bidirectional media call, the `pranswer` would negotiate `sendrecv` media. This | ||
| means the media stream is started and ready to go as soon as the callee answers. It is, of course, | ||
| vital that the callee's client does not play the incoming audio or send any media not explicitly | ||
| intended to be early media (eg. keeps the user's microphone muted) until the user has accepted the | ||
| call. Likewise it is generally advised for the caller's client to keep the user's outbound media | ||
| muted until the call is answered since users are likely to assume they cannot be heard, although | ||
| sometimes early media is used to gather information from callers (eg. PINs for calling cards): | ||
| this would generally be DTMF, but this may require exceptions to this rule. | ||
|
|
||
| It is strongly advised to use this only in setups where the callee is a single device and the only | ||
| user receiving the call, eg. when the callee is a PSTN gateway or similar. It is not intended for | ||
| use on regular clients due to the number of different devices that could potentially send `pranswer`s. | ||
|
|
||
| ## Alternatives | ||
|
|
||
| This MSC opts for the simpler 'gateway model' despite the fact that some of its limitations | ||
| may be more of an issue in the Matrix protocol. The reasons for this are: | ||
|
|
||
| * For interfacing with SIP, we would likely need to support this anyway since this is still quite | ||
| commonly used. | ||
| * It allows for a great deal of functionality with very little overhead, even if it may not be perfect. | ||
| In many scenarios (eg. bridging) there is only one callee device and so one class of problems will | ||
| never manifest. | ||
| * This does not rule out an approach more like the Application Server method in the future, if necessary. | ||
| * It is a very natural fit for the existing WebRTC `pranswer` semantics. | ||
|
|
||
| An alternative would be a proposal negotiating separate media sessions for each early media session and | ||
| the 'real' media session by the callee making a separate offer to the caller using different event types. | ||
|
|
||
| ## Security considerations | ||
|
|
||
| Any client sending a `pranswer` should obviously bear in mind that this will reveal the device is online. | ||
| For this reason (and others, above) it is not advised for end-user clients to send `pranswer`s. | ||
|
|
||
| There are also obvious privacy concerns about establishing media sessions before a call is answered | ||
| if this is not done carefully. Advice for handling this is given in the proposal section. | ||
|
|
||
| In the best case, this only allows a callee to send media to a caller without the caller's client UI | ||
| saying that the call is answered. This could still be somewhat surprising to an unsuspecting caller. | ||
|
|
||
| ## Dependencies | ||
| Depends on [MSC2746](https://github.com/matrix-org/matrix-doc/pull/2746). | ||
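
The caller-side rules in the Proposal section can be sketched as a small decision function. This is only an illustration, not part of the MSC: `CallerState`, `handle_event` and the returned action strings are hypothetical names, the function models only the period before the call is answered, and applying a repeated `pranswer` from the already-selected device is an assumption (the proposal does not cover that case). "apply" stands for passing the SDP to `setRemoteDescription`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallerState:
    """Caller-side bookkeeping for one outgoing call (hypothetical model)."""
    early_media_party: Optional[str] = None  # party_id whose pranswer was applied
    answered_by: Optional[str] = None        # party_id whose answer was applied


def handle_event(state: CallerState, event_type: str, party_id: str,
                 sdp_type: str) -> str:
    """Decide what the caller does with an event from a callee device.

    Returns "apply" (pass the SDP to setRemoteDescription), "ignore",
    or "post-answer" (call already answered; not modelled in this sketch).
    """
    if state.answered_by is not None:
        return "post-answer"

    if event_type == "m.call.answer":
        # The answer may come from a different device than the pranswer;
        # the caller applies it either way.
        state.answered_by = party_id
        return "apply"

    if event_type == "m.call.negotiate":
        if sdp_type != "pranswer":
            # Before the answer, only pranswer negotiate events are honoured.
            return "ignore"
        if state.early_media_party is None:
            # First pranswer seen: select this device for early media.
            state.early_media_party = party_id
            return "apply"
        if state.early_media_party == party_id:
            # Assumption: the proposal does not cover repeated pranswers
            # from the selected device; this sketch applies them.
            return "apply"
        # pranswer from any other device is ignored.
        return "ignore"

    return "ignore"
```

For example, with two callee devices where one sends early media and the other answers:

```python
state = CallerState()
handle_event(state, "m.call.negotiate", "DEVICE_A", "pranswer")  # "apply"
handle_event(state, "m.call.negotiate", "DEVICE_B", "pranswer")  # "ignore"
handle_event(state, "m.call.answer", "DEVICE_B", "answer")       # "apply"
```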