MSC2846: Decentralizing media through CIDs #2846

jcgruenhage · 2020-11-02T23:12:58Z

Rendered

jcgruenhage · 2020-11-02T23:26:28Z

Related to:

[WIP] MSC2706: IPFS as a media repository for Matrix #2706: could work well with a slightly changed version of that MSC
MSC2703: Media ID grammar #2703: slight conflict, as we're adding meaning to the media ID and that MSC makes it explicitly opaque
MSC2834: Media IDs as hashes #2834: competing MSC
MSCNaN: Another MSC @anoadragon453 and I are working on, for authenticating media.

Sorunome · 2020-11-03T07:54:05Z

proposals/2846-decentralising-media-through-cids.md

+1. **MSC2834** proposes to replace MXC URLs with custom hash identifier + hash
+   strings. This is very similar to what we're doing here, with the difference
+   of not reusing pre-existing methods like multihash and CIDs. Also, by
+   removing the server name from the MXC URL, it breaks backwards compatibility


This is on purpose: the goal of Matrix going forward seems to be to remove the server name from all identifiers over time. It has already happened for events.

#2787 removes the server name from the user ID.

There were talks about removing the server from the room IDs too.

In short: any proposal suggesting such global deduplication without removing the server from the identifier seems to be missing the point :/

The identifier of the media is very much only the CID. CIDs only match when the content matches, so the server should use the CID as "the one" ID for that media. Keeping the server in the MXC URL makes sense IMO, because it's an indicator of where to get the content from. MXC URLs are not quite the same as user IDs etc, because they are resource locators, not identifiers.
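To make "CIDs only match when the content matches" concrete, here is a minimal sketch of deriving a CIDv1 from file bytes using only the Python standard library. It assumes the `raw` multicodec (0x55), a sha2-256 multihash (0x12), and base32 multibase encoding (prefix `b`); the MSC text quoted in this thread does not say which codec or hash it mandates, so treat those choices and the function names as illustrative.

```python
import base64
import hashlib

# Assumed parameters, not taken from the MSC text quoted in this thread.
CID_VERSION_1 = 0x01
CODEC_RAW = 0x55       # multicodec code for raw binary content
HASH_SHA2_256 = 0x12   # multihash code for sha2-256


def cid_v1_raw_sha256(content: bytes) -> str:
    """Return a base32 CIDv1 string for the given bytes."""
    digest = hashlib.sha256(content).digest()
    # multihash = <hash function code><digest length><digest>
    multihash = bytes([HASH_SHA2_256, len(digest)]) + digest
    # CIDv1 = <version><content codec><multihash>; every value here is
    # below 128, so each varint is a single byte.
    cid_bytes = bytes([CID_VERSION_1, CODEC_RAW]) + multihash
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32


def matches_cid(content: bytes, cid: str) -> bool:
    """Check that downloaded content hashes to the CID it was requested by."""
    return cid_v1_raw_sha256(content) == cid
```

Because the identifier is derived purely from the bytes, two servers holding the same file would produce the same CID, and a client could run `matches_cid` on whatever any server returns.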

That's just it, even the MXC URIs in this proposal are not resource locators anymore. The server part is just "The media could be there", and not "The media will be there". If the media is not on said server, then a different one is used. That is semantically not a resource locator.

Note that #2834 also drops the // off of the URI, making it an identifier rather than a resource locator. Semantically this makes way more sense, as there is no guarantee anymore of any server having said media.

It only makes sense to not special-case one server over another, as they are, well, not necessarily better or the like.

Soru also thinks that a small breaking change now will make it way easier for clients and servers to pick up than 9001 breaking changes in a year or so when we finally remove all server identifiers. This breaking change is only minimal, on the client side. Most clients already have a function to turn an MXC URI into an http URI, it is only that function needing updating. As for servers, they can take their time updating. If other servers switch, it does not mean they will become incompatible.
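As a rough illustration of the "only that function needs updating" point, the sketch below resolves both the existing `mxc://server/media_id` form and a server-less `mxc:<identifier>` form to an HTTP URL. The first branch uses the `/_matrix/media/r0/download/{serverName}/{mediaId}` endpoint that was current when this thread was written. The path used for the server-less form is entirely hypothetical, since no such endpoint is defined in this thread.

```python
from urllib.parse import quote

# Hypothetical path for server-less media; not defined by any MSC quoted here.
HYPOTHETICAL_SERVERLESS_PATH = "/_matrix/media/unstable/org.example.cid/download"


def mxc_to_http(mxc: str, homeserver: str) -> str:
    """Turn an MXC URI into a download URL on the client's own homeserver."""
    if mxc.startswith("mxc://"):
        # Legacy form: mxc://<server name>/<media id>
        server_name, _, media_id = mxc[len("mxc://"):].partition("/")
        if not server_name or not media_id:
            raise ValueError(f"malformed MXC URI: {mxc!r}")
        return (
            f"{homeserver}/_matrix/media/r0/download/"
            f"{quote(server_name, safe=':')}/{quote(media_id, safe='')}"
        )
    if mxc.startswith("mxc:"):
        # Server-less form, e.g. mxc:m.sha256:<hash>
        identifier = mxc[len("mxc:"):]
        if not identifier:
            raise ValueError(f"malformed MXC URI: {mxc!r}")
        return f"{homeserver}{HYPOTHETICAL_SERVERLESS_PATH}/{quote(identifier, safe=':')}"
    raise ValueError(f"not an MXC URI: {mxc!r}")
```

A client that has not been updated would simply fail at the second branch, which matches the claim that old clients cannot display such media but are otherwise unaffected.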

Additionally, if we keep the server then we have two ways to indicate where to fetch media from - after mxc:// and via servers/via. It seems more elegant to only have one way of doing so and thus not special-case any server.

> The server part is just "The media could be there", and not "The media will be there".

The media will be there, unless the server has disappeared. All this fallback stuff is for servers disappearing, not for servers deleting media they sent.

> It only makes sense to not special-case one server over another, as they are, well, not necessarily better or the like.

If we remove the special casing of the server here, I'd still highly suggest that we special case the fetching in another way, which is asking the event sending server first, because that is the server which is the most likely to have access to that file. Right after sending, it is objectively better to just ask the origin server, because no other server will have it yet.

> As for servers, they can take their time updating. If other servers switch, it does not mean they will become incompatible.

It means that we can't start sending them immediately though, and since the upload endpoint isn't tied to a room, we can't tape this to a room version or something like that either.

> Additionally, if we keep the server then we have two ways to indicate where to fetch media from - after mxc:// and via servers/via

Those do have different meanings though, one is the origin that sent the file in the first place, which should be tried before the fallbacks, and the others are just hints on where the clients think the content might also be, in case the origin server is unavailable.

> It means that we can't start sending them immediately though, and since the upload endpoint isn't tied to a room, we can't tape this to a room version or something like that either.

Yes, we can. The places where a server passes on an MXC URI are inside of events and profiles. In both of these cases the MXC URIs are (or should be) handled as opaque. The only breaking change is how the client resolves the MXC URI to an HTTP URI, plus adding a new endpoint for that. So, if a server is slow in updating, it can still pass on new MXC URIs; the clients using that server just won't be able to display them.

> The media will be there, unless the server has disappeared. All this fallback stuff is for servers disappearing, not for servers deleting media they sent.

Actually, having the server delete the media makes more and more sense with a proposal like this. It will happen if we have "retry via other server", whether we plan for it or not. So we might as well plan for it.

> If we remove the special casing of the server here, I'd still highly suggest that we special case the fetching in another way, which is asking the event sending server first, because that is the server which is the most likely to have access to that file. Right after sending, it is objectively better to just ask the origin server, because no other server will have it yet.

Just try the first specified servers/via first? Query parameters are an ordered list, after all

> Those do have different meanings though, one is the origin that sent the file in the first place, which should be tried before the fallbacks, and the others are just hints on where the clients think the content might also be, in case the origin server is unavailable.

The point soru is trying to make probably is that the media would just "exist in matrix" like rooms do now, not bound to one server.

> Yes, we can.

How so? The MXC URI is created before we know where to put it, so unless we also change the upload endpoint to include a room we can't, afaict.

> The places where a server passes on an MXC URI are inside of events and profiles. In both of these cases the MXC URIs are (or should be) handled as opaque.

Yes, servers don't need to care about the URIs they're passing on, but they need to care about the URIs they give out, which was my point.

> The only breaking change is how the client resolves the MXC URI to an HTTP URI, plus adding a new endpoint for that.

That is not the only breaking change, no, servers also have the breaking change of fetching media.

> Actually, having the server delete the media makes more and more sense with a proposal like this. It will happen if we have "retry via other server", whether we plan for it or not. So we might as well plan for it.

Well, we already are: servers deleting media shouldn't be much different from servers disappearing here. I find it rather unlikely that servers will delete content they sent unless they know others are keeping it, and we don't have any signalling for that yet.

> Just try the first specified servers/via first? Query parameters are an ordered list, after all

Sure, can do that (and the proposal already says to do this)
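The ordering agreed on here (origin server first when one is known, then the `via` hints in the order given) can be sketched roughly as follows. The function names and the injected `fetch_one` callable are hypothetical; the actual request format and error handling are left to the MSC.

```python
from typing import Callable, Iterable, List, Optional


def fetch_order(origin: Optional[str], via: Iterable[str]) -> List[str]:
    """Build the ordered, de-duplicated list of servers to ask for media.

    The origin (if any) comes first because it is the server most likely to
    hold the file right after sending; `via` entries keep their given order.
    """
    candidates: List[str] = []
    for server in ([origin] if origin else []) + list(via):
        if server not in candidates:
            candidates.append(server)
    return candidates


def fetch_with_fallback(
    candidates: Iterable[str],
    fetch_one: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Ask each candidate in turn; return the first content found, else None.

    `fetch_one` is a stand-in for the real federation request and is assumed
    to return None when a server is unreachable or does not have the media.
    """
    for server in candidates:
        content = fetch_one(server)
        if content is not None:
            return content
    return None
```

With a server-less identifier, `origin` would be `None` and the first `via` entry effectively takes over the role of "most likely to have it".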

> The point soru is trying to make probably is that the media would just "exist in matrix" like rooms do now, not bound to one server.

Yes, that is what'd happen here. ~~The server name in the MXC URL is just a hint for where to get the content from, not an authoritative source.~~ Oh, wait, the URL/URI spec says that :// should only be used for authoritative sources, so I guess we'd be violating the URL/URI spec if we did that.

> How so? The MXC URI is created before we know where to put it, so unless we also change the upload endpoint to include a room we can't, afaict.

The server can return mxc:m.sha256:hash, and for the client that is just an opaque string to put wherever it is needed.

> Yes, servers don't need to care about the URIs they're passing on, but they need to care about the URIs they give out, which was my point.

And only a server that supports the new HTTPS endpoints for the media repo would give out an MXC URI without a server name.

> That is not the only breaking change, no, servers also have the breaking change of fetching media.

Only if they support server-less MXC URIs. For the client that is no different: either the server supports them or it does not.

I had a call with @Sorunome to discuss this, will likely remove the server names from the identifier here eventually. Exact migration strategy isn't decided yet, will follow in the coming days.

jcgruenhage · 2020-11-03T12:45:26Z

proposals/2846-decentralising-media-through-cids.md

+   it publishes the file to IPFS and that is easy to scrape, but that also means
+   that fallback nodes are automatically found. Public files in this MSC *could*
+   be put into IPFS in the future, maybe as an updated version of MSC2706,
+   without changing the MXC URL format again, as we'd already have CIDs here.


Actually, wait a second here. I dug deeper into IPFS: while it uses CIDs for content addressing, the files themselves aren't directly addressed. Instead, IPFS splits a file into chunks, generates CIDs for those chunks, encodes a metadata struct using protobuf, and then generates a CID for that metadata. Therefore, this MSC would not allow us to seamlessly use IPFS in the backend with the same identifiers.
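A simplified illustration of why the two identifiers diverge: hashing the whole file gives one digest, while hashing a structure built from per-chunk hashes gives another, and the latter also depends on the chunk size. This is not the real IPFS/UnixFS encoding (which uses protobuf metadata and its own chunker); the root construction below is an assumption made purely to show the effect.

```python
import hashlib


def whole_file_digest(data: bytes) -> str:
    """Digest of the file bytes themselves, as a plain content hash would use."""
    return hashlib.sha256(data).hexdigest()


def chunked_root_digest(data: bytes, chunk_size: int) -> str:
    """Digest of a toy 'metadata node' that lists the per-chunk digests.

    Stand-in for IPFS-style chunking: the root identifies the list of chunk
    hashes, not the original byte stream.
    """
    chunk_digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(chunk_digests)).hexdigest()


data = b"0123456789abcdef"
direct = whole_file_digest(data)
root_small_chunks = chunked_root_digest(data, chunk_size=4)
root_large_chunks = chunked_root_digest(data, chunk_size=8)
```

All three values differ for the same input, which is the mismatch described above: an identifier computed over the raw file cannot be looked up directly as a chunked root.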

I strongly think that we shouldn't do IPFS-style chunking in a way that the client has to resolve, but we could do it on the server. It would help more with deduplication, that's for sure, but it'd also be a massive jump in scope for this currently fairly short MSC.

FWIW IPFS is looking into supporting CIDs based on the original hash of the file in addition to the chunked one: protocol/beyond-bitswap#29

Alternatively, you could use a CID of a directory with original file and a simple manifest file with things like content-type, hash etc.

izN8nu6RyeneG5XnBoBgyRMVGH6H43WF · 2023-06-19T16:59:16Z

Relevant is the recently approved Decentralized Identifier (did:) standard, although there don't seem to be any MSCs applying this to media identifiers yet.

Such an MSC may compete with this one.

izN8nu6RyeneG5XnBoBgyRMVGH6H43WF · 2023-06-19T17:18:18Z

proposals/2846-decentralising-media-through-cids.md

+implementations will be able to verify additional hashes and try more fallbacks
+for fetching content.
+
+The server trying more fallbacks requires that the client hints to their server


To clarify, are these hints "optional" for the client, or the server? Hints may leak information to the server, and while the server is generally expected to be trusted, a client with a defense-in-depth approach may prefer to avoid sending any unnecessary information at all, even if that means the server can't provide fallbacks. For most use cases, this would likely mean it's optional for the client.

If the goal is to maximize client flexibility, then we would also require servers to support both hinted and hint-less requests. However, the MSC states (emphasis added):

> The server trying more fallbacks **requires** that the client hints...

Does this mean that servers may require hints, or merely that the client shouldn't expect the server to try fallbacks if they don't provide hints? I think the text can be updated to explain whether the server must support both with/without hints, or whether the server can choose to require one or the other.

izN8nu6RyeneG5XnBoBgyRMVGH6H43WF · 2023-06-19T17:33:03Z

proposals/2846-decentralising-media-through-cids.md

+expected. Clients can additionally verify that the MXC URL they received from
+the server actually represents the file that was originally sent.
+
+The server should not use v0 CIDs, it should always use v1 CIDs (until we change


In the future, we may support more generalized ways of addressing and retrieving media, as more decentralized file storage technologies mature (e.g. IPFS). Would committing to CIDs and maintaining both CIDs and legacy MXCs impact a future migration to a more generalized content addressing system, such as W3C DIDs?

jcgruenhage added 2 commits November 3, 2020 00:10

MSC for decentralised media with CIDs

6466328

replace mermaid code with images

884b0f6

turt2live added the kind:feature, proposal, and proposal-in-review labels Nov 3, 2020

turt2live requested changes Nov 3, 2020

Sorunome reviewed Nov 3, 2020

jcgruenhage force-pushed the jcgruenhage/msc2846 branch from 21cec8b to 7d54142 Compare November 3, 2020 10:54

put rendered images into the repo

f2702c2

jcgruenhage force-pushed the jcgruenhage/msc2846 branch from 7d54142 to f2702c2 Compare November 3, 2020 11:03

jcgruenhage added 2 commits November 3, 2020 12:18

rename fallback servers query parameter to via

a48d89b

linkify related MSCs

c2ca51f

jcgruenhage requested a review from turt2live November 3, 2020 12:03

jcgruenhage commented Nov 3, 2020

turt2live removed their request for review December 2, 2020 23:01

turt2live added the needs-implementation label Jun 8, 2021

turt2live force-pushed the old_master branch from e895827 to dca99ee Compare August 30, 2021 22:34

turt2live removed the proposal-in-review label May 5, 2022

anoadragon453 mentioned this pull request May 4, 2023

MSC3860: Media Download Redirects #3860

Merged

izN8nu6RyeneG5XnBoBgyRMVGH6H43WF reviewed Jun 19, 2023
