MSC2846: Decentralizing media through CIDs #2846

Open
wants to merge 5 commits into
base: old_master

@jcgruenhage jcgruenhage commented Nov 2, 2020

Contributor Author

Related to:

@turt2live turt2live added kind:feature MSC for not-core and not-maintenance stuff proposal A matrix spec change proposal proposal-in-review labels Nov 3, 2020
proposals/2846-decentralising-media-through-cids.md (outdated)
1. **MSC2834** proposes to replace MXC URLs with custom hash identifier + hash
strings. This is very similar to what we're doing here, with the difference
of not reusing pre-existing methods like multihash and CIDs. Also, by
removing the server name from the MXC URL, it breaks backwards compatibility
Contributor

This is on purpose - the goal of matrix going forward seems to be to remove the server name from all identifiers over time. It has already happened for events.

#2787 removes the identifier off of the user ID

There were talks about removing the server from the room IDs too.

In short: Any proposal suggesting such global deduplication and not removing the server from the identifier seems to be missing the point :/

Contributor Author

The identifier of the media is very much only the CID. CIDs only match when the content matches, so the server should use the CID as "the one" ID for that media. Keeping the server in the MXC URL makes sense IMO, because it's an indicator of where to get the content from. MXC URLs are not quite the same as user IDs etc, because they are resource locators, not identifiers.
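
To illustrate why a CID can serve as "the one" ID for a piece of media, here is a minimal sketch that derives a CIDv1 from file bytes using only the Python standard library. It assumes the `raw` codec (0x55), a sha2-256 multihash (0x12) and the base32 multibase prefix `b`; the thread does not state which codec or hash function the MSC mandates, so treat those choices and the function name `cid_v1_raw_sha256` as illustrative.

```python
import base64
import hashlib


def cid_v1_raw_sha256(content: bytes) -> str:
    """Build a CIDv1 string for `content` (assumed: raw codec, sha2-256, base32)."""
    digest = hashlib.sha256(content).digest()
    # Multihash = <hash function code> <digest length> <digest>; 0x12 is sha2-256.
    multihash = bytes([0x12, len(digest)]) + digest
    # CIDv1 = <version 1> <content codec> <multihash>; 0x55 is the raw codec.
    cid_bytes = bytes([0x01, 0x55]) + multihash
    # Multibase base32: lowercase RFC 4648 alphabet, no padding, prefix "b".
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


# Identical bytes always produce the identical CID; any change produces another.
first_upload = cid_v1_raw_sha256(b"hello matrix")
second_upload = cid_v1_raw_sha256(b"hello matrix")
modified_upload = cid_v1_raw_sha256(b"hello matrix!")
```

Because the identifier is derived purely from the bytes, two servers that hold the same file arrive at the same CID without coordinating, which is what makes deduplication across servers possible.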

Contributor

That's just it, even the MXC URIs in this proposal are not resource locators anymore. The server part is just "The media could be there", and not "The media will be there". If the media is not on said server, then a different one is used. That is semantically not a resource locator.

Note that #2834 also drops the // off of the URI, making it an identifier rather than a resource locator. Semantically this makes way more sense, as there is no guarantee anymore of any server having said media.

It only makes sense to not special case one server over another, as they are, well, not necessarily better or the like.

Soru also thinks that a small breaking change now will make it way easier for clients and servers to pick up than 9001 breaking changes in a year or so when we finally remove all server identifiers. This breaking change is only minimal, on the client side. Most clients already have a function to turn an MXC URI into an http URI, it is only that function needing updating. As for servers, they can take their time updating. If other servers switch, it does not mean they will become incompatible.
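
For context, the client-side function mentioned above might look roughly like the following sketch. It targets the legacy `mxc://<server>/<media-id>` form and the `/_matrix/media/r0/download/{serverName}/{mediaId}` endpoint that was current when this thread was written; the function name `mxc_to_http` and the example hostnames are made up for illustration.

```python
from urllib.parse import quote, urlparse


def mxc_to_http(mxc_uri: str, homeserver_base: str) -> str:
    """Turn a legacy mxc://server/media_id URI into a download URL on our homeserver."""
    parsed = urlparse(mxc_uri)
    media_id = parsed.path.lstrip("/")
    if parsed.scheme != "mxc" or not parsed.netloc or not media_id:
        raise ValueError(f"not a valid MXC URI: {mxc_uri!r}")
    # The client always asks its own homeserver, which fetches remote media if needed.
    return (
        f"{homeserver_base.rstrip('/')}/_matrix/media/r0/download/"
        f"{quote(parsed.netloc, safe=':')}/{quote(media_id, safe='')}"
    )


download_url = mxc_to_http("mxc://example.org/abc123", "https://matrix.example.com")
```

A server-less identifier would only require changing this one function (plus a matching endpoint on the server), which is the point being made about the size of the breaking change on the client side.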

Contributor

Additionally, if we keep the server then we have two ways to indicate where to fetch media from - after mxc:// and via servers/via. It seems more elegant to only have one way of doing so and thus not special-case any server.

Contributor Author

> The server part is just "The media could be there", and not "The media will be there".

The media will be there, unless the server has disappeared. All this fallback stuff is for servers disappearing, not for servers deleting media they sent.

> It only makes sense to not special case one server over another, as they are, well, not necessarily better or the like.

If we remove the special casing of the server here, I'd still highly suggest that we special case the fetching in another way, which is asking the event sending server first, because that is the server which is the most likely to have access to that file. Right after sending, it is objectively better to just ask the origin server, because no other server will have it yet.

> As for servers, they can take their time updating. If other servers switch, it does not mean they will become incompatible.

It means that we can't start sending them immediately though, and since the upload endpoint isn't tied to a room, we can't tape this to a room version or something like that either.

> Additionally, if we keep the server then we have two ways to indicate where to fetch media from - after mxc:// and via servers/via

Those do have different meanings though, one is the origin that sent the file in the first place, which should be tried before the fallbacks, and the others are just hints on where the clients think the content might also be, in case the origin server is unavailable.

Contributor

@Sorunome Sorunome Nov 3, 2020

> It means that we can't start sending them immediately though, and since the upload endpoint isn't tied to a room, we can't tape this to a room version or something like that either.

Yes, we can. The places where a server passes on an mxc uri are inside of events and profiles. In both of these cases the mxcs are/should be handled opaquely. The only breaking change is how the client resolves the mxc uri to an http uri and adding a new endpoint for that. So, if a server is slow in updating, they can still pass on new mxc URIs, the client using that server just won't be able to display them.

> The media will be there, unless the server has disappeared. All this fallback stuff is for servers disappearing, not for servers deleting media they sent.

Actually, having the server delete the media makes more and more sense with a proposal like this. It will happen if we have "retry via other server", if we plan for it or not. So we might as well plan for it

> If we remove the special casing of the server here, I'd still highly suggest that we special case the fetching in another way, which is asking the event sending server first, because that is the server which is the most likely to have access to that file. Right after sending, it is objectively better to just ask the origin server, because no other server will have it yet.

Just try the first specified servers/via first? Query parameters are an ordered list, after all

> Those do have different meanings though, one is the origin that sent the file in the first place, which should be tried before the fallbacks, and the others are just hints on where the clients think the content might also be, in case the origin server is unavailable.

The point soru is trying to make probably is that the media would just "exist in matrix" like rooms do now, not bound to one server.
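
The ordering question discussed in this exchange (origin first, then hints in the order given) can be sketched as below. The query parameter name `via`, the helper names, and the `fetch` callback are assumptions for illustration; the thread refers to the parameter only as "servers/via" and the final name is not settled here.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import parse_qs


def candidate_servers(origin: Optional[str], query: str) -> List[str]:
    """Origin (if any) first, then `via` hints in the order they were given."""
    hints = parse_qs(query).get("via", [])  # parse_qs keeps repeated values in order
    ordered: List[str] = []
    for server in ([origin] if origin else []) + hints:
        if server not in ordered:  # drop duplicates, keep first position
            ordered.append(server)
    return ordered


def fetch_with_fallback(
    servers: List[str], fetch: Callable[[str], Optional[bytes]]
) -> Optional[Tuple[str, bytes]]:
    """Try each server in order; return the first (server, content) that answers."""
    for server in servers:
        content = fetch(server)
        if content is not None:
            return server, content
    return None


# Simulated network: the origin is gone, one fallback still holds the file.
available = {"b.example": b"file bytes"}
servers = candidate_servers("origin.example", "via=a.example&via=origin.example&via=b.example")
result = fetch_with_fallback(servers, available.get)
```

With a server-less identifier the same code works unchanged by passing `None` as the origin, in which case the first `via` entry plays the role of the preferred server.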

Contributor Author

> Yes, we can.

How so? The MXC URI is created before we know where to put it, so unless we also change the upload end point to include a room we can't afaict.

> The places where a server passes on an mxc uri are inside of events and profiles. In both of these cases the mxcs are/should be handled opaquely.

Yes, servers don't need to care about the URIs they're passing on, but they need to care about the URIs they give out, which was my point.

> The only breaking change is how the client resolves the mxc uri to an http uri and adding a new endpoint for that.

That is not the only breaking change, no, servers also have the breaking change of fetching media.

> Actually, having the server delete the media makes more and more sense with a proposal like this. It will happen if we have "retry via other server", if we plan for it or not. So we might as well plan for it

Well, we already are, servers deleting media shouldn't be much different from servers disappearing here. I find it rather unlikely that servers will delete content they sent unless they know others are keeping it, and we don't have any signalling for that yet

> Just try the first specified servers/via first? Query parameters are an ordered list, after all

Sure, can do that (and the proposal already says to do this)

> The point soru is trying to make probably is that the media would just "exist in matrix" like rooms do now, not bound to one server.

Yes, that is what'd happen here. The server name in the MXC URL is just a hint for where to get the content from, not an authoritative source. Oh, wait, the URL/URI spec says that :// should only be used for authoritative sources, so I guess we'd be violating the URL/URI spec if we did that.

Contributor

> How so? The MXC URI is created before we know where to put it, so unless we also change the upload end point to include a room we can't afaict.

The server can return mxc:m.sha256:hash, and thus for the client, when putting it places, it is an opaque string.

> Yes, servers don't need to care about the URIs they're passing on, but they need to care about the URIs they give out, which was my point.

And only a server that supports the new https endpoints for the media repo would give out an mxc uri without a server name

> That is not the only breaking change, no, servers also have the breaking change of fetching media.

Only if they support server-less mxc uris. For the client that is no different, either the server supports them or not.
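
A client that wants to handle both forms during a transition could distinguish them with a small parser such as this sketch. The server-less shape `mxc:m.sha256:<hash>` is taken from the example in this comment; the `MediaRef` type and `parse_mxc` name are hypothetical, and nothing here validates the hash itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MediaRef:
    server_name: Optional[str]  # None for server-less identifiers
    identifier: str             # media ID, or e.g. "m.sha256:<hash>"


def parse_mxc(uri: str) -> MediaRef:
    """Accept legacy mxc://server/media_id and server-less mxc:<identifier>."""
    if uri.startswith("mxc://"):
        server, sep, media_id = uri[len("mxc://"):].partition("/")
        if not server or not sep or not media_id:
            raise ValueError(f"malformed legacy MXC URI: {uri!r}")
        return MediaRef(server_name=server, identifier=media_id)
    if uri.startswith("mxc:"):
        identifier = uri[len("mxc:"):]
        if not identifier:
            raise ValueError(f"empty server-less MXC URI: {uri!r}")
        return MediaRef(server_name=None, identifier=identifier)
    raise ValueError(f"not an MXC URI: {uri!r}")


legacy = parse_mxc("mxc://example.org/abc123")
serverless = parse_mxc("mxc:m.sha256:deadbeef")
```

Everywhere else in the client the URI can stay an opaque string; only the code that resolves it to an HTTP request needs to branch on `server_name`.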

Contributor Author

I had a call with @Sorunome to discuss this, will likely remove the server names from the identifier here eventually. Exact migration strategy isn't decided yet, will follow in the coming days.

proposals/2846-decentralising-media-through-cids.md (outdated)
it publishes the file to IPFS and that is easy to scrape, but that also means
that fallback nodes are automatically found. Public files in this MSC *could*
be put into IPFS in the future, maybe as an updated version of MSC2706,
without changing the MXC URL format again, as we'd already have CIDs here.
Contributor Author

Actually, wait a second here. I dug deeper into IPFS, while they use CIDs for content addressing, the files themselves aren't directly addressed. Instead they split the files into chunks, generate CIDs for those chunks, encode a metadata struct using protobuf and then generate a CID for that metadata. Therefore: This MSC would not allow us to seamlessly use IPFS in the backend with the same identifiers.

I strongly think that we shouldn't do IPFS-style chunking in a way that the client has to resolve, but we could do it on the server. It would help more with deduplication, that's for sure, but it'd also be a massive jump in scope for this currently fairly short MSC.
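
The mismatch described above can be demonstrated with a deliberately simplified model. This is not IPFS's actual format (which encodes a protobuf metadata node); here the "metadata" is just the concatenated chunk digests, and the chunk sizes are arbitrary. The sketch only shows that a hash over chunk metadata differs from a hash over the file, and that it also depends on how the file was chunked.

```python
import hashlib


def whole_file_digest(data: bytes) -> str:
    """Hash of the file bytes themselves, as this MSC's identifiers would use."""
    return hashlib.sha256(data).hexdigest()


def chunked_root_digest(data: bytes, chunk_size: int) -> str:
    """Simplified stand-in for chunked addressing: hash of the list of chunk hashes."""
    chunk_digests = [
        hashlib.sha256(data[offset:offset + chunk_size]).digest()
        for offset in range(0, len(data), chunk_size)
    ]
    manifest = b"".join(chunk_digests)  # real IPFS uses a protobuf node here
    return hashlib.sha256(manifest).hexdigest()


data = bytes(range(256)) * 16  # 4096 bytes of sample content
direct = whole_file_digest(data)
chunked_1k = chunked_root_digest(data, 1024)
chunked_2k = chunked_root_digest(data, 2048)
```

All three values differ for the same bytes, so an identifier computed one way cannot be looked up in a store keyed the other way without an extra mapping.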

@lidel lidel Feb 25, 2021

FWIW IPFS is looking into supporting CIDs based on the original hash of the file in addition to the chunked one: protocol/beyond-bitswap#29

Alternatively, you could use a CID of a directory with original file and a simple manifest file with things like content-type, hash etc.

@turt2live turt2live removed their request for review December 2, 2020 23:01
@turt2live turt2live added the needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. label Jun 8, 2021

izN8nu6RyeneG5XnBoBgyRMVGH6H43WF commented Jun 19, 2023

Relevant is the recently approved Decentralized Identifier (did:) standard, although there don't seem to be any MSCs applying this to media identifiers yet.

Such an MSC may compete with this one.

implementations will be able to verify additional hashes and try more fallbacks
for fetching content.

The server trying more fallbacks requires that the client hints to their server

To clarify, are these hints "optional" for the client, or the server? Hints may leak information to the server, and while the server is generally expected to be trusted, a client with a defense-in-depth approach may prefer to avoid sending any unnecessary information at all, even if that means the server can't provide fallbacks. For most use cases, this would likely mean it's optional for the client.

If the goal is to maximize client flexibility, then we would also require servers to support both hinted and hint-less requests. However, the MSC states (emphasis added):

> The server trying more fallbacks requires that the client hints...

Does this mean that servers may require hints, or merely that the client shouldn't expect the server to try fallbacks if they don't provide hints? I think the text can be updated to explain whether the server must support both with/without hints, or whether the server can choose to require one or the other.

expected. Clients can additionally verify that the MXC URL they received from
the server actually represents the file that was originally sent.
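
As a rough illustration of the client-side check described in this excerpt, the sketch below decodes a CID and compares its embedded digest against the bytes that were uploaded. It is not part of the proposal text. It assumes a base32 CIDv1 with the raw codec and a sha2-256 multihash, and the helper names are hypothetical; a real client would need to handle whatever codecs and hash functions the MSC ends up allowing.

```python
import base64
import hashlib

# Assumed header: CID version 1, raw codec (0x55), sha2-256 (0x12), 32-byte digest.
EXPECTED_PREFIX = bytes([0x01, 0x55, 0x12, 0x20])


def make_cid(content: bytes) -> str:
    """Build a CID in the assumed format, so the example can be run on its own."""
    raw = EXPECTED_PREFIX + hashlib.sha256(content).digest()
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def cid_matches_content(cid: str, content: bytes) -> bool:
    """Return True only if `cid` is in the assumed format and matches `content`."""
    if not cid.startswith("b"):  # only the base32 multibase prefix is handled here
        return False
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding that multibase strips
    try:
        raw = base64.b32decode(body)
    except ValueError:  # binascii.Error is a ValueError subclass
        return False
    if raw[:4] != EXPECTED_PREFIX or len(raw) != 36:
        return False
    return raw[4:] == hashlib.sha256(content).digest()


uploaded = b"original file bytes"
returned_cid = make_cid(uploaded)
```

Calling `cid_matches_content(returned_cid, uploaded)` after an upload lets the client detect a server that returned an identifier for different bytes than the ones it sent.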

The server should not use v0 CIDs, it should always use v1 CIDs (until we change

In the future, we may support more generalized ways of addressing and retrieving media, as more decentralized file storage technologies mature (e.g. IPFS). Would committing to CIDs and maintaining both CIDs and legacy MXCs impact a future migration to a more generalized content addressing system, such as W3C DIDs?

Labels
kind:feature MSC for not-core and not-maintenance stuff needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. proposal A matrix spec change proposal

6 participants