MSC2846: Decentralizing media through CIDs #2846
Related to:
1. **MSC2834** proposes to replace MXC URLs with custom hash identifier + hash
strings. This is very similar to what we're doing here, with the difference
of not reusing pre-existing methods like multihash and CIDs. Also, by
removing the server name from the MXC URL, it breaks backwards compatibility
This is on purpose - the goal of matrix going forward seems to be to remove the server name from all identifiers over time. It has already happened for events.
#2787 removes the identifier off of the user ID
There were talks about removing the server from the room IDs too.
In short: Any proposal suggesting such global deduplication and not removing the server from the identifier seems to be missing the point :/
The identifier of the media is very much only the CID. CIDs only match when the content matches, so the server should use the CID as "the one" ID for that media. Keeping the server in the MXC URL makes sense IMO, because it's an indicator of where to get the content from. MXC URLs are not quite the same as user IDs etc, because they are resource locators, not identifiers.
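To illustrate the "CIDs only match when the content matches" point, here is a minimal sketch of deriving a v1 CID from file bytes and checking it. It assumes the `raw` multicodec (0x55), a sha2-256 multihash (0x12) and lowercase base32 multibase (prefix `b`); the MSC may settle on different choices, and the function names are made up for this example.

```python
import base64
import hashlib


def cid_v1_raw_sha256(content: bytes) -> str:
    """Build a CIDv1 string for `content` (assumed: raw codec, sha2-256, base32)."""
    digest = hashlib.sha256(content).digest()
    # Multihash = <hash function code><digest length><digest>; 0x12 is sha2-256.
    multihash = bytes([0x12, len(digest)]) + digest
    # CIDv1 = <version 0x01><content codec 0x55 (raw)><multihash>.
    # All of these values are below 0x80, so each varint is a single byte.
    cid_bytes = bytes([0x01, 0x55]) + multihash
    # Multibase base32: lowercase, no padding, prefixed with "b".
    encoded = base64.b32encode(cid_bytes).decode("ascii").rstrip("=").lower()
    return "b" + encoded


def matches_cid(content: bytes, expected_cid: str) -> bool:
    """Return True if `content` hashes to `expected_cid` under the assumptions above."""
    return cid_v1_raw_sha256(content) == expected_cid
```

A client that downloads media could call `matches_cid(downloaded_bytes, cid_from_uri)` before displaying it, regardless of which server the bytes came from.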
That's just it, even the MXC URIs in this proposal are not resource locators anymore. The server part is just "The media could be there", and not "The media will be there". If the media is not on said server, then a different one is used. That is semantically not a resource locator.
Note that #2834 also drops the `//` off of the URI, making it an identifier rather than a resource locator. Semantically this makes way more sense, as there is no guarantee anymore of any server having said media.
It only makes sense to not special case one server over another, as they are, well, not necessarily better or the like.
Soru also thinks that a small breaking change now will make it way easier for clients and servers to pick up than 9001 breaking changes in a year or so when we finally remove all server identifiers. This breaking change is only minimal, on the client side. Most clients already have a function to turn an MXC URI into an http URI, it is only that function needing updating. As for servers, they can take their time updating. If other servers switch, it does not mean they will become incompatible.
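For reference, the client-side function mentioned above tends to look roughly like the sketch below. It assumes the legacy `mxc://<server>/<media id>` form and the `/_matrix/media/r0/download/{serverName}/{mediaId}` endpoint; the function and parameter names are illustrative only. A server-less identifier would need a new branch here and a new endpoint, neither of which is defined yet.

```python
from urllib.parse import quote, urlsplit


def mxc_to_http(mxc_uri: str, homeserver_base_url: str) -> str:
    """Turn a legacy mxc://server/media_id URI into a download URL on the
    client's own homeserver (assumed r0 media endpoint)."""
    parts = urlsplit(mxc_uri)
    media_id = parts.path.lstrip("/")
    if parts.scheme != "mxc" or not parts.netloc or not media_id:
        # Server-less forms such as mxc:m.sha256:... end up here on purpose:
        # this sketch only knows the legacy shape.
        raise ValueError(f"not a legacy MXC URI: {mxc_uri!r}")
    server = quote(parts.netloc, safe=":")  # keep an explicit port intact
    return (
        f"{homeserver_base_url.rstrip('/')}"
        f"/_matrix/media/r0/download/{server}/{quote(media_id, safe='')}"
    )
```

Since this is typically the only place a client interprets the URI, it is also the only place that would have to change on the client side.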
Additionally, if we keep the server then we have two ways to indicate where to fetch media from - after `mxc://` and via `servers`/`via`. It seems more elegant to only have one way of doing so and thus not special-case any server.
The server part is just "The media could be there", and not "The media will be there".
The media will be there, unless the server has disappeared. All this fallback stuff is for servers disappearing, not for servers deleting media they sent.
It only makes sense to not special case one server over another, as they are, well, not necessarily better or the like.
If we remove the special casing of the server here, I'd still highly suggest that we special case the fetching in another way, which is asking the event sending server first, because that is the server which is the most likely to have access to that file. Right after sending, it is objectively better to just ask the origin server, because no other server will have it yet.
As for servers, they can take their time updating. If other servers switch, it does not mean they will become incompatible.
It means that we can't start sending them immediately though, and since the upload endpoint isn't tied to a room, we can't tape this to a room version or something like that either.
Additionally, if we keep the server then we have two ways to indicate where to fetch media from - after `mxc://` and via `servers`/`via`
Those do have different meanings though, one is the origin that sent the file in the first place, which should be tried before the fallbacks, and the others are just hints on where the clients think the content might also be, in case the origin server is unavailable.
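The "origin first, then the hints in order" behaviour could be sketched as follows. This is only an illustration: the `fetch` callable stands in for whatever federation request a server would make, and all names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def candidate_servers(origin: Optional[str], via: List[str]) -> List[str]:
    """Order servers to try: the origin (if any) first, then the hints in the
    order they were given, skipping duplicates."""
    ordered: List[str] = []
    for server in ([origin] if origin else []) + list(via):
        if server not in ordered:
            ordered.append(server)
    return ordered


def fetch_with_fallback(
    origin: Optional[str],
    via: List[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[Tuple[str, bytes]]:
    """Try each candidate in order; return (server, content) for the first hit,
    or None if no server has the media."""
    for server in candidate_servers(origin, via):
        content = fetch(server)
        if content is not None:
            return server, content
    return None
```

If the server name were dropped from the identifier, the same loop would still work with `origin=None`, and the first hint would simply take the origin's place.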
It means that we can't start sending them immediately though, and since the upload endpoint isn't tied to a room, we can't tape this to a room version or something like that either.
Yes, we can. The places where a server passes on an mxc uri are inside of events and profiles. In both of these cases the mxcs are/should be handled as opaque. The only breaking change is how the client resolves the mxc uri to an http uri and adding a new endpoint for that. So, if a server is slow in updating, they can still pass on new mxc URIs, the client using that server just won't be able to display them.
The media will be there, unless the server has disappeared. All this fallback stuff is for servers disappearing, not for servers deleting media they sent.
Actually, having the server delete the media makes more and more sense with a proposal like this. It will happen if we have "retry via other server", if we plan for it or not. So we might as well plan for it
If we remove the special casing of the server here, I'd still highly suggest that we special case the fetching in another way, which is asking the event sending server first, because that is the server which is the most likely to have access to that file. Right after sending, it is objectively better to just ask the origin server, because no other server will have it yet.
Just try the first specified `servers`/`via` first? Query parameters are an ordered list, after all
Those do have different meanings though, one is the origin that sent the file in the first place, which should be tried before the fallbacks, and the others are just hints on where the clients think the content might also be, in case the origin server is unavailable.
The point soru is trying to make probably is that the media would just "exist in matrix" like rooms do now, not bound to one server.
Yes, we can.
How so? The MXC URI is created before we know where to put it, so unless we also change the upload end point to include a room we can't afaict.
The places where a server passes on an mxc uri are inside of events and profiles. In both of these cases the mxcs are/should be handled as opaque.
Yes, servers don't need to care about the URIs they're passing on, but they need to care about the URIs they give out, which was my point.
The only breaking change is how the client resolves the mxc uri to an http uri and adding a new endpoint for that.
That is not the only breaking change, no, servers also have the breaking change of fetching media.
Actually, having the server delete the media makes more and more sense with a proposal like this. It will happen if we have "retry via other server", if we plan for it or not. So we might as well plan for it
Well, we already are, servers deleting media shouldn't be much different from servers disappearing here. I find it rather unlikely that servers will delete content they sent unless they know others are keeping it, and we don't have any signalling for that yet
Just try the first specified `servers`/`via` first? Query parameters are an ordered list, after all
Sure, can do that (and the proposal already says to do this)
The point soru is trying to make probably is that the media would just "exist in matrix" like rooms do now, not bound to one server.
Yes, that is what'd happen here. The server name in the MXC URL is just a hint for where to get the content from, not an authoritative source. Oh, wait, the URL/URI spec says that `://` should only be used for authoritative sources, so I guess we'd be violating the URL/URI spec if we did that.
How so? The MXC URI is created before we know where to put it, so unless we also change the upload end point to include a room we can't afaict.
The server can return `mxc:m.sha256:hash` and thus, for the client, when putting it places, it is an opaque string.
Yes, servers don't need to care about the URIs they're passing on, but they need to care about the URIs they give out, which was my point.
And only a server that supports the new https endpoints for the media repo would give out an mxc uri without a server name
That is not the only breaking change, no, servers also have the breaking change of fetching media.
Only if they support server-less mxc uris. For the client that is no different, either the server supports them or not.
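To make the "either the server supports them or not" distinction concrete, here is a rough sketch of telling the two shapes apart. It assumes the legacy `mxc://server/media_id` form and the server-less `mxc:m.sha256:hash` form mentioned above; the `MediaRef` type and `parse_mxc` name are invented for this example and are not part of either proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MediaRef:
    server: Optional[str]     # set only for legacy mxc://server/media_id URIs
    identifier: str           # media ID, or the hash for server-less URIs
    algorithm: Optional[str]  # e.g. "m.sha256" for server-less URIs


def parse_mxc(uri: str) -> MediaRef:
    """Classify an MXC URI as legacy or server-less (assumed shapes only)."""
    if uri.startswith("mxc://"):
        server, sep, media_id = uri[len("mxc://"):].partition("/")
        if not server or not sep or not media_id:
            raise ValueError(f"malformed legacy MXC URI: {uri!r}")
        return MediaRef(server=server, identifier=media_id, algorithm=None)
    if uri.startswith("mxc:"):
        algorithm, sep, digest = uri[len("mxc:"):].partition(":")
        if not algorithm or not sep or not digest:
            raise ValueError(f"malformed server-less MXC URI: {uri!r}")
        return MediaRef(server=None, identifier=digest, algorithm=algorithm)
    raise ValueError(f"not an MXC URI: {uri!r}")
```

A server that has not been updated would only ever produce and resolve the first shape, while still passing the second one through untouched inside events and profiles.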
I had a call with @Sorunome to discuss this, will likely remove the server names from the identifier here eventually. Exact migration strategy isn't decided yet, will follow in the coming days.
it publishes the file to IPFS and that is easy to scrape, but that also means
that fallback nodes are automatically found. Public files in this MSC *could*
be put into IPFS in the future, maybe as an updated version of MSC2706,
without changing the MXC URL format again, as we'd already have CIDs here.
Actually, wait a second here. I dug deeper into IPFS, while they use CIDs for content addressing, the files themselves aren't directly addressed. Instead they split the files into chunks, generate CIDs for those chunks, encode a metadata struct using protobuf and then generate a CID for that metadata. Therefore: This MSC would not allow us to seamlessly use IPFS in the backend with the same identifiers.
I strongly think that we shouldn't do IPFS-style chunking in a way that the client has to resolve, but we could do it on the server. It would help more with deduplication, that's sure, but it'd also be a massive jump in scope for this currently fairly short MSC.
FWIW IPFS is looking into supporting CIDs based on the original hash of the file in addition to the chunked one: protocol/beyond-bitswap#29
Alternatively, you could use a CID of a directory with original file and a simple manifest file with things like content-type, hash etc.
Relevant is the recently approved Decentralized Identifier (did:) standard, although there don't seem to be any MSCs applying this to media identifiers yet. Such an MSC may compete with this one.
implementations will be able to verify additional hashes and try more fallbacks
for fetching content.

The server trying more fallbacks requires that the client hints to their server
To clarify, are these hints "optional" for the client, or the server? Hints may leak information to the server, and while the server is generally expected to be trusted, a client with a defense-in-depth approach may prefer to avoid sending any unnecessary information at all, even if that means the server can't provide fallbacks. For most use cases, this would likely mean it's optional for the client.
If the goal is to maximize client flexibility, then we would also require servers to support both hinted and hint-less requests. However, the MSC states (emphasis added):
The server trying more fallbacks requires that the client hints...
Does this mean that servers may require hints, or merely that the client shouldn't expect the server to try fallbacks if they don't provide hints? I think the text can be updated to explain whether the server must support both with/without hints, or whether the server can choose to require one or the other.
expected. Clients can additionally verify that the MXC URL they received from
the server actually represents the file that was originally sent.

The server should not use v0 CIDs, it should always use v1 CIDs (until we change
In the future, we may support more generalized ways of addressing and retrieving media, as more decentralized file storage technologies mature (e.g. IPFS). Would committing to CIDs and maintaining both CIDs and legacy MXCs impact a future migration to a more generalized content addressing system, such as W3C DIDs?