Specify container description #227

Open
csarven opened this issue Feb 4, 2021 · 126 comments · May be fixed by #362
Labels
doc: Protocol status: Nominated An issue that has been nominated for the next monthly milestone topic: resource access

Comments

@csarven
Member

csarven commented Feb 4, 2021

Background: to date, the Solid Protocol (including earlier drafts and issues) has only required server-managed containment statements in the representation of a container. Additional information about the contained resources, such as last modification, size, resource type etc., as part of the container representation was deemed to be optional or considered to be a best practice. Examples in the wild show that some servers do make this additional information available, while other servers do not support it. Some applications make use of the information if available, or work around the limitation to get hold of it [Anecdotal Evidence].

General use case:
Support navigation of the container and its contents.

Use cases:

  • Guinan is viewing a list of their social assets (eg. photos, blog posts) and wants to select and view a resource by its human-readable name.
  • Janeway is viewing their inbox and wants to respond to unread notifications from oldest to most recent.
  • Dax is viewing their collection of short-films and wants to delete the ones occupying a significant portion of the available storage quota.
  • Burnham is viewing a list of their crew's personal logs and wants to archive the ones that are created by certain individuals.

Related UCs:

Scenarios to consider:

  • Resources with URIs having non-human-friendly path segments eg. https://example.org/{uuid}
  • Container including mixed resource types eg. containers and non-containers, resources with different formats or media types.
  • Container including public and access controlled resources.

General requirement:
Include descriptions about contained resources in container's description to further support navigation and application interaction.

Specific requirements:

  1. Any information (eg. human-readable label of resources) that may be client or server-managed.
  2. Server-managed (controlled) information (eg. last-modified, resource size, resource types for controlled interaction models)

Considerations:

  • Some resources may require authentication and authorization, and in those cases information about those resources must not leak into the container description.
  • Is there empirical data on container response times making certain kinds of information available about its contained resources?
  • What would the application UX be like when device and network constraints are taken into account?
  • What's the cost for servers when this data is not used by applications?
  • What information must a server make available in container description (besides containment triples)?
  • Caching, caching, caching..

Related issues:

Notes:

  • Some servers may be read-only and do not require authentication or authorization, hence, the requirement to check access privileges per resource (in order to expose additional data about the resource) is inapplicable.
  • If additional information about a resource is made available in the container description, how does that affect write operations on the container eg. server ignores statements with certain (server-managed) properties?
  • Instead of the container resource, the associated description resource of a container (ie. target of describedby) could include information about the contained resources. Doesn't violate best practice on self-describing documents per se but it is perhaps not the most intuitive place to look for additional information about the contained resources.
  • Would requiring the client to explicitly request additional information through the Prefer header be meaningful?
@csarven
Member Author

csarven commented Feb 4, 2021

I find the use cases to include "basic" information about contained resources in the container description compelling. Applications can immediately provide simple functionality by keeping the number of requests/connections minimal. It'd be reasonable to require this level of support on container read operations from servers in order to enable "smart" enough applications to get off the ground without having to resort to more advanced mechanisms.

I would consider last modification and size to be "basic" information. Ditto human-readable label if available. And possibly the creator of the resource. While knowing whether a resource is a container or not (by reading the container description) is very useful, that information can be derived as per shared slash semantics, hence it is not absolutely necessary that the container description includes resource types of contained resources.
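
As a rough illustration of the slash-semantics point above, a client could classify contained resources from their URIs alone. This is a minimal sketch: the helper name `is_container` is hypothetical, and it assumes the server follows the convention that container URIs end with a slash.

```python
from urllib.parse import urlsplit


def is_container(uri: str) -> bool:
    """Return True when the path component of the URI ends with a slash."""
    # Only the path is inspected, so a query string or fragment does not matter.
    return urlsplit(uri).path.endswith("/")


for uri in ("https://example.org/photos/", "https://example.org/photos/cat.jpg"):
    print(uri, is_container(uri))
```

With this, a container description would not need to carry resource types just to tell containers from non-containers.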

@bourgeoa
Member

bourgeoa commented Feb 4, 2021

Can you add a reference to the HTTP/1.1 server specification covering the information that is to be available on the server side?

@acoburn
Member

acoburn commented Feb 4, 2021

I would rephrase the question here to be something more like:

A client needs a mechanism for finding descriptions of contained resources to further support navigation and application interaction.

I disagree that container listing is the best way to do this. A query endpoint (e.g. triple pattern fragments) can achieve the very same end with (arguably) better scalability characteristics.

The basic problem with including this data in a container relates to authorization.

Consider, for example, a container with 100 child resources. A simple GET request to the container will require an access check at the container level. Then 100 subsequent checks would be needed for each child resource. What happens with 1,000 child resources? 10,000 child resources? This does not scale.

The only way this scales is by introducing a paging mechanism such that you limit the scope of authZ enforcement to a predictable window size, which is why I suggest TPF.
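
To make the scaling argument concrete, here is a small sketch that counts access checks for a container read with and without a paging window. The names `describe_page`, `can_read` and `allow_even` are hypothetical, and the access check is only a stand-in for whatever authZ evaluation a server actually performs.

```python
from typing import Callable, List, Tuple


def describe_page(
    children: List[str],
    can_read: Callable[[str], bool],
    offset: int,
    page_size: int,
) -> Tuple[List[str], int]:
    """Return the readable children in one page and the number of access checks made."""
    window = children[offset : offset + page_size]
    checks = 0
    visible = []
    for child in window:
        checks += 1  # one authorization check per child inside the window
        if can_read(child):
            visible.append(child)
    return visible, checks


children = [f"/photos/{i}" for i in range(10_000)]


def allow_even(uri: str) -> bool:
    """Hypothetical policy: only even-numbered resources are readable."""
    return int(uri.rsplit("/", 1)[1]) % 2 == 0


_, unpaged_checks = describe_page(children, allow_even, 0, len(children))
page, paged_checks = describe_page(children, allow_even, 0, 100)
print(unpaged_checks, paged_checks, len(page))
```

The number of checks follows the window size rather than the container size, which is the property paging (or a paged query interface) is meant to provide.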

@csarven
Member Author

csarven commented Feb 4, 2021

best way to do this

for whom? Agree from a server's point of view but not particularly attractive from an application's point of view. It is quite a burden for applications to fetch each resource to get a hold of what they need (along the lines of what's mentioned in the above use cases) in order to provide something usable.

I would consider having to collect the data through a query endpoint relatively more complex than getting it simply from the container representation. Moreover, servers are not required to provide a query endpoint - at the time of this writing - so the basic information wouldn't be consistently available to applications.

If your counter argument/proposal is to address the use cases above by querying, we need to introduce a query mechanism as a hard requirement. (Which would help to meet quite a bit of other needs but that's all besides the point).

This does not scale.

Generally agree but we need empirical data as mentioned. True that a container can theoretically hold an infinite number of resources (I think). Are applications - with the understanding of hierarchical organisation of Solid storage - organising data such that containers with many resources are common (in the wild)? If at all, how is resource organisation or management factored in?

Servers may want to limit the number of members a container can have to a number it is comfortable with. Implementation detail.

Agree on needing pagination as a way to control the cost of a request/response which would be an alternative to above - server fixing the max number of resources allowed per container. Implementation detail.

@acoburn
Member

acoburn commented Feb 4, 2021

It is quite a burden for applications to fetch each resource to get a hold of what they need

This is not what I am suggesting. I agree that such an interaction is a non-starter: there are way too many HTTP round-trips. A query endpoint allows a client to retrieve all the information it needs in a single request.

This does not scale.

Generally agree but we need empirical data as mentioned.

Here is empirical data for a system that implements the "check every child resource" approach: https://wiki.lyrasis.org/display/FF/Many+Members+Performance+Testing You can see response times in the 60 second range for 10K child resources.

@namedgraph

namedgraph commented Feb 4, 2021

Our definition of a container is this RDFS class called dh:Container.

As you can see, there's a related property dh:select that a container resource has. It points to a SPARQL SELECT query that the client can use to select the children resources of the container. Usually it's an entry point to further client-side query building that sets modifiers (LIMIT/OFFSET/ORDER BY), wraps into DESCRIBE etc.

So for example (prefixes missing):

<photos/> a dh:Container ;
  dh:select <queries/select-children/#this> .
  
<queries/select-children/#this> a sp:Select ;
  sp:text "SELECT ?child WHERE { { ?child sioc:has_parent ?this } UNION { ?child sioc:has_container ?this } }". # ?this is a magic variable which binds to the request URI
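
The client-side query building mentioned above could look roughly like the following sketch, which appends solution modifiers to the SELECT text found via dh:select. The function name `paged_select` is hypothetical, and binding ?this is assumed to be left to the server.

```python
from typing import Optional


def paged_select(
    select_text: str,
    limit: int,
    offset: int = 0,
    order_by: Optional[str] = None,
) -> str:
    """Append ORDER BY / LIMIT / OFFSET to a SELECT query that has no modifiers yet."""
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    parts = [select_text.strip()]
    if order_by:
        # In SPARQL the ORDER BY clause comes before LIMIT and OFFSET.
        parts.append(f"ORDER BY {order_by}")
    parts.append(f"LIMIT {limit}")
    parts.append(f"OFFSET {offset}")
    return "\n".join(parts)


base = "SELECT ?child WHERE { ?child sioc:has_parent ?this }"
print(paged_select(base, limit=20, offset=40, order_by="?child"))
```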

@jeff-zucker
Member

Would it make any sense to have the listing of a container's contents follow the permissions on the container rather than the permissions on the contents? For example:

* private resource in a private container
   * unauthorized user can not view anything about the private resource
* private resource in a public container
   * unauthorized user can view size/last-modified/etc. but not GET content of the private resource

This would mean that the server never has to do a mass check of the permissions on its contents but the user would still have the option to hide the server-managed information when that is their intention.

@csarven
Member Author

csarven commented Feb 4, 2021

@acoburn

This is not what I am suggesting.

I know. I said that as the current solution to meet the needs. Querying, pagination or something else is currently not possible (=unspecified).

Thanks re Fedora data, that is useful. It is not easy (for me) to break it down as there are a number of different dimensions with varying values. The test with ~60s is perhaps on the higher end ("perhaps postgres needs caching configured?") - if you can provide more insight on this, that'd be useful. There is a can of worms here re caching of access policies..

Is there something along those lines available for Trellis?

@csarven
Member Author

csarven commented Feb 4, 2021

@namedgraph I presume you can filter based on authorization policy per resource? And the response time for request to /photos/ with different access controls on each contained item is marginally different to if each item is public-read?

@csarven
Member Author

csarven commented Feb 4, 2021

@jeff-zucker

Would it make any sense to have the listing of a container's contents follow the permissions on the container rather than the permissions on the contents?

No because each resource (container or other) can have different access controls. System must not leak any information about contained resources when agent is unauthorized to read those resources - last modification, size etc. are indeed sensitive and should not be exposed. The most a read access on a container permits is the visibility of the containment statements (just references).
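
A minimal sketch of that rule, with hypothetical names (`describe_container`, `can_read`) and a plain dictionary standing in for the RDF description: every contained resource is listed, but metadata is attached only when the agent may read that resource.

```python
from typing import Callable, Dict

Metadata = Dict[str, str]


def describe_container(
    contained: Dict[str, Metadata],
    can_read: Callable[[str], bool],
) -> Dict[str, Metadata]:
    """Build a container description that never leaks metadata of unreadable resources."""
    description: Dict[str, Metadata] = {}
    for uri, metadata in contained.items():
        # The containment reference is always present (read access on the container);
        # last modification, size etc. are included only for readable resources.
        description[uri] = dict(metadata) if can_read(uri) else {}
    return description


contained = {
    "/photos/public.jpg": {"modified": "2021-02-04T10:00:00Z", "size": "2048"},
    "/photos/private.jpg": {"modified": "2021-02-03T09:00:00Z", "size": "4096"},
}
description = describe_container(contained, lambda uri: uri.endswith("public.jpg"))
print(description)
```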

@bblfish
Member

bblfish commented Feb 4, 2021

@acoburn wrote

The only way this scales is by introducing a paging mechanism such that you limit the scope of authZ enforcement to a predictable window size, which is why I suggest TPF.

The LDP group worked quite hard on a spec for paging. See: https://www.w3.org/TR/ldp-paging/

@acoburn
Member

acoburn commented Feb 4, 2021

@csarven

Re: Trellis, that code works as described by @jeff-zucker (authZ decisions are made based on container permissions, not based on access to the child resource). Trellis also does not include any information about the child resources, so it just sidesteps this issue. Consequently, container retrieval is measured in milliseconds.

For Fedora, there was a huge amount of work done related to this issue, and ultimately, many users began finding various work-arounds that just avoided using LDP containment, e.g.:

  • put everything in the root container, block access to that container (since requests would bring down the server) and manually manage all links in the child resources. This approach basically avoids using LDP on an LDP server.
  • create layers of intermediate containers (/container/af/03/21/b8/af0321b8-my-resource) so that no container ever has more than 256 child resources (this is a bit like a really basic paging mechanism though it still requires a lot of round trips)
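
The second work-around can be sketched as follows. The function name `layered_path` is hypothetical, and the sketch assumes resource names start with lowercase hexadecimal characters (as in the example path above), so that each intermediate level has at most 256 (16 × 16) child containers.

```python
HEX_DIGITS = "0123456789abcdef"


def layered_path(root: str, leaf: str, depth: int = 4) -> str:
    """Place a resource under intermediate containers named after pairs of hex characters."""
    prefix = leaf[: depth * 2]
    if len(prefix) < depth * 2 or any(c not in HEX_DIGITS for c in prefix):
        raise ValueError("leaf must start with at least depth * 2 lowercase hex characters")
    # Split the prefix into two-character segments: "af0321b8" -> ["af", "03", "21", "b8"].
    segments = [prefix[i : i + 2] for i in range(0, len(prefix), 2)]
    return "/".join([root.rstrip("/"), *segments, leaf])


print(layered_path("/container", "af0321b8-my-resource"))
```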

In my own experience, the Fedora server just got really, really slow once you had more than a thousand child resources in a single container. There were various attempts to resolve this, but those efforts never really went anywhere with that tech stack. I don't know where things stand these days, but it led to a lot of people abandoning the project.

Re: Query -- I see paging and query as two ways of describing a very similar feature, and they are both really useful.

@namedgraph

@namedgraph I presume you can filter based on authorization policy per resource? And the response time for request to /photos/ with different access controls on each contained item is marginally different to if each item is public-read?

No ACL for children resources, no (yes for containers themselves). Client-side containers are just UI for certain SPARQL queries, and we don't have ACL for plain SPARQL -- only for Linked Data resources. Once you have SPARQL access, you can pretty much see all the data, so it's a privilege to have.

@NoelDeMartin

I recently noticed that ESS does not include the modified time because it's not part of the spec, and that makes apps unusable for large collections. So I'm very happy to see this :). I think my use-case has already been covered in previous comments, but I'll go over it briefly in case it's useful to see it from an app developer's perspective.

What I want to do in my app is reduce the quantity (and size) of network requests. Given that querying is not supported, the solution I've arrived at is caching everything in the client. This makes the first session slower, but makes subsequent sessions faster. It also improves the overall responsiveness of the app, because it doesn't have to make network requests for reading data. However, all of this depends on being able to read only the updates at the start of every session. So far, that's what I've been using the modified time for, and without it I can't think of a way to improve the application start up.
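
A sketch of that start-up step, assuming the container listing exposes a modified time per contained resource. The function names and the dictionary shapes are hypothetical; a real app would parse these values from the container representation.

```python
from datetime import datetime, timezone
from typing import Dict, List


def resources_to_refetch(
    listing: Dict[str, datetime],
    cache: Dict[str, datetime],
) -> List[str]:
    """Return URIs that are new or were modified after the cached copy was stored."""
    return sorted(
        uri
        for uri, modified in listing.items()
        if uri not in cache or modified > cache[uri]
    )


def resources_to_evict(listing: Dict[str, datetime], cache: Dict[str, datetime]) -> List[str]:
    """Return cached URIs that no longer appear in the container listing."""
    return sorted(set(cache) - set(listing))


def utc(day: int) -> datetime:
    return datetime(2021, 2, day, tzinfo=timezone.utc)


cache = {"/tasks/a": utc(1), "/tasks/b": utc(1), "/tasks/old": utc(1)}
listing = {"/tasks/a": utc(1), "/tasks/b": utc(3), "/tasks/c": utc(4)}

print(resources_to_refetch(listing, cache))
print(resources_to_evict(listing, cache))
```

Only the changed and new documents are fetched at the start of a session; without modified times in the listing, every document would have to be requested again.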

Something else that would be useful is knowing the types of resources included in the documents. For example, reading the type index I can find containers that include the types of resources I'm interested in. But that doesn't mean that a container doesn't have other types of resources, and I'd like to avoid reading documents that are not relevant to my app.

I understand that doing this can have an impact on server performance, so I don't have strong opinions as to how this information should be retrieved. I think it would make sense to return only containment triples by default, and use some mechanism like headers to indicate what other types of information are relevant.

Re:pagination, I suppose for really large amounts of data it would be necessary. With my current approach it's actually better to get everything in one request, given that I'll want to read all the documents that are relevant to my application (I was actually using globbing before it was deprecated). Pagination would be useful with query support - at that point I may be able to avoid caching everything - but given the current status this is the only viable solution I found.

@gibsonf1

gibsonf1 commented Feb 5, 2021

For the TrinPod server case in authenticating what RDF data to include in a container request:

We use a fully hierarchical authentication scheme that at the lowest level is a single statement, so our server first retrieves all the information that a request would return without authentication, then does an auth check on each statement to keep only those the authenticated user has access to, and generates the final response from that. The hierarchical nature of the auth check in combination with the cached acls presents virtually no resource hit on the server side.

On the Application side, in creating our Files app which we are finishing now, we are arriving at the idea that a single request to a container should present enough information for the user to intelligently decide what they want to do next, such as expand a child branch of that container. So we would be very happy to support any proposed standards about what to include as part of a container request to improve the UX. I think the paging issue that @acoburn brings up is also very important, so a standard around that would be great too.

At the moment, as standards aren't yet in place, for TrinPod we are including in the response to a container request: all the child nodes of the container with ldp:contains, and then the ldp:contains of those child nodes, as well as the last event triples around the content in the requested container (such as any schema:UpdateAction around that content), of course all filtered by user access permissions.

@csarven
Member Author

csarven commented Feb 5, 2021

* https://www.w3.org/TR/ldp-paging/
* https://www.w3.org/TR/activitystreams-core/#paging

Created issue for resource paging: #230

@gibsonf1

gibsonf1 commented Feb 5, 2021

@csarven I vote to make those two specs part of the Solid standard - but I think also needed would be a recommendation for how many items to include in a given page

@csarven
Member Author

csarven commented Feb 7, 2021

@gibsonf1 If paging is required, I can't see why more than one mechanism is needed. The number of items to include for a paged resource would either be a client preference included in the request, which a server may agree to, or simply the server's own choice (implementation detail).
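
One way a server might settle the page size, sketched under assumptions: the function name and the default and maximum values are hypothetical, and how the client states its preference is left open here.

```python
from typing import Optional


def effective_page_size(
    client_preference: Optional[int],
    server_default: int = 100,
    server_max: int = 500,
) -> int:
    """Honor the client's preferred page size when given, capped by the server's limit."""
    if client_preference is None or client_preference < 1:
        return server_default  # no usable preference: the server uses its own value
    return min(client_preference, server_max)


print(effective_page_size(None), effective_page_size(25), effective_page_size(10_000))
```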

@bblfish
Member

bblfish commented Feb 7, 2021

It would be worth having a comparison between both.

@kjetilk
Member

kjetilk commented Jun 24, 2021

I'm catching up here, and I appreciate that this is a summarization of several different things, and so I don't think it serves to pose this as a single question.

What I'm seeing here are at least these problems:

  1. Augment the data in the container with data to enable apps to present a summary view to the user.
  2. Augment the containment triples with minimal metadata that clients are likely to find useful to perform well.
  3. Ensure that the above data isn't exposed without authorization.

The first case is essentially a generalization of the Data Browser behavior where it looks for index.ttl to augment the view. I believe that this should be solved by having a predicate (e.g. rdfs:seeAlso or a subproperty thereof) in the container representation that points towards a resource that the client should get to do it. The applications will have to deal with authz so that no user gets data they shouldn't get, but I think that is the best solution anyway, as in many cases it may be OK to show a title and a thumbnail, but nothing more. We shouldn't place too many restrictions on this from the spec side.

Number 2 is essentially what we have referred to elsewhere as a File Scan operation. We haven't set down what a File Scan operation is, but in the context of Solid it is pretty clear that a File Scan operation is to read the contents of a container, and it now requires read privileges on the container, and that should be adequate for now.

It is very interesting to read that @gibsonf1 has an implementation that performs well when checking access control for a tree, but in the interest of having a spec that many can implement, at least in the initial versions, I think it is correct to assume that it is rather hard to achieve that performance, as @acoburn has experienced. Thus, at least initially, we should make sure that a File Scan operation can be done with read privileges on the container only. Anything beyond that is not a File Scan operation.

Then, the question becomes what information a File Scan operation can legitimately expose. I think the above discussion and @acoburn's comment in #116 make it very clear that at least the containment triples are a part of the container representation. If you need the hidden file case, then you need to make a child container and then have other permissions on that.

My opinion, at least right now, is that some other attributes, like mtime, type and size, are things that could be a part of the container representation in a File Scan operation. Again, if you need to protect those, make a container with different permissions.

There's also some precedent for this: Apache has a default index that exposes mtime and size.

In conclusion, number 2 above is the File Scan operation, which maps to a read operation on the container in Solid, which exposes containment triples, size, type and mtime as well as other server managed and client managed metadata.

But, there's more! ;-)

It could be argued that computing mtime and size is too heavy for most users, we shouldn't give that unless people ask for it. For that, I suggest we look into defining and registering a Prefer header preference. With this, clients could for example request the container with a Prefer: return=full, which would give them the full representation, including the mtime, size and type. Effectively, this would make it optional for servers to support it, but that's OK.
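
A client-side sketch of such a request, built with the Python standard library and not sent anywhere. The preference value `return=full` is the hypothetical one suggested above and is not a registered preference; the function name `container_request` is also an assumption.

```python
from urllib.request import Request


def container_request(container_uri: str, full: bool = False) -> Request:
    """Build a GET request for a container, optionally asking for the full representation."""
    headers = {"Accept": "text/turtle"}
    if full:
        # Hypothetical preference from this discussion; a server may ignore it.
        headers["Prefer"] = "return=full"
    return Request(container_uri, headers=headers, method="GET")


request = container_request("https://example.org/photos/", full=True)
print(request.get_method(), request.full_url, request.get_header("Prefer"))
```

Because a server is free to ignore a preference it does not support, a client would still need to cope with a response that carries only containment triples.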

@gibsonf1

gibsonf1 commented Nov 13, 2021

TrinPod Contained resources:

For primary file resource:
posix:ctime
posix:uid
schema:fileFormat (example here "application/pdf" )
dcterms:hasVersion (points to actual resource of current file version)
neo:m_last-change
neo:m_tag
rdfs:label
rdf:type (example types for a pdf primary file resource: dcterms:source reg:file sio:SIO_000380 ldp:Resource sio:SIO_000000 opmv:Artifact rdfs:Resource pico:Condition sio:SIO_000776 neo:a_data neo:s_instance neo:e_result neo:s_member neo:s_element neo:substance)

For file version resource (referenced above with dcterms:hasVersion):

posix:mtime
posix:size
neo:m_last-change
rdf:type (example types for file version for above: reg:file sio:SIO_000380 opmv:Artifact sio:SIO_000000 pico:Condition sio:SIO_000776 neo:a_data neo:s_instance neo:s_member neo:s_element neo:e_result neo:substance)

@elf-pavlik
Member

elf-pavlik commented Oct 1, 2022

In the implementation I'm working on I would find it very useful to be able to get the description of the container without any ldp:contains statements. Sometimes we just need to show information about the container but unless the user wants to dive deeper all the containment statements are just a waste of bandwidth.

While I'm not a believer in an average person finding the filesystem an intuitive interface, this comparison still seems useful among spec writers / developers. When I issue the ls command I don't expect tree -L 1. The main difference here is that I expect human-readable labels (e.g. skos:prefLabel) rather than opaque machine-readable IRIs. So for tree -L 1 I would want something like just skos:prefLabel for each contained container but not full tree -L 2 and so on.

I see it very much related to the discussion about possibilities for separating client-managed and server-managed triples. Having only client-managed triples in the response (incl. assigned label) would solve this use case (at least for me).
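
For illustration, the desired response can be sketched as a filter over the container's triples. The tuple representation and the function name `without_containment` are hypothetical; a server would more likely do this while serializing its RDF.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

LDP_CONTAINS = "http://www.w3.org/ns/ldp#contains"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def without_containment(triples: List[Triple]) -> List[Triple]:
    """Keep the description of the container but drop all ldp:contains statements."""
    return [triple for triple in triples if triple[1] != LDP_CONTAINS]


container = "https://example.org/photos/"
triples = [
    (container, SKOS_PREF_LABEL, "Holiday photos"),
    (container, LDP_CONTAINS, "https://example.org/photos/1.jpg"),
    (container, LDP_CONTAINS, "https://example.org/photos/2.jpg"),
]
print(without_containment(triples))
```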

@kjetilk
Member

kjetilk commented Oct 11, 2022

I would be all for entirely server-managed container resources, to make a clean separation between resources that are server managed and not, but I'm not sure that is a possibility at this point.

@elf-pavlik
Member

@kjetilk do you mean:

  • the container resource is fully server managed
  • the Description Resource of the container resource is fully client managed

?

I think this would be much better than the current mixed bag of statements and all the quirkiness around it.
I hope we could still make this change happen!

My preference would be to have the opposite

  • container resources are fully client managed
  • dedicated server-managed auxiliary resource for containment, plus option to negotiate dataset response with both named graphs (Quad support in Solid #291 (comment))

Most likely this change would be too radical.

@kjetilk
Member

kjetilk commented Oct 12, 2022

I would be OK with either, but I suppose both are too radical at this point, and the latter more so than the former.

@elf-pavlik
Member

@kjetilk do you see this change as radical, looking mostly at the former one, due to the impact on existing implementations, or possibly impacting other parts of the spec (or other specs) and requiring cascading changes?

@kjetilk
Member

kjetilk commented Oct 18, 2022

Since current implementations look for containment statements in / (for things like traversal), they would have to look elsewhere, which would require both client and server changes, and a looong transition period, where legacy systems would be around for even longer. That's hard to manage. That's why I think containment statements would have to be in / and therefore that is the one that could be server managed. The cases where the client interacts with / are fewer, and probably need auxiliary data anyway, and so, that seems like a lower bar. But then, I think that too would meet significant opposition.

@elf-pavlik
Member

@kjetilk I recall your comments a little over a year ago in solid/authorization-panel#253 (comment) , solid/authorization-panel#253 (comment) , and solid/authorization-panel#253 (comment)

I think all that we discussed there would be clearer if client-managed and server-managed would be distinct resources with specific access control applied to them. Currently, containers are an exception to resource-level access control. As a result something which should be very simple (allowing the creation of contained resources while disallowing editing the container description), becomes a nightmare. I think reviving AuthZ UCR will give us the opportunity to take another look at how containers and their client-managed descriptions are intended to be used.

IMO if there is a major design issue that we can fix, doing it pre 1.0 might be the best time to do it.

@hzbarcea
Member

hzbarcea commented Mar 4, 2024

Removed Release 0.11.0 milestone per agreement at 2024-02-14 CG meeting.

@NoelDeMartin

Hey @hzbarcea, thanks for the update.

If this is not being included in 0.11 after 3 years of discussions, when can we expect to have this resolved? Looking at the activity in this issue, it seems like this is important to many people. I would like to understand why it's taking so long to be resolved. In the linked meeting notes it says it won't be included because it would block the release, but what does that mean? Is it blocked because we still need to make a decision? Concerns for server implementers? Lack of contributions to the spec?

@csarven
Member Author

csarven commented Mar 13, 2024

@NoelDeMartin, you raise excellent questions and concerns.

Evidently, there are no legitimate details:

  • in the minutes (or for a month now with any clarity or improvements);
  • in reporting of the "agreement" on this issue;
  • in communication elsewhere (e.g., chats) about the whys or plans for the two issues referenced in the minutes;
  • in addition to no documentation on checking with the editors to gain actual insight on the matter
  • or commitments to implement or not implement any particular aspect
  • or hard or soft objections to current requirements
  • or ....

HOWEVER there is considerable detail about who "voted" on a proposal and how they voted to remove an item from a milestone. That's essential for understanding certain aspects of this issue and the social dynamics. No, this issue is not blocking the next release. The entire premise is unsubstantiated. Whether it's included in the next release/milestone or not holds little significance from the ED's perspective. Similar attempts are being made to remove other items from the next milestone without presenting clear justifications or plans. There is nothing constructive here.

All that aside, I will do my best to follow up on dangling technical concerns/open considerations. There aren't that many, but they could be significant when I translate them to PRs because they will 1) clear up misunderstandings and expectations 2) introduce class 3-4 changes because it touches on some other issues/considerations/requests. As I see it, this issue is not something we look at in isolation, and removing it from a milestone indicates a lack of understanding of the concerns initially raised.

As I and others have mentioned in recent meetings, we (CG) try to make progress on the specifications in this incubation space. If/when and in what form a WG takes place has no impact on continuing to take this issue seriously (see my bullet points above for example) now. If anyone has new data, opinion, or initiative on it, they're encouraged to share them. The door is wide open. Suggesting that the WG will handle it is an attempt to limit discussion and, dare I say, influence who gets to "vote".

Projects
Status: Consensus Phase