Document caveats around key-value pinset stores #97

Open
lidel opened this issue Jun 22, 2022 · 4 comments
Labels
dif/expert Extensive knowledge (implications, ramifications) required effort/hours Estimated to take one or several hours need/analysis Needs further analysis before proceeding need/community-input Needs input from the wider community P2 Medium: Good to have, but can wait until someone steps up topic/devexp Developer experience related things topic/docs Improvements or additions to documentation

Comments

@lidel
Member

lidel commented Jun 22, 2022

Extracted from ipfs-shipyard/pinning-service-compliance#118 (comment):

Also, is it impossible for IPFS cluster to support pagination/creation-date sorting; or is it something that hasn't been implemented yet? Is there a tracking issue for this?

It is impractical. Cluster does not have a relational-database backend for storing the pins, but just a key value store. Keys don't have sorted IDs, listing keys out from this store can result in random orders. Thus some features like pagination cannot be done without reading everything to memory, sorting, etc. which is a footgun for big pinsets. I think it is ok if cluster does not support pagination. It tries to do its best and it's quite ok that it supports everything else.

I'd like to at the very least update the Pagination and filtering section to loosen up requirements and provide some rules of thumb for service implementations backed by key-value stores.

@hsanjuan @SgtPooki
What is the current behavior of ipfs-cluster around GET /pins, filtering and pagination?
What would be the best compromise we should document?

Some ideas on how to handle "sorting and filtering becomes too expensive" scenarios:

  • (a) pagination and filtering do not work at all, and GET /pins always returns 405 Method Not Allowed
    • simple: if someone needs this, they would use an implementation backed by a database with indexes
  • (b) pagination and filtering work for small pinsets, but start returning 405 Method Not Allowed above a certain number of pins
    • the response includes an error informing the user that sorting is too expensive, and that they need to reduce the number of pins or track them on their own
  • (c) no pagination, no before and after filters (they produce 405 Method Not Allowed), GET /pins returns pins in random order

Are there better ways?
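
As a rough illustration of option (b), here is a minimal Python sketch. It assumes the pinset is exposed as a plain mapping whose keys come back in arbitrary order, so every listing has to sort in memory. The names `list_pins`, `PinsetTooLarge` and `MAX_SORTABLE_PINS`, as well as the threshold value, are hypothetical and not part of the spec or of ipfs-cluster.

```python
from datetime import datetime, timezone

# Hypothetical threshold; a real service would tune this per deployment.
MAX_SORTABLE_PINS = 10_000


class PinsetTooLarge(Exception):
    """Raised when sorting the whole pinset in memory is deemed too expensive."""


def list_pins(store, limit=10, before=None, max_sortable=MAX_SORTABLE_PINS):
    """Return up to `limit` pin records, newest first.

    `store` maps request id -> record, where each record is a dict with a
    `created` datetime. If `before` is given, only records created strictly
    before it are returned.
    """
    if len(store) > max_sortable:
        raise PinsetTooLarge(
            f"pinset has {len(store)} pins; sorting is limited to {max_sortable}"
        )
    # The store has no ordering of its own, so everything is read and sorted.
    pins = sorted(store.values(), key=lambda p: p["created"], reverse=True)
    if before is not None:
        pins = [p for p in pins if p["created"] < before]
    return pins[:limit]
```

An HTTP handler could map `PinsetTooLarge` to whichever status the spec settles on (the list above suggests 405 Method Not Allowed) and put the explanation in the error body.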

@hsanjuan

Listing 50M pins is going to suck in every model, pagination or not. Cluster REST API switched to streaming pins on such requests, to avoid building up results in memory. If the pinning service API allows limit=50M it will likely use a lot of memory on the backend while building the JSON response (unless it encodes on the fly and crosses fingers for no errors to happen). If it allows limit=100, then dealing with such a huge pinset will cost thousands of requests (but at least they can be rate-limited etc). Streaming 50M items is also not fun.
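
To make the "encode on the fly" trade-off concrete, here is a small Python sketch that emits one JSON document per line instead of building the whole response first. The function name `stream_pins` and the newline-delimited framing are assumptions made for illustration; this is not a description of Cluster's actual wire format.

```python
import json
from typing import Iterable, Iterator


def stream_pins(pins: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON document per line, encoding each pin as it is read.

    Memory use stays proportional to a single pin rather than to the whole
    pinset, because nothing is accumulated before being handed to the caller.
    """
    for pin in pins:
        yield json.dumps(pin) + "\n"
```

The catch is the one mentioned above: once the first chunk is on the wire the status line has already been sent, so an error halfway through the pinset can only be signalled by cutting the stream short or by some in-band convention.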

I'm not sure what the best approach is for the Pinning Service API spec. If the pinset is small enough, sure we can implement pagination and everything. But it sucks that /pins stops working if the pinset gets to a certain size. If the pinset is very big, I still need to construct the answer in memory, which also sucks with or without pagination.

In general /pins, without an SQL-like backend that can offer good indexing and sorting, is going to suck as soon as the pinset gets big. But that is probably not the problem of the Pinning Service API spec, but of the implementation?

@hsanjuan

What is the current behavior of ipfs-cluster around GET /pins, filtering and pagination?

To be concrete, filtering is done, limit is done, pagination is not done, and, in general, the /pins endpoint is not apt for big pinsets as it will balloon the backend's memory usage.

@guseggert

guseggert commented Jun 24, 2022

I ran into this same issue when writing a pinning service backed by a key-value store instead of a relational database (DynamoDB, which does have some limited ability to sort and filter keys with secondary indexes).

The biggest problem I ran into is requiring "count" in responses, which is discussed here: #86.

The second problem I ran into is that the query parameters are really complex. To fully support all the variations of sorting and filters, while still returning dense results (which is a requirement for many different kinds of queries, like finding pins with certain statuses, finding pins by CID, pins by name, etc.), is hard to do in a highly-available way, even with a relational DB. If we were to overhaul the API, I'd advocate for removing many of the query params (consider removing filtering by metadata, pick a case sensitivity and be opinionated about it, only support "exact" name matches, only accept a single CID instead of a set, etc.).

@SgtPooki
Member

I agree with Gus here. Including count becomes unreasonable when scaling, and many of the supported pagination and query parameters make fetching pins flexible for the consumers, but extremely difficult for providers.

A better model would be something similar to DynamoDB's limited response size, with on-the-spot pagination keys (nextToken/etc) and then allowing consumers/mid-tier services to filter on the received data.

You can read more about how DynamoDB works at https://www.dynamodbguide.com/the-dynamo-paper/
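
A minimal Python sketch of that token-based model follows. It assumes the backing store can iterate keys in a stable order and resume after a given key; here that is simulated by sorting a dict's keys. The function `scan_page` and the token encoding are hypothetical and do not reflect the DynamoDB API or the Pinning Service API.

```python
import base64
from bisect import bisect_right


def scan_page(store, limit, next_token=None):
    """Return (items, next_token) for one page of a key-ordered scan.

    The token is an opaque encoding of the last key returned. It is None
    once the scan has reached the end of the store.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Stand-in for a store's native ordered iterator (an assumption).
    keys = sorted(store)
    start = 0
    if next_token is not None:
        last_key = base64.urlsafe_b64decode(next_token).decode()
        start = bisect_right(keys, last_key)  # resume just past the last key
    page = keys[start:start + limit]
    items = [store[k] for k in page]
    has_more = start + limit < len(keys)
    token = None
    if has_more:
        token = base64.urlsafe_b64encode(page[-1].encode()).decode()
    return items, token
```

A consumer would call `scan_page` repeatedly, passing back the returned token until it is None, and apply its own filters (status, name, CID) to each page. The server never sorts or counts the full pinset; the cost is that pages are not dense with respect to any filter.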
