Update OPDS #112

Open
gbnewby opened this issue Nov 6, 2023 · 9 comments

@gbnewby
Contributor

gbnewby commented Nov 6, 2023

Per an email exchange between Eric and Greg, we would like to update OPDS to version 2.0.

Our current OPDS implementation is version 0.9 and is not necessarily working properly.

This will yield the IA/OpenLibrary API, which is stable and has Python wrappers.

The goal is for OPDS to serve as the main public-facing API offered by Project Gutenberg.

@eshellman
Contributor

eshellman commented Nov 7, 2023 via email

@gbnewby
Contributor Author

gbnewby commented Apr 13, 2024

Now that 2024Q1 is complete, would you please follow up with the developers for a status update? @eshellman

@ddaws

ddaws commented Dec 8, 2024

Hey @gbnewby @eshellman, I've been reading about OPDS 2.0 and I wanted to propose an API structure for feedback before starting implementation. Please let me know what you think 🙏

Endpoints

The OPDS 2.0 API is read-only, so please assume all requests are HTTP GET requests to the endpoint. The following proposes a set of initial endpoints; in the future we could extend this to expose more collections and publications based on language, author, series, etc.

/opds/2

This is the base URL and would return an OPDS navigation collection referencing other routes. For example, the following feed endpoints could eventually support the newest and top 100 pages on the Project Gutenberg website.

{
  "metadata": {
    "title": "Project Gutenberg OPDS 2.0 API"
  },
  "links": [
    {"rel": "self", "href": "https://gutenberg.org/opds/2", "type": "application/opds+json"}
  ],
  "groups": [
    {
      "metadata": {"title": "Newest"},
      "navigation": [
        {
          "href": "/opds/2/new/1d", 
          "title": "Newest last 24 hours", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/new"
        },
       {
          "href": "/opds/2/new/7d", 
          "title": "Newest last 7 days", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/new"
        },
        {
          "href": "/opds/2/new/30d", 
          "title": "Newest", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/new"
        }
      ]
    },
    {
      "metadata": {"title": "Top 100"},
      "navigation": [
        {
          "href": "/opds/2/top100/books/1d", 
          "title": "Top 100 books in the last 24 hours", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/popular"
        },
        {
          "href": "/opds/2/top100/books/7d", 
          "title": "Top 100 books in the last 7 days", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/popular"
        },
       {
          "href": "/opds/2/top100/books/30d", 
          "title": "Top 100 books in the last 30 days", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/popular"
        },
        {
          "href": "/opds/2/top100/authors/1d", 
          "title": "Top 100 authors in the last 24 hours", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/popular"
        },
        {
          "href": "/opds/2/top100/authors/7d", 
          "title": "Top 100 authors in the last 7 days", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/popular"
        },
        {
          "href": "/opds/2/top100/authors/30d", 
          "title": "Top 100 authors in the last 30 days", 
          "type": "application/opds+json", 
          "rel": "http://opds-spec.org/sort/popular"
        }
      ]
    }
  ]
}
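As a rough illustration, the navigation feed above could be generated rather than hand-written. This is only a sketch: the helper names (`nav_link`, `build_root_feed`) and the `PERIODS` table are hypothetical and not part of autocat3, and the third "Newest" title is spelled out as "Newest last 30 days" for uniformity.

```python
import json

OPDS_TYPE = "application/opds+json"

# Period suffixes used in the proposed routes, mapped to a human-readable label.
PERIODS = {"1d": "24 hours", "7d": "7 days", "30d": "30 days"}


def nav_link(href, title, rel):
    """Build one OPDS 2.0 navigation link object."""
    return {"href": href, "title": title, "type": OPDS_TYPE, "rel": rel}


def build_root_feed(base_url="https://gutenberg.org"):
    """Return the /opds/2 navigation feed as a plain dict (hypothetical helper)."""
    newest = [
        nav_link(f"/opds/2/new/{period}", f"Newest last {label}",
                 "http://opds-spec.org/sort/new")
        for period, label in PERIODS.items()
    ]
    popular = [
        nav_link(f"/opds/2/top100/{kind}/{period}",
                 f"Top 100 {kind} in the last {label}",
                 "http://opds-spec.org/sort/popular")
        for kind in ("books", "authors")
        for period, label in PERIODS.items()
    ]
    return {
        "metadata": {"title": "Project Gutenberg OPDS 2.0 API"},
        "links": [{"rel": "self", "href": f"{base_url}/opds/2", "type": OPDS_TYPE}],
        "groups": [
            {"metadata": {"title": "Newest"}, "navigation": newest},
            {"metadata": {"title": "Top 100"}, "navigation": popular},
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_root_feed(), indent=2))
```

Generating the links from one table of periods keeps the route names and titles in sync if a period is added or removed later.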

For the newest and top 100 feeds, the endpoints would be suffixed with the time period (1d, 7d, 30d) because the OPDS spec doesn't include search query parameters for time-windowed ranges. If we are comfortable deviating slightly from the spec, we could implement these more concisely as search endpoints like

{
  "metadata": {
    "title": "Project Gutenberg OPDS 2.0 API"
  },
  "links": [
    {"rel": "self", "href": "https://gutenberg.org/opds/2", "type": "application/opds+json"}
  ],
  "navigation": [
   {
      "href": "/opds/2/search/authors{?query,from,to,sort,sortOrder,limit,etc...}", 
      "title": "Authors search endpoint", 
      "type": "application/opds+json", 
      "rel": "search"
    },
    {
      "href": "/opds/2/search/books{?query,from,to,sort,sortOrder,limit,etc...}", 
      "title": "Books search endpoint", 
      "type": "application/opds+json", 
      "rel": "search"
    }
  ]
}

In this case we could query the top 100 authors in the last 7 days with the query parameters

GET /opds/2/search/authors?from=now-7d&to=now&sort=downloads&sortOrder=desc&limit=100

Some things to note

  • from and to could be expressed as an ISO 8601 date time (e.g., 2024-12-08T14:30:00Z) or as a relative date marker like today, 1d, 1w, etc. This provides readable query strings and makes testing easier. A list of proposed relative date markers is included at the end of this post
  • sort and sortOrder determine the way collections are sorted. A list of proposed search parameters is included at the end of this post

The nice thing about this is that it is extensible. We could expose the newest additions to Project Gutenberg via

GET /opds/2/search/books?sort=createdAt&sortOrder=desc # Assuming we track the created at time in the DB

We could expand this to provide feeds for the top Russian authors in the past 7 days by adding support for a language query parameter and using the endpoint

GET /opds/2/search/authors?from=today-7d&to=today&sort=downloads&sortOrder=desc&language=ru

I propose supporting both the search endpoints and the more verbose, fully OPDS 2 compliant endpoints. We could expose an /opds/2/top100/authors/7d endpoint and have it effectively alias /opds/2/search/authors?from=now-7d&to=now&sort=downloads&sortOrder=desc (that is, call through to the search controller class for the author OPDS feed). This way we expose a 100% OPDS 2.0 compliant endpoint (/opds/2/top100/authors/7d) as well as a more extensible search endpoint that we can use to build collections dynamically.

Note: The "aliasing" would happen in the code by having an endpoint call the controller method for another endpoint with prepopulated query parameters. I can show an example of this in my PR when we're aligned on the structure of the API; I do not think it will require duplicating any code.
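To make the aliasing idea concrete, here is a minimal sketch under stated assumptions. `search_authors` is a hypothetical stand-in for the real search controller (it only returns the equivalent search URL instead of querying a database), and `top100_authors` and `ALLOWED_PERIODS` are likewise illustrative names, not existing code.

```python
from urllib.parse import urlencode

# Periods exposed by the proposed /opds/2/top100/authors/{period} route.
ALLOWED_PERIODS = ("1d", "7d", "30d")


def search_authors(params):
    """Stand-in for the search controller: returns the equivalent search URL."""
    return "/opds/2/search/authors?" + urlencode(params)


def top100_authors(period):
    """Alias endpoint: delegate to the search controller with preset parameters."""
    if period not in ALLOWED_PERIODS:
        raise ValueError(f"unsupported period: {period!r}")
    return search_authors({
        "from": f"now-{period}",
        "to": "now",
        "sort": "downloads",
        "sortOrder": "desc",
        "limit": 100,
    })


if __name__ == "__main__":
    print(top100_authors("7d"))
```

The alias holds no query logic of its own, so adding a new fixed feed would only mean adding another thin wrapper with different preset parameters.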

/opds/2/new/{period}

This endpoint would effectively alias the /opds/2/search/books{?query,from,to,sort,sortOrder,etc...} endpoint.

For example, the /opds/2/new/7d endpoint would resolve to

GET /opds/2/search/books?from=today-7d&to=today&sort=createdAt&sortOrder=desc

This would return something like the following

{
  "metadata": {
    "title": "Newest additions to Project Gutenberg"
  },
  "links": [
    {"rel": "self", "href": "https://gutenberg.org/opds/2/new/7d", "type": "application/opds+json"}
  ],
  "publications": [
    {
      "metadata": {
        "@type": "http://schema.org/EBook",
        "title": "Moby-Dick",
        "author": "Herman Melville",
        "identifier": "urn:isbn:978031600000X",
        "language": "en",
        "modified": "2015-09-29T17:00:00Z"
      },
      "links": [
        {"rel": "self", "href": "https://gutenberg.org/opds/2/books/by/id/12345", "type": "application/opds-publication+json"},
        { "rel": "http://opds-spec.org/acquisition/open-access", "href": "https://www.gutenberg.org/ebooks/12345.epub.noimages", "type": "application/epub+zip"}
        // ...
      ],
      "images": [
        {"href": "http://example.org/cover.jpg", "type": "image/jpeg", "height": 1400, "width": 800},
        // ...
      ]
    }
    // More books listed...
  ]
}

For any result set that exceeds maxPageSize items, pagination would be included, based on the pagination parameters defined in the OPDS 2 spec.
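A sketch of how the pagination metadata and links might be computed follows. The metadata keys (numberOfItems, itemsPerPage, currentPage) and the first/previous/next/last link relations follow the OPDS 2.0 pagination section as I understand it; the `page` query parameter and the `paginate` helper are assumptions made for illustration only.

```python
import math
from urllib.parse import urlencode

OPDS_TYPE = "application/opds+json"


def paginate(base_href, total_items, page, per_page):
    """Return (metadata, links) describing one page of a feed.

    The "page" query parameter is an assumption; the proposal does not name one.
    """
    last_page = max(1, math.ceil(total_items / per_page))

    def link(rel, number):
        href = f"{base_href}?{urlencode({'page': number})}"
        return {"rel": rel, "href": href, "type": OPDS_TYPE}

    links = [link("self", page), link("first", 1), link("last", last_page)]
    if page > 1:
        links.append(link("previous", page - 1))
    if page < last_page:
        links.append(link("next", page + 1))

    metadata = {
        "numberOfItems": total_items,
        "itemsPerPage": per_page,
        "currentPage": page,
    }
    return metadata, links


if __name__ == "__main__":
    meta, page_links = paginate("/opds/2/new/7d", 250, 2, 100)
    print(meta)
    for entry in page_links:
        print(entry["rel"], entry["href"])
```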

/opds/2/books/by/id/{bookId}

This endpoint is linked to from collection endpoints (like our search, new, and top 100 endpoints) and returns the information for a specific publication. The endpoint is structured as /opds/2/books/by/id/{bookId} to give us the flexibility to support retrieving publications by other types of identifiers later. For example, in the future we could add support for /opds/2/books/by/isbn/{bookISBN}. This might be useful for applications that integrate our APIs and aren't aware of our internal IDs but have an ISBN and want to quickly look up publication information against Project Gutenberg.

This endpoint would return OPDS publication information. For example

{
  "metadata": {
    "@type": "http://schema.org/EBook",
    "title": "Moby-Dick",
    "author": "Herman Melville",
    "identifier": "urn:isbn:978031600000X",
    "language": "en",
    "modified": "2015-09-29T17:00:00Z"
  },
  "links": [
    {"rel": "self", "href": "https://gutenberg.org/opds/2/books/by/id/12345", "type": "application/opds-publication+json"},
    { "rel": "http://opds-spec.org/acquisition/open-access", "href": "https://www.gutenberg.org/ebooks/12345.epub.noimages", "type": "application/epub+zip"}
    // ...
  ],
  "images": [
    {"href": "http://example.org/cover.jpg", "type": "image/jpeg", "height": 1400, "width": 800},
    // ...
  ]
}

Initially we will return basic information about the book and the files associated with the book, and in the future we can add support for

  • Linking to series collections
  • Linking to translations
  • Providing acquisition links
  • Etc...

The main differences between the information returned by the search endpoint and the publication endpoint are

  • The results of a search endpoint may change over time (number of downloads change, new books, etc)
  • The publication results in the search endpoint will always be concise to keep the response small
  • The publication results in the publication endpoint can layer in more information and data as we support it
  • The publication endpoint is stable, ie always provides information on the same publication

Summary

I think that we should implement fully OPDS 2.0 compliant endpoints to ensure we support applications that strictly implement the spec, and we should implement search endpoints that implement a superset of the spec to support querying and building collections dynamically. I do not think this will significantly increase the complexity or create code duplication, and should make it easier to implement more facets and collections in the future.

I would really appreciate feedback, and when we are aligned I can propose an implementation plan to break this into multiple PRs so we can merge endpoints one at a time and see incremental progress 😃

Appendix

Proposed relative date markers

  • now --> the current time
  • today --> the start of day in server time, to align with the interval periods that update the top 100 collections
  • Nd --> N days, used as now-3d or today-7d

We could add support for weeks (w) and minutes (m), and optionally yesterday (yesterday, aka today-1d) but I don't know if we have a real use case for this.
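The markers above could be resolved with a small parser along these lines. This is a sketch, not a settled design: `parse_date_marker` is a hypothetical name, UTC stands in for "server time", and only the `now`, `today`, and `-Nd` forms plus ISO 8601 strings are handled.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches "now", "today", "now-3d", "today-7d".
_MARKER = re.compile(r"^(now|today)(?:-(\d+)d)?$")


def parse_date_marker(value, now=None):
    """Resolve a relative marker or an ISO 8601 string to an aware datetime.

    Assumes UTC as the server time zone; pass `now` to make results reproducible.
    """
    now = now or datetime.now(timezone.utc)
    match = _MARKER.match(value)
    if match is None:
        # Not a relative marker: treat it as ISO 8601 (a trailing "Z" means UTC).
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    anchor, days = match.groups()
    if anchor == "today":
        base = now.replace(hour=0, minute=0, second=0, microsecond=0)
    else:
        base = now
    return base - timedelta(days=int(days or 0))


if __name__ == "__main__":
    reference = datetime(2024, 12, 8, 14, 30, tzinfo=timezone.utc)
    for marker in ("now", "today", "now-3d", "today-7d", "2024-12-08T14:30:00Z"):
        print(marker, "->", parse_date_marker(marker, reference).isoformat())
```

Accepting an explicit `now` argument keeps the function easy to test, which matches the stated goal of readable, testable query strings.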

Proposed query parameters

  • All of the query parameters defined in the Readium Default Context as part of the OPDS 2.0 spec
  • from --> The starting date time in a time windowed search
  • to --> The ending date time in a time windowed search
  • sort --> The property to sort on
    • downloads --> The number of downloads
    • createdAt --> The date time the publication (ebook) was added to Project Gutenberg
  • sortOrder --> asc or desc
  • limit --> The maximum number of results
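One way to validate these parameters before they reach the search controller is sketched below. The defaults (sort by downloads, descending) and the cap of 100 results are assumptions chosen for illustration, as is the `normalize_search_params` name.

```python
SORT_FIELDS = {"downloads", "createdAt"}
SORT_ORDERS = {"asc", "desc"}
MAX_LIMIT = 100  # assumed cap; the proposal does not fix a value


def normalize_search_params(raw):
    """Validate raw query-string values and fill in assumed defaults."""
    sort = raw.get("sort", "downloads")
    if sort not in SORT_FIELDS:
        raise ValueError(f"unsupported sort: {sort!r}")

    order = raw.get("sortOrder", "desc")
    if order not in SORT_ORDERS:
        raise ValueError(f"unsupported sortOrder: {order!r}")

    try:
        limit = int(raw.get("limit", MAX_LIMIT))
    except ValueError:
        raise ValueError("limit must be an integer") from None
    if limit < 1:
        raise ValueError("limit must be at least 1")

    return {
        "query": raw.get("query", ""),
        "from": raw.get("from"),
        "to": raw.get("to"),
        "sort": sort,
        "sortOrder": order,
        "limit": min(limit, MAX_LIMIT),
    }


if __name__ == "__main__":
    print(normalize_search_params({"sort": "createdAt", "limit": "500"}))
```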

Edits

  1. Fixed spelling of OPDS and opds (I had written ODPS and odps and then copy-pasted everywhere 😅 )
  2. Fixed url (http://projectgutenberg.org => https://gutenberg.org)

@gbnewby
Contributor Author

gbnewby commented Dec 8, 2024 via email

@ddaws

ddaws commented Dec 9, 2024

My curiosity is about the output these queries will generate

The output format is strictly defined by the OPDS 2.0 spec, but how we structure our API, aka the endpoints/routes we expose, is up to us. The OPDS 2.0 spec is pretty good, and because it relies on JSON-LD it makes it easy to "discover" endpoints by following linked data.

How might the output be used to [...] build web pages?

For example, the "Frequently Downloaded" page that lists top 100 books yesterday could be populated by sending an HTTP GET request to /opds/2/top100/books/1d. This would return a result like

{
  "metadata": {
    "title": "Top 100 books yesterday"
  },
  "links": [
    {"rel": "self", "href": "https://gutenberg.org/opds/2/top100/books/1d", "type": "application/opds+json"}
  ],
  "publications": [
    {
      "metadata": {
        "@type": "http://schema.org/EBook",
        "title": "Moby-Dick",
        "author": "Herman Melville",
        "identifier": "urn:isbn:978031600000X",
        "language": "en",
        "modified": "2015-09-29T17:00:00Z"
      },
      "links": [
        {"rel": "self", "href": "https://gutenberg.org/opds/2/books/by/id/12345", "type": "application/opds-publication+json"},
        { "rel": "http://opds-spec.org/acquisition/open-access", "href": "https://gutenberg.org/ebooks/12345.epub.noimages", "type": "application/epub+zip"}
        // ...
      ],
      "images": [
        {"href": "http://example.org/cover.jpg", "type": "image/jpeg", "height": 1400, "width": 800},
        // ...
      ]
    }
    // More books listed...
  ]
}

This includes all of the information required to populate the "Top 100 EBooks yesterday" list, and more, and the OPDS spec supports additional metadata if we want to improve the listing (to include images, download links, alt languages, etc)

The page could be server-side rendered by having the page controller call through to the OPDS controller, or client-side rendered by having the client browser call the /opds/2/top100/books/1d endpoint.

Similarly, the feed of latest books on the landing page could be populated by sending an HTTP GET request to /opds/2/new/1d. This endpoint would return the exact same JSON structure (defined in the OPDS spec) with different publications. The response could also include paging parameters so the client could scroll through the latest additions.
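As a sketch of how a page controller might turn such a feed into list rows, the function below extracts the title, author, and open-access download link from each publication. `summarize_publications` is a hypothetical helper, and it assumes each link's rel is a single string, as in the examples in this thread.

```python
OPEN_ACCESS_REL = "http://opds-spec.org/acquisition/open-access"


def summarize_publications(feed):
    """Extract the fields a listing page needs from an OPDS 2.0 feed dict."""
    rows = []
    for publication in feed.get("publications", []):
        metadata = publication.get("metadata", {})
        # Assumes "rel" is a single string, as in the examples in this thread.
        download = next(
            (link["href"] for link in publication.get("links", [])
             if link.get("rel") == OPEN_ACCESS_REL),
            None,
        )
        rows.append({
            "title": metadata.get("title"),
            "author": metadata.get("author"),
            "download": download,
        })
    return rows


if __name__ == "__main__":
    sample_feed = {
        "metadata": {"title": "Top 100 books yesterday"},
        "publications": [{
            "metadata": {"title": "Moby-Dick", "author": "Herman Melville"},
            "links": [{
                "rel": OPEN_ACCESS_REL,
                "href": "https://gutenberg.org/ebooks/12345.epub.noimages",
                "type": "application/epub+zip",
            }],
        }],
    }
    for row in summarize_publications(sample_feed):
        print(row)
```

Because the new and top 100 feeds share the same structure, the same function would work for both pages.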

@eshellman
Contributor

eshellman commented Dec 9, 2024 via email

@ddaws

ddaws commented Dec 10, 2024

Have you downloaded an OPDS client?

I haven't 😅 I will do this today.

So the most natural channels for PG (along with the top feeds) would probably be the "bookshelves".

Exposing endpoints for bookshelves makes a lot of sense. My initial goal would be to expose feeds for top and new, and then add support for bookshelves. I think this is okay because the base path, /opds/2, would return an OPDS 2.0 group of navigations. This uses JSON-LD to effectively tell consumers what endpoints to hit to get specific publication feeds.

So we could start by just exposing the top and new feeds because these are an easier first implementation, and then we could add in bookshelves soon thereafter.

Our current implementation is triggered by adding ".opds" to a url. This is not a common implementation. So for example, https://gutenberg.org/ebooks/25344.opds instead of https://gutenberg.org/ebooks/25344, or https://gutenberg.org/ebooks/bookshelf/435.opds instead of https://gutenberg.org/ebooks/bookshelf/435

This makes sense. I want to avoid changing the current OPDS 1.x implementation to avoid breaking any consumers that I am not aware of. I think that we should implement an entirely different set of paths (base path = /opds/2/) because

  • It gives us flexibility to support different path patterns in the future
    • For example, /opds/2/books/by/id/{id} or /opds/2/books/by/isbn/{isbn}. This allows us to resolve the same information using different identifiers which could be useful to integrators. We don't need to do this unless there is a use case, but mounting this API under a different base path gives us this flexibility in the future.
  • It simplifies Apache routing rules
    • We could run two autocat3 processes. We could route all non /opds/2 requests to process 1, and all /opds/2/* to process 2. This would allow us to assign different resources to each process (via Linux cgroups) to ensure the API doesn't starve out the main autocat3 process and vice versa.

We can return the Content-Type: application/opds+json header to tell consumers these routes return OPDS 2.0 in the response.
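For illustration only, a bare WSGI handler that sets this header could look like the sketch below. It uses plain WSGI rather than whatever framework autocat3 actually uses, and `opds_root_app` and the tiny feed body are hypothetical.

```python
import json

OPDS_TYPE = "application/opds+json"


def opds_root_app(environ, start_response):
    """Minimal WSGI app that serves a tiny OPDS 2.0 feed with the right media type."""
    feed = {
        "metadata": {"title": "Project Gutenberg OPDS 2.0 API"},
        "links": [
            {"rel": "self", "href": "https://gutenberg.org/opds/2", "type": OPDS_TYPE}
        ],
    }
    body = json.dumps(feed).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", OPDS_TYPE),
        ("Content-Length", str(len(body))),
    ])
    return [body]


if __name__ == "__main__":
    # Serve on localhost for manual testing with an OPDS client or curl.
    from wsgiref.simple_server import make_server

    with make_server("127.0.0.1", 8000, opds_root_app) as server:
        server.serve_forever()
```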

Then work with api consumers to address their specific needs

Yup, makes a lot of sense 👍


I am going to go use some OPDS clients to get a better first hand understanding and will follow up. It probably also doesn't hurt to start work on a PoC, and we can change the paths around as we get aligned 🙂

@eshellman
Contributor

eshellman commented Dec 10, 2024 via email

@eshellman
Contributor

I've created an opds branch you can target
