(WIP) Adding docker and docker-compose #43

ddaws · 2024-11-26T03:50:24Z

This PR adds support for running the Project Gutenberg site in a Linux container using Docker and docker-compose (or any alternative OCI-compliant tooling, e.g. podman) to make it easier to start working on the site.

Running the site is now as simple as

$ docker compose up --build

Then open http://localhost:4000 to see the site!

Motivation

I am not a Ruby native and do not have Ruby, Bundler or Jekyll installed on my local machine. To start working on Project Gutenberg I need to install tools directly onto my host machine and may end up using a different version of Ruby, Bundler, libssl (OpenSSL), libxml, etc. All of these differences could result in different behavior when I try to build the site.

The solution to this is to support running the site in a Linux container and version controlling the Linux container build config as a Dockerfile. This allows us to describe the environment used to build the site in source control and gives us stronger guarantees of reproducible builds.

I also think that Docker is much more common than Ruby, and it is more likely someone looking to contribute to Project Gutenberg will have Docker installed than Ruby.

In the 2024 Stack Overflow Developer Survey only 5.2% of respondents reported doing extensive work in Ruby over the past year, while 53.9% of respondents reported using Docker for extensive development work in the past year. In fact Docker was the most popular tooling listed.

ddaws · 2024-12-02T11:50:25Z

@eshellman @gbnewby mind taking a look at this next when you have time? I updated the PR description to describe the motivation for the change, rebased the branch so the diff is much smaller, and updated the README to include updated instructions. Thank you 🙏

eshellman · 2024-12-02T18:30:43Z

Are you proposing to use this as a dev environment? a test environment?

Docker is a good solution for certain types of deployment, but I personally find it cumbersome as a dev environment. in a dev environment you want to make the modify-run cycle as simple as possible, but this makes it more complicated.

For the gutenberg website, we'd love some better automation for deployments. But this is missing autocat3, and the database. (and Python, PHP, and the file systems.) Easy working with branches for example; as it is, switching branches means editing the dockerfile? Shouldn't you have a shared mapping so you can easily inspect the built pages without going through the docker cli?

I'd like to see Greg have a more robust and direct dev/test/deploy workflow for the pieces that are in this repo. He can describe what he does (if he has time). Adding the github actions for testing/CI is part of this, but it can be cumbersome in development.

I'm not a rubyist either, but I'm pretty sure that environment management tools exist; there's all sorts of ruby stuff on my machine and they don't conflict at all.

Finally, getting "stronger guarantees of reproducible builds" won't help much if the deployment environment is not controlled in the same way; I'm not sure that's possible for us.

Ruby is pretty ubiquitous; you can't really use it for containerization. I have a dockerized ebookmaker that I've used for massive builds in the cloud. I get annoyed at it when debugging things that don't work.

Are you using Docker yourself for gutenbergsite?

I guess my bottom line is I'd like to better understand the use case. Who would use it and what would they want to do with it?

ddaws · 2024-12-03T02:08:18Z

Are you proposing to use this as a dev environment? a test environment?

I am proposing adding Docker as a local dev environment.

in a dev environment you want to make the modify-run cycle as simple as possible, but this makes it more complicated.

Agreed, I think this does increase complexity because I did not consider the database or PHP. I really only focused on making it easier to run Jekyll, which would make it harder to run everything together.

Regarding your questions

Easy working with branches for example; as it is, switching branches means editing the dockerfile?

No, you don't need to edit the Dockerfile in this setup. In the docker-compose.yml file there is a volume bind mount that mounts the working directory into the container, so a git checkout on the host will change the file contents in the container. The main container process is jekyll build ... --watch, so as soon as you check out a branch Jekyll will automatically rebuild the static site.
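To make the bind mount concrete, here is a minimal sketch of what such a service definition could look like. It is not the actual docker-compose.yml from this PR: the service name `jekyll`, the container path `/srv/site`, and the use of `bundle exec` are assumptions, and the arguments elided above ("...") are not reproduced.

```yaml
# Hypothetical sketch only; names and paths are assumptions, not the PR's file.
services:
  jekyll:
    # Build the image from the Dockerfile in the repository root.
    build: .
    # Bind mount: the host working directory appears inside the container,
    # so a `git checkout` on the host changes the files the container sees.
    volumes:
      - .:/srv/site
    working_dir: /srv/site
    # Rebuild the static site whenever a watched file changes.
    command: ["bundle", "exec", "jekyll", "build", "--watch"]
```

With a layout like this, switching branches on the host needs no image rebuild, because the sources are mounted at run time rather than copied in at build time.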

Shouldn't you have a shared mapping so you can easily inspect the built pages without going through the docker cli?

Yes, I can also add a bind mount volume for the built pages so they are visible in the host file system without going through the container, as you mentioned.

Are you using Docker yourself for gutenbergsite?

I am starting to because I am still learning and getting comfortable with the project structure.

Who would use it and what would they want to do with it?

I think that developers that want to contribute to Project Gutenberg could use Docker to run the site locally with as little setup as possible. With Docker we really should be able to make it as easy as running docker-compose up to build the site and bring up all of the components locally.

My real motivation

My larger motivation is to improve Project Gutenberg on mobile. I live in a country where most people don't have computers and their only access to the internet is through a mobile device over slow network connections. I think that these are the places where people could really benefit from literary access that their local governments otherwise don't provide.

Right now I am really just trying to understand how to run the project better, and I think that I can add tooling to make it easier for anyone who comes after me.

Improvements

I am going to mark this PR as a draft for now and make some more improvements. How does this sound?

- Update the Jekyll container to build to a bind mount syncing compiled pages to the host
- Run a web server (Apache, Nginx, whatever the project is hosted on) to serve the built pages
- Add a fast-cgi or other PHP container to run the PHP code. Proxy it through the web server
- Add a database container to host the database
- Optionally add a command to pull data to prepopulate the database

This should get it closer to running the full site, including PHP and the database, and it should still be one command (docker-compose up) to run the entire site. I'll need help to figure out what web server is running and how it is configured, what database is running, etc.
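As a rough illustration of the shape this could take, here is a hypothetical multi-service compose file. Everything in it is an assumption made for the sketch: the choice of Apache (`httpd`), PHP-FPM, and PostgreSQL images, their version tags, the service names, and the placeholder password. The web server configuration needed to proxy PHP requests is not shown.

```yaml
# Hypothetical sketch; images, tags, names, and paths are placeholders.
services:
  jekyll:
    build: .
    volumes:
      - .:/srv/site            # sources in, built pages out (Jekyll's default _site)
    working_dir: /srv/site
    command: ["bundle", "exec", "jekyll", "build", "--watch"]

  web:
    image: httpd:2.4           # assumed web server; the real one is still to be confirmed
    ports:
      - "4000:80"              # keep the site on http://localhost:4000
    volumes:
      - ./_site:/usr/local/apache2/htdocs:ro
    depends_on:
      - jekyll

  php:
    image: php:8.3-fpm         # assumed PHP runtime, reached through the web server
    volumes:
      - .:/srv/site:ro

  db:
    image: postgres:16         # assumed database engine and version
    environment:
      POSTGRES_PASSWORD: example   # local development only
```

The intent is that a single `docker-compose up` still brings up every piece; prepopulating the database would be a separate, optional step.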

Any links are appreciated and of course I'll do my homework and start reading 🙏

eshellman · 2024-12-03T02:40:56Z

That sounds great! We agree work is needed. A "WIP" label would be better than draft. To increase availability, I suggest we create a branch in this repo for this work; you can target that branch with a PR. Hopefully anyone with interest can then have access to that branch and propose improvements.

ddaws · 2024-12-03T11:11:57Z

@eshellman I can't add labels to the PR or create branches in this repo because I am not a collaborator on this repository. I'm happy to be added as a collaborator but I'm sure you'd like to see more from me first 🙂

Could you please add the WIP label for me and create a new branch? We could call the branch experiment/docker or exp/docker. Once the branch is created I'll update this PR to point to it.

Regarding the database

I also saw that the database is managed here gutenbergtools/pgdb and that this PR is outstanding. Are the schemas for the database otherwise up to date?

To support being able to run the entire Gutenberg site in a single docker compose up command I will need to create a container for the DB. I'm thinking I can add a Dockerfile to gutenbergtools/pgdb and GitHub Actions to automatically build the database container (without data, but applying all migrations) and publish the database container to the project's GitHub Container Registry on merges into the main branch.

gbnewby · 2024-12-03T15:11:42Z

Hi @dawson. I am not seeing how your goal of creating the entire site in a Docker container is viable. I do think it's viable to have the pages generated by Jekyll - you have demonstrated that. Here is an overview of what else is involved in having a fully functional copy of the www.gutenberg.org website:

First, the two sets of content, currently including 7.7M files in nearly 3TB. This cannot go to github: it's too much content, and also a significant portion of the content (around half) is updated every month. See the "mirroring how-to" at https://www.gutenberg.org for details.

Second, the app server. As you might have seen, Jekyll only manages around 70 pages. There are nearly 75000 other "pages" that are dynamically generated by autocat3 from the database. Those pages are cached by our web infrastructure, but otherwise they do not exist as static files. autocat3 generates many of the pages for things like latest books, top 100, etc. dynamically.

Third, the catalog database psql dump is just under 1GB, and changes multiple times per day. I don't know how you will plan on distributing and ingesting it. It's not currently publicly available, but it could be made publicly available. I don't want to take responsibility for putting it into github, though we could publish it.

Fourth, there is a lot of crufty customization of our .htaccess files, which are currently actually named .acl.gutenbergweb. Also we have some customization in our Apache config files. I think we could make these available as part of the Jekyll collection, but they are not currently published anywhere.

Finally, there are around a dozen cron jobs, around 100K symlinks, and various other components that keep the site complete and up to date. Most of that stuff is not under revision control and it would be a lot of work to try to get it there. As an example, our "top 100" page ultimately comes from some daily jobs that parse the Apache logs and make entries of download counts to the database. Many of these jobs were set up over 10 years ago by long-departed personnel, and we have no documentation and little inspiration to make it suitable to run in arbitrary environments.

I realize these are large challenges, and it's why we haven't tried to make the entire website portable. There are opportunities to do some work to address the last two paragraphs above, for example, but the substantial size and # of files for the first, second and third items mean that it will always be a very big task to stand up a copy.

For your interest in an app, what I'm thinking is you really need an API to access what is already provided by www.gutenberg.org. In fact we have an outdated version of OPDS running, and we hope to upgrade to the new version soon (Eric might have a prognosis on this).

I hope this long explanation is helpful.


eshellman · 2024-12-03T19:18:07Z

It might be useful to create a "toy" version of the website in a docker container. With data and files for ~10 books. Still a huge project to bring together the missing pieces. Might be best to attempt it in a separate repo.
The pgdb repo is being used to snapshot the db schema, not to maintain it. The db was designed before managed migrations were a thing.
I've added a "docker" branch and relabeled the PR.

eshellman · 2024-12-03T19:33:16Z

Since Greg mentioned it, if you know some Python, creating an OPDS 2.0 feed based on the existing OPDS feed (a pre-1.0 version) might be interesting for you.
https://github.com/gutenbergtools/autocat3/blob/master/templates/results.opds
https://drafts.opds.io/opds-2.0.html

ddaws · 2024-12-04T03:44:22Z

I'm happy to take on the OPDS 2.0 feed. I think it will be a good opportunity for me to get an understanding of autocat3. It will probably take me a couple days to get comfortable making changes in the project.

Regarding containerizing the site, my goal would be a toy version of the site with limited data to run locally to support local development. I want it to be really easy to start working on Project Gutenberg. I have some ideas about how to address Greg's concerns that I'll list below

First, the two sets of content, currently including 7.7M files in nearly 3TB

Locally I would look into running an FTP or HTTP proxy. This way you don't need to download all of the files. The files would be pulled and cached on demand without having to update any of the links in the site.

Second, the app server. As you might have seen, Jekyll only manages around 70 pages. There are nearly 75000 other "pages" that are dynamically generated by autocat3 from the database

I'll need to containerize autocat3 in the future as well, and have it run alongside the Jekyll and Apache containers. I am not familiar with autocat3 yet so I am not sure how challenging this will be, but I should have a better understanding of this once I am done implementing OPDS 2.0.

Third, the catalog database psql dump is just under 1GB, and changes multiple times per day. I don't know how you will plan on distributing and ingesting it

It would be great to get a copy of this for my own local work. As far as making this public, I agree that this shouldn't go in git. It could be hosted for download, but I also found garethbjohnson/gutendex that includes an updatecatalog script that populates the database from RDF data.

We could either

- Publish the database dump + a slimmed down DB dump for local dev
- Include a script to roughly populate the database from data feeds (like the updatecatalog script)

Fourth, there is a lot of crufty customization of our .htaccess files, which are currently actually named .acl.gutenbergweb. Also we have some customization in our Apache config files. I think we could make these available as part of the Jekyll collection, but they are not currently published anywhere.

It would be really nice to include these in the git repo. Otherwise it will be trial and error on my part to figure out how the web server is configured.

Finally, there are around a dozen cron jobs, around 100K symlinks, and various other components that keep the site complete and up to date. Most of that stuff is not under revision control and it would be a lot of work to try to get it there.

Maybe this is something I can help with in the future as well 🙂

Ultimately I think I need to start making small contributions so I can learn more. I don't think these challenges are insurmountable, I'll just have to tackle it one piece at a time.

I'm happy to start with the OPDS 2.0 feed, and I'll need some data dump from the DB to do this. Eric or Greg, could you share this with me? Maybe you can just drop it in Google Drive or Dropbox and send me a link, or I could share an SSH key if you want to drop it on a server so I can sftp it.

gbnewby · 2024-12-04T04:53:53Z

This all sounds pretty good.

I put a database dump at https://petascale.org/tmp/2024-12-03.pg_dump (1130754856 bytes, md5sum 6d12c3d104ac0439b1fefd03cb714680)

I put our current .htaccess file at https://petascale.org/tmp/htaccess-pg

I'm happy to provide you all the details you could want about how things work, and copies of things that aren't under git.

Your current plan of looking at OPDS, and looking at autocat3, seems sound. Once you have autocat3 running behind Apache, along with the Jekyll-built pages, you will have the bulk of the website.

You don't need most of the random cron jobs, because most of them result in updates to the catalog database. There are a relatively small number of browsing pages that are output from those jobs. My hope is to retire them, in favor of pages generated from autocat3. A static copy, including for latest_titles.html, might suffice.

Thanks! Greg


Dawson R and others added 3 commits November 26, 2024 11:46

Adding Docker and docker-compose to make it easier to bring the site up

ef10517

Merge branch 'gutenbergtools:master' into feature/add-docker

62d0e71

Updated the README with instructions on how to run the site using Docker

ee16685

ddaws changed the title ~~Adding docker and docker-compose~~ Draft: Adding docker and docker-compose Dec 3, 2024

eshellman changed the title ~~Draft: Adding docker and docker-compose~~ (WIP) Adding docker and docker-compose Dec 3, 2024

ddaws changed the base branch from master to docker December 4, 2024 02:31
