(WIP) Adding docker and docker-compose #43

Open
wants to merge 3 commits into
base: docker

ddaws commented Nov 26, 2024

This PR adds support for running the Project Gutenberg site in a Linux container using Docker and docker-compose (or any alternative OCI-compliant tooling, e.g. Podman), to make it easier to start working on the site.

Running the site is now as simple as

$ docker compose up --build

Then open http://localhost:4000 to see the site!

Motivation

I am not a Ruby native and do not have Ruby, Bundler, or Jekyll installed on my local machine. To start working on Project Gutenberg I need to install tools directly onto my host machine, and I may end up using different versions of Ruby, Bundler, libssl (OpenSSL), libxml, etc. Any of these differences could result in different behavior when I try to build the site.

The solution to this is to support running the site in a Linux container and version controlling the Linux container build config as a Dockerfile. This allows us to describe the environment used to build the site in source control and gives us stronger guarantees of reproducible builds.

I also think that Docker is much more common than Ruby, and it is more likely someone looking to contribute to Project Gutenberg will have Docker installed than Ruby.

In the 2024 Stack Overflow Developer Survey, only 5.2% of respondents reported doing extensive work in Ruby over the past year, while 53.9% of respondents reported using Docker for extensive development work in the past year. In fact, Docker was the most popular tool listed.

@ddaws
Author

ddaws commented Dec 2, 2024

@eshellman @gbnewby mind taking a look at this next when you have time? I updated the PR description to describe the motivation for the change, rebased the branch so the diff is much smaller, and updated the README to include updated instructions. Thank you 🙏

@eshellman
Collaborator

Are you proposing to use this as a dev environment? a test environment?

Docker is a good solution for certain types of deployment, but I personally find it cumbersome as a dev environment. In a dev environment you want to make the modify-run cycle as simple as possible, but this makes it more complicated.

For the Gutenberg website, we'd love some better automation for deployments. But this is missing autocat3 and the database (and Python, PHP, and the file systems). Easy working with branches, for example; as it is, switching branches means editing the Dockerfile? Shouldn't you have a shared mapping so you can easily inspect the built pages without going through the Docker CLI?

I'd like to see Greg have a more robust and direct dev/test/deploy workflow for the pieces that are in this repo. He can describe what he does (if he has time). Adding the github actions for testing/CI is part of this, but it can be cumbersome in development.

I'm not a rubyist either, but I'm pretty sure that environment management tools exist; there's all sorts of ruby stuff on my machine and they don't conflict at all.

Finally, getting "stronger guarantees of reproducible builds" won't help much if the deployment environment is not controlled in the same way; I'm not sure that's possible for us.

Ruby is pretty ubiquitous; you can't really use it for containerization. I have a dockerized ebookmaker that I've used for massive builds in the cloud. I get annoyed at it when debugging things that don't work.

Are you using Docker yourself for gutenbergsite?

I guess my bottom line is I'd like to better understand the use case. Who would use it and what would they want to do with it?

@ddaws
Author

ddaws commented Dec 3, 2024

Are you proposing to use this as a dev environment? a test environment?

I am proposing adding Docker as a local dev environment.

in a dev environment you want to make the modify-run cycle as simple as possible, but this makes it more complicated.

Agreed, I think this does increase complexity because I did not consider the database or PHP. I really only focused on making it easier to run Jekyll, which would make it harder to run everything together.

Regarding your questions

Easy working with branches for example; as it is, switching branches means editing the dockerfile?

No, you don't need to edit the Dockerfile in this setup. The docker-compose.yml file defines a bind mount that mounts the working directory into the container, so a git checkout on the host changes the file contents in the container. The main container process is jekyll build ... --watch, so as soon as you check out a branch, Jekyll automatically rebuilds the static site.
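
For illustration, a minimal sketch of the kind of bind-mount setup described here. This is not the docker-compose.yml from the PR: the service name, the /site path, and the exact Jekyll command are assumptions, and the PR's own jekyll build flags are elided in this thread and not reproduced.

```yaml
# docker-compose.yml (illustrative sketch, not the file from this PR)
services:
  jekyll:                        # hypothetical service name
    build: .                     # assumes a Dockerfile at the repository root
    working_dir: /site           # hypothetical path inside the container
    # The PR passes additional flags to "jekyll build" that are not shown
    # in this thread; only --watch is confirmed by the discussion.
    command: bundle exec jekyll build --watch
    volumes:
      # Bind mount: the host working tree is the container's source tree,
      # so "git checkout" on the host changes the files Jekyll is watching.
      - .:/site
```

With a layout like this, switching branches on the host triggers a rebuild without touching the Dockerfile. On some hosts, file-change events may not propagate through bind mounts, in which case Jekyll's --force_polling option may be needed.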

Shouldn't you have a shared mapping so you can easily inspect the built pages without going through the docker cli?

Yes, I can also add a bind mount for the built pages so they are visible in the host file system without going through the container, as you mentioned.

Are you using Docker yourself for gutenbergsite?

I am starting to because I am still learning and getting comfortable with the project structure.

Who would use it and what would they want to do with it?

I think that developers that want to contribute to Project Gutenberg could use Docker to run the site locally with as little setup as possible. With Docker we really should be able to make it as easy as running docker-compose up to build the site and bring up all of the components locally.

My real motivation

My larger motivation is to improve Project Gutenberg on mobile. I live in a country where most people don't have computers and their only access to the internet is through a mobile device over slow network connections. I think these are the places where people could really benefit from literary access that their local governments otherwise don't provide.

Right now I am really just trying to understand how to run the project better, and I think that I can add tooling to make it easier for anyone who comes after me.

Improvements

I am going to mark this PR as a draft for now and make some more improvements. How does this sound?

  • Update the Jekyll container to build to a bind mount syncing compiled pages to the host
  • Run a web server (Apache, Nginx, whatever the project is hosted on) to serve the built pages
  • Add a FastCGI (or other) PHP container to run the PHP code, proxied through the web server
  • Add a database container to host the database
  • Optionally add a command to pull data to prepopulate the database

This should get it closer to running the full site, including PHP and the database, and it should still be one command (docker-compose up) to run the entire site. I'll need help to figure out what web server is running and how it is configured, what database is running, etc.
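
As a rough sketch only, the list above could translate into a compose file shaped like the following. Everything concrete here is an assumption: the thread has not yet established which web server or database the site runs on, so Apache (httpd) and PostgreSQL are placeholders, as are the service names, ports, paths, and the password. The Apache configuration needed to proxy PHP requests to the PHP container is not shown.

```yaml
# docker-compose.yml (hypothetical multi-service layout, not from this PR)
services:
  jekyll:
    build: .                       # assumes a Dockerfile at the repo root
    working_dir: /site
    command: bundle exec jekyll build --watch
    volumes:
      - .:/site                    # Jekyll writes to ./_site by default, visible on the host

  web:
    image: httpd:2.4               # placeholder: actual production server not confirmed here
    ports:
      - "8080:80"                  # hypothetical host port
    volumes:
      - ./_site:/usr/local/apache2/htdocs:ro   # serve the built pages read-only
    depends_on:
      - jekyll
      - php

  php:
    image: php:8.3-fpm             # placeholder PHP FastCGI container
    volumes:
      - ./_site:/var/www/html:ro

  db:
    image: postgres:16             # placeholder: database engine and version are assumptions
    environment:
      POSTGRES_PASSWORD: example   # local development only
    volumes:
      # Optional: SQL files placed here are run on first start to prepopulate data
      - ./initdb:/docker-entrypoint-initdb.d:ro
```

The point of the layout is that each missing piece becomes one more service, while docker-compose up stays the single entry point.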

Any links are appreciated and of course I'll do my homework and start reading 🙏

@ddaws changed the title from "Adding docker and docker-compose" to "Draft: Adding docker and docker-compose" on Dec 3, 2024
@eshellman
Collaborator

That sounds great! We agree work is needed. A "WIP" label would be better than draft. To increase availability, I suggest we create a branch in this repo for this work; you can then target that branch with a PR. Hopefully anyone with interest can then have access to that branch and propose improvements.

@ddaws
Author

ddaws commented Dec 3, 2024

@eshellman I can't add labels to the PR or create branches in this repo because I am not a collaborator on this repository. I'm happy to be added as a collaborator but I'm sure you'd like to see more from me first 🙂

Could you please add the WIP label for me and create a new branch? We could call the branch experiment/docker or exp/docker. Once the branch is created I'll update this PR to point to it.

Regarding the database

I also saw that the database is managed here gutenbergtools/pgdb and that this PR is outstanding. Are the schemas for the database otherwise up to date?

To support running the entire Gutenberg site with a single docker compose up command, I will need to create a container for the DB. I'm thinking I can add a Dockerfile to gutenbergtools/pgdb, plus GitHub Actions to automatically build the database container (without data, but with all migrations applied) and publish it to the project's GitHub Container Registry on merges into the main branch.
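
A minimal sketch of what such a workflow might look like, assuming a Dockerfile at the root of the pgdb repository and a default branch named main. The workflow file name and the image tag are hypothetical; nothing like this exists in the repository as far as this thread shows.

```yaml
# .github/workflows/publish-db-image.yml (hypothetical file name)
name: Publish database image

on:
  push:
    branches: [main]               # runs on merges into main

permissions:
  contents: read
  packages: write                  # required to push to GitHub Container Registry

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Authenticate to ghcr.io with the token GitHub provides to the workflow
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      # Build the image from the repository's Dockerfile and push it
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: ghcr.io/gutenbergtools/pgdb:latest   # hypothetical image name
```

A compose file in the site repo could then reference the published image instead of building the database locally.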

@gbnewby
Collaborator

gbnewby commented Dec 3, 2024 via email

@eshellman changed the title from "Draft: Adding docker and docker-compose" to "(WIP) Adding docker and docker-compose" on Dec 3, 2024
@eshellman
Collaborator

It might be useful to create a "toy" version of the website in a docker container. With data and files for ~10 books. Still a huge project to bring together the missing pieces. Might be best to attempt it in a separate repo.
The pgdb repo is being used to snapshot the db schema, not to maintain it. The db was designed before managed migrations were a thing.
I've added a "docker" branch and relabeled the PR.

@eshellman
Collaborator

Since Greg mentioned it, if you know some python, creating an OPDS 2.0 feed based on the existing OPDS feed (a pre-1.0 version) might be interesting for you.
https://github.com/gutenbergtools/autocat3/blob/master/templates/results.opds
https://drafts.opds.io/opds-2.0.html

@ddaws changed the base branch from master to docker on December 4, 2024 02:31
@ddaws
Author

ddaws commented Dec 4, 2024

I'm happy to take on the OPDS 2.0 feed. I think it will be a good opportunity for me to get an understanding of autocat3. It will probably take me a couple days to get comfortable making changes in the project.

Regarding containerizing the site, my goal would be a toy version of the site with limited data to run locally to support local development. I want it to be really easy to start working on Project Gutenberg. I have some ideas about how to address Greg's concerns that I'll list below

First, the two sets of content, currently including 7.7M files in nearly 3TB

Locally I would look into running an FTP or HTTP proxy. This way you don't need to download all of the files. The files would be pulled and cached on demand without having to update any of the links in the site.

Second, the app server. As you might have seen, Jekyll only manages around 70 pages. There are nearly 75000 other "pages" that are dynamically generated by autocat3 from the database

I'll need to containerize autocat3 in the future as well, and have it run alongside the Jekyll and Apache containers. I am not familiar with autocat3 yet so I am not sure how challenging this will be, but I should have a better understanding of this once I am done implementing OPDS 2.0

Third, the catalog database psql dump is just under 1GB, and changes multiple times per day. I don't know how you will plan on distributing and ingesting it

It would be great to get a copy of this for my own local work. As far as making this public, I agree that this shouldn't go in git. It could be hosted for download, but I also found garethbjohnson/gutendex that includes an updatecatalog script that populates the database from RDF data.

We could either

  • Publish the database dump + a slimmed down DB dump for local dev
  • Include a script to roughly populate the database from data feeds (like the updatecatalog script)

Fourth, there is a lot of crufty customization of our .htaccess files, which are currently actually named .acl.gutenbergweb. Also we have some customization in our Apache config files. I think we could make these available as part of the Jekyll collection, but they are not currently published anywhere.

It would be really nice to include these in the git repo. Otherwise it will be trial and error on my part to figure out how the web server is configured

Finally, there are around a dozen cron jobs, around 100K symlinks, and various other components that keep the site complete and up to date. Most of that stuff is not under revision control and it would be a lot of work to try to get it there.

Maybe this is something I can help with in the future as well 🙂


Ultimately I think I need to start making small contributions so I can learn more. I don't think these challenges are insurmountable; I'll just have to tackle them one piece at a time.

I'm happy to start with the OPDS 2.0 feed, and I'll need a data dump from the DB to do this. Eric or Greg, could you share this with me? Maybe you can just drop it in Google Drive or Dropbox and send me a link, or I could share an SSH key if you want to drop it on a server so I can sftp it.

@gbnewby
Collaborator

gbnewby commented Dec 4, 2024 via email

3 participants