(WIP) Adding docker and docker-compose #43
base: docker
@eshellman @gbnewby mind taking a look at this next when you have time? I updated the PR description to describe the motivation for the change, rebased the branch so the diff is much smaller, and updated the README to include updated instructions. Thank you 🙏 |
Are you proposing to use this as a dev environment? A test environment? Docker is a good solution for certain types of deployment, but I personally find it cumbersome as a dev environment. In a dev environment you want to make the modify-run cycle as simple as possible, but this makes it more complicated.
For the gutenberg website, we'd love some better automation for deployments. But this is missing autocat3 and the database (and Python, PHP, and the file systems).
Easy working with branches, for example; as it is, switching branches means editing the Dockerfile? Shouldn't you have a shared mapping so you can easily inspect the built pages without going through the docker CLI?
I'd like to see Greg have a more robust and direct dev/test/deploy workflow for the pieces that are in this repo. He can describe what he does (if he has time). Adding the GitHub Actions for testing/CI is part of this, but it can be cumbersome in development.
I'm not a rubyist either, but I'm pretty sure that environment management tools exist; there's all sorts of ruby stuff on my machine and they don't conflict at all.
Finally, getting "stronger guarantees of reproducible builds" won't help much if the deployment environment is not controlled in the same way; I'm not sure that's possible for us. Ruby is pretty ubiquitous; you can't really use it for containerization.
I have a dockerized ebookmaker that I've used for massive builds in the cloud. I get annoyed at it when debugging things that don't work. Are you using Docker yourself for gutenbergsite?
I guess my bottom line is I'd like to better understand the use case. Who would use it and what would they want to do with it? |
I am proposing adding Docker as a local dev environment.
Agreed, I think this does increase complexity because I did not consider the database or PHP. I really only focused on making it easier to run Jekyll, which would make it harder to run everything together. Regarding your questions:
No, you don't need to edit the Dockerfile in this setup. In the docker-compose.yml file there is a volume bind mount that mounts the working directory into the container, so a
Yes, I can also add a bind mount volume for the built pages so they are visible in the host file system without going through the container, as you mentioned.
I am starting to because I am still learning and getting comfortable with the project structure.
I think that developers who want to contribute to Project Gutenberg could use Docker to run the site locally with as little setup as possible. With Docker we really should be able to make it as easy as running
My real motivation
My larger motivation is to improve Project Gutenberg on mobile. I live in a country where most people don't have computers and their only access to the internet is through a mobile device over slow network connections. I think that these are the places where people could really benefit from literary access that their local governments otherwise don't provide. Right now I am really just trying to understand how to run the project better, and I think that I can add tooling to make it easier for anyone who comes after me.
Improvements
I am going to mark this PR as a draft for now and make some more improvements. How does this sound?
This should get it closer to running the full site, including PHP and the database, and it should still be one command ( Any links are appreciated and of course I'll do my homework and start reading 🙏 |
That sounds great! We agree work is needed. A "WIP" label would be better than draft. To increase availability, I suggest we create a branch in this repo for this work, you can target that branch with a PR. Hopefully anyone with interest can then have access to that branch and propose improvements. |
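@eshellman I can't add labels to the PR or create branches in this repo because I am not a collaborator on this repository. I'm happy to be added as a collaborator but I'm sure you'd like to see more from me first 🙂
Could you please add the WIP label for me and create a new branch? We could call the branch experiment/docker or exp/docker. Once the branch is created I'll update this PR to point to it.
Regarding the database
I also saw that the database is managed here gutenbergtools/pgdb and that this PR is outstanding. Are the schemas for the database otherwise up to date?
To support being able to run the entire Gutenberg site in a single docker compose up command I will need to create a container for the DB. I'm thinking I can add a Dockerfile to gutenbergtools/pgdb and GitHub Actions to automatically build the database container (without data, but applying all migrations) and publish the database container to the project's GitHub Container Registry on merges into the main branch. |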
Hi @dawson. I am not seeing how your goal of creating the entire site in a
Docker container is viable.
I do think it's viable to have the pages generated by Jekyll - you have
demonstrated that. Here is an overview of what else is involved in having a
fully functional copy of the www.gutenberg.org website:
First, the two sets of content, currently including 7.7M files in nearly
3TB. This cannot go to github: it's too much content, and also a
significant portion of the content (around half) is updated every month.
See the "mirroring how-to" at https://www.gutenberg.org for details.
Second, the app server. As you might have seen, Jekyll only manages around
70 pages. There are nearly 75000 other "pages" that are dynamically
generated by autocat3 from the database. Those pages are cached by our web
infrastructure, but otherwise they do not exist as static files. autocat3
generates many of the pages for things like latest books, top 100, etc.
dynamically.
Third, the catalog database psql dump is just under 1GB, and changes
multiple times per day. I don't know how you will plan on distributing and
ingesting it. It's not currently publicly available, but it could be made
publicly available. I don't want to take responsibility for putting it into
github, though we could publish it.
Fourth, there is a lot of crufty customization of our .htaccess files,
which are currently actually named .acl.gutenbergweb. Also we have some
customization in our Apache config files. I think we could make these
available as part of the Jekyll collection, but they are not currently
published anywhere.
Finally, there are around a dozen cron jobs, around 100K symlinks, and
various other components that keep the site complete and up to date. Most
of that stuff is not under revision control and it would be a lot of work
to try to get it there. As an example, our "top 100" page ultimately comes
from some daily jobs that parse the Apache logs and make entries of
download counts to the database. Many of these jobs were set up over 10
years ago by long-departed personnel, and we have no documentation and
little inspiration to make it suitable to run in arbitrary environments.
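As a rough illustration of the kind of daily job described above (parse the Apache logs, tally download counts per book), here is a minimal Python sketch. It is not the actual job, which is undocumented. It assumes the logs use the Apache common/combined format and that a download URL carries the numeric book id right after /ebooks/ or /files/. Both the URL pattern and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: Apache common/combined log format, where the request line is
# quoted and followed by the status code, e.g. "GET /ebooks/1342 HTTP/1.1" 200
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

# Assumption: the numeric book id follows /ebooks/ or /files/ in the URL.
# The real URL layout on the production site may differ.
BOOK_ID_RE = re.compile(r"^/(?:ebooks|files)/(?P<book_id>\d+)\b")


def count_downloads(log_lines):
    """Return a Counter mapping book id -> number of successful GET requests."""
    counts = Counter()
    for line in log_lines:
        request = REQUEST_RE.search(line)
        # Skip lines that are not GET requests or did not succeed.
        if request is None or request.group("status") != "200":
            continue
        book = BOOK_ID_RE.match(request.group("path"))
        if book is not None:
            counts[int(book.group("book_id"))] += 1
    return counts


def top_n(counts, n=100):
    """Return the n most requested books as (book_id, count) pairs."""
    return counts.most_common(n)
```

In a real job the resulting counts would be written to the catalog database rather than returned; that step is left out here because the schema is not described in this thread.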
I realize these are large challenges, and it's why we haven't tried to make
the entire website portable. There are opportunities to do some work to
address the last two paragraphs above, for example, but the substantial
size and number of files for the first, second and third items mean that it
will always be a very big task to stand up a copy.
For your interest in an app, what I'm thinking is you really need an API to
access what is already provided by www.gutenberg.org. In fact we have an
outdated version of OPDS running, and we hope to upgrade to the new version
soon (Eric might have a prognosis on this).
I hope this long explanation is helpful.
…On Tue, Dec 3, 2024 at 4:12 AM Dawson ***@***.***> wrote:
@eshellman <https://github.com/eshellman> I can't add labels to the PR or
create branches in this repo because I am not a collaborator on this
repository. I'm happy to be added as a collaborator but I'm sure you'd like
to see more from me first 🙂
Could you please add the WIP label for me and create a new branch? We
could call the branch experiment/docker or exp/docker. Once the branch is
created I'll update this PR to point to it.
Regarding the database
I also saw that the database is managed here gutenbergtools/pgdb
<https://github.com/gutenbergtools/pgdb> and that this PR is outstanding
<gutenbergtools/pgdb#4>. Are the schemas for the
database otherwise up to date?
To support being able to run the entire Gutenberg site in a single docker
compose up command I will need to create a container for the DB. I'm
thinking I can add a Dockerfile to gutenbergtools/pgdb
<https://github.com/gutenbergtools/pgdb> and Github Actions to
automatically build the database container (without data, but applying all
migrations) and publish the database container to the project's Github
Container Registry
<https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry>
on merges into the main branch.
|
It might be useful to create a "toy" version of the website in a docker container. With data and files for ~10 books. Still a huge project to bring together the missing pieces. Might be best to attempt it in a separate repo. |
Since Greg mentioned it, if you know some python, creating an OPDS 2.0 feed based on the existing OPDS feed (a pre-1.0 version) might be interesting for you. |
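For orientation, an OPDS 2.0 feed is a JSON document, unlike the Atom/XML used by OPDS 1.x. The sketch below builds a tiny feed as a Python dict. It is only a sketch of the general shape (feed-level metadata and links, plus a list of publications with an acquisition link), not how autocat3 does or should generate it. The function name, the keys of the input `book` dicts, and the example.org URLs are all hypothetical, and the field names should be checked against the OPDS 2.0 specification before relying on them.

```python
import json


def build_opds2_feed(title, self_href, books):
    """Build an OPDS 2.0 style feed as a plain, JSON-serializable dict.

    `books` is a list of dicts with hypothetical keys: title, author,
    identifier, language, epub_href, cover_href.
    """
    publications = []
    for book in books:
        publications.append({
            "metadata": {
                "@type": "http://schema.org/Book",
                "title": book["title"],
                "author": book["author"],
                "identifier": book["identifier"],
                "language": book["language"],
            },
            # Each publication needs at least one acquisition link.
            "links": [{
                "rel": "http://opds-spec.org/acquisition/open-access",
                "href": book["epub_href"],
                "type": "application/epub+zip",
            }],
            "images": [{"href": book["cover_href"], "type": "image/jpeg"}],
        })
    return {
        "metadata": {"title": title},
        "links": [{
            "rel": "self",
            "href": self_href,
            "type": "application/opds+json",
        }],
        "publications": publications,
    }


if __name__ == "__main__":
    sample = {
        "title": "Pride and Prejudice",
        "author": "Jane Austen",
        "identifier": "urn:example:ebook:1342",  # placeholder identifier
        "language": "en",
        "epub_href": "https://example.org/ebooks/1342.epub",  # placeholder
        "cover_href": "https://example.org/covers/1342.jpg",  # placeholder
    }
    feed = build_opds2_feed("Sample catalog", "https://example.org/opds2", [sample])
    print(json.dumps(feed, indent=2))
```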
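I'm happy to take on the OPDS 2.0 feed. I think it will be a good opportunity for me to get an understanding of autocat3. It will probably take me a couple of days to get comfortable making changes in the project.
Regarding containerizing the site, my goal would be a toy version of the site with limited data to run locally to support local development. I want it to be really easy to start working on Project Gutenberg. I have some ideas about how to address Greg's concerns, which I'll list below.
"First, the two sets of content, currently including 7.7M files in nearly 3TB"
Locally I would look into running an FTP or HTTP proxy. This way you don't need to download all of the files. The files would be pulled and cached on demand without having to update any of the links in the site.
"Second, the app server. As you might have seen, Jekyll only manages around 70 pages. There are nearly 75000 other "pages" that are dynamically generated by autocat3 from the database"
I'll need to containerize autocat3 in the future as well, and have it run alongside the Jekyll and Apache containers. I am not familiar with autocat3 yet so I am not sure how challenging this will be, but I should have a better understanding of this once I am done implementing OPDS 2.0.
"Third, the catalog database psql dump is just under 1GB, and changes multiple times per day. I don't know how you will plan on distributing and ingesting it"
It would be great to get a copy of this for my own local work. As far as making this public, I agree that this shouldn't go in git. It could be hosted for download, but I also found garethbjohnson/gutendex, which includes an updatecatalog script that populates the database from RDF data. We could either
- Publish the database dump + a slimmed down DB dump for local dev
- Include a script to roughly populate the database from data feeds (like the updatecatalog script)
"Fourth, there is a lot of crufty customization of our .htaccess files, which are currently actually named .acl.gutenbergweb. Also we have some customization in our Apache config files."
It would be really nice to include these in the git repo. Otherwise it will be trial and error on my part to figure out how the web server is configured.
"Finally, there are around a dozen cron jobs, around 100K symlinks, and various other components that keep the site complete and up to date."
Maybe this is something I can help with in the future as well 🙂
Ultimately I think I need to start making small contributions so I can learn more. I don't think these challenges are insurmountable, I'll just have to tackle them one piece at a time.
I'm happy to start with the OPDS 2.0 feed, and I'll need some data dump from the DB to do this. Eric or Greg, could you share this with me? Maybe you can just drop it in Google Drive or Dropbox and send me a link, or I could share an SSH key if you want to drop it on a server so I can sftp it. |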
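The "pulled and cached on demand" idea for the book files mentioned in this comment can be sketched in a few lines of Python. This is only an illustration of the caching logic, not a proposal for the actual proxy (a real setup would more likely use an existing caching HTTP proxy). The class name, the cache directory layout, and the injectable `fetch` function are all hypothetical.

```python
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_from_upstream(upstream_base, relative_path):
    """Download one file from the upstream site and return its bytes."""
    with urlopen(urljoin(upstream_base, relative_path)) as response:
        return response.read()


class OnDemandCache:
    """Serve files from a local directory, fetching each one upstream once."""

    def __init__(self, cache_dir, upstream_base, fetch=fetch_from_upstream):
        self.cache_dir = Path(cache_dir).resolve()
        self.upstream_base = upstream_base
        self.fetch = fetch  # injectable so the logic can be tested offline

    def get(self, relative_path):
        local = (self.cache_dir / relative_path.lstrip("/")).resolve()
        # Refuse paths such as "../x" that would escape the cache directory.
        if self.cache_dir not in local.parents:
            raise ValueError(f"path escapes cache directory: {relative_path}")
        if local.is_file():
            return local.read_bytes()  # cache hit: no network access
        data = self.fetch(self.upstream_base, relative_path)
        local.parent.mkdir(parents=True, exist_ok=True)
        local.write_bytes(data)  # cache miss: store for next time
        return data
```

Because the site's links stay unchanged, only the files a developer actually opens would ever be downloaded, which is the point of the proposal given the size of the full collection.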
This all sounds pretty good.
I put a database dump at https://petascale.org/tmp/2024-12-03.pg_dump
(1130754856 bytes, md5sum 6d12c3d104ac0439b1fefd03cb714680)
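Since the size and MD5 checksum of the dump are given, the download can be checked before it is restored. A minimal Python sketch follows. The function names are made up for illustration, and the local file name in the usage note assumes the file is saved under the name from the URL.

```python
import hashlib
from pathlib import Path

# Values quoted in the message above.
EXPECTED_SIZE = 1130754856
EXPECTED_MD5 = "6d12c3d104ac0439b1fefd03cb714680"


def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file without loading it all in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_size, expected_md5):
    """Return True if the file has the expected size and MD5 checksum."""
    path = Path(path)
    # Check the size first: it is cheap and catches truncated downloads.
    if path.stat().st_size != expected_size:
        return False
    return file_md5(path) == expected_md5.lower()
```

Usage would be along the lines of `verify_download("2024-12-03.pg_dump", EXPECTED_SIZE, EXPECTED_MD5)`. MD5 is used here only because that is the checksum provided; it guards against a corrupted transfer, not against tampering.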
I put our current .htaccess file at https://petascale.org/tmp/htaccess-pg
I'm happy to provide you all the details you could want about how things
work, and copies of things that aren't under git.
Your current plan of looking at OPDS, and looking at autocat3, seems sound.
Once you have autocat3 running behind Apache, along with the Jekyll-built
pages, you will have the bulk of the website.
You don't need most of the random cron jobs, because most of them result in
updates to the catalog database.
There are a relatively small number of browsing pages that are output from
those jobs. My hope is to retire them, in favor of pages generated from
autocat3. A static copy, including for latest_titles.html, might suffice.
Thanks! Greg
…On Tue, Dec 3, 2024 at 8:44 PM Dawson ***@***.***> wrote:
I'm happy to take on the OPDS 2.0 feed. I think it will be a good
opportunity for me to get an understanding of autocat3. It will probably
take me a couple days to get comfortable making changes in the project.
Regarding containerizing the site, my goal would be a toy version of the
site with limited data to run locally to support local development. I want
it to be really easy to start working on Project Gutenberg. I have some
ideas about how to address Greg's concerns that I'll list below
First, the two sets of content, currently including 7.7M files in nearly
3TB
Locally I would look into running an FTP or HTTP proxy. This way you don't
need to download all of the files. The files would be pulled and cached on
demand without having to update any of the links in the site.
Second, the app server. As you might have seen, Jekyll only manages around
70 pages. There are nearly 75000 other "pages" that are dynamically
generated by autocat3 from the database
I'll need to containerize autocat3 in the future as well, and have it run
alongside the Jekyll and Apache containers. I am not familiar with autocat3
yet so I am not sure how challenging this will be, but I should have a
better understanding of this once I am done implementing OPDS 2.0
Third, the catalog database psql dump is just under 1GB, and changes
multiple times per day. I don't know how you will plan on distributing and
ingesting it
It would be great to get a copy of this for my own local work. As far as
making this public, I agree that this shouldn't go in git. It could be
hosted for download, but I also found garethbjohnson/gutendex
<https://github.com/garethbjohnson/gutendex> that includes an
updatecatalog
<https://github.com/garethbjohnson/gutendex/blob/master/books/management/commands/updatecatalog.py>
script that populates the database from RDF data.
We could either
- Publish the database dump + a slimmed down DB dump for local dev
- Include a script to roughly populate the database from data feeds
(like the updatecatalog script)
Fourth, there is a lot of crufty customization of our .htaccess files,
which are currently actually named .acl.gutenbergweb. Also we have some
customization in our Apache config files. I think we could make these
available as part of the Jekyll collection, but they are not currently
published anywhere.
It would be really nice to include these in the git repo. Otherwise it
will be trial and error on my part to figure out how the web server is
configured
Finally, there are around a dozen cron jobs, around 100K symlinks, and
various other components that keep the site complete and up to date. Most
of that stuff is not under revision control and it would be a lot of work
to try to get it there.
Maybe this is something I can help with in the future as well 🙂
------------------------------
Ultimately I think I need to start making small contributions so I can
learn more. I don't think these challenges are insurmountable, I'll just
have to tackle it one piece at a time.
I'm happy to start with the OPDS 2.0 feed, and I'll need some data dump
from the DB to do this. Eric or Greg could you share this with me? Maybe
you can just drop it in Google Drive or Drop box and send me a link, or I
could share an SSH key if you want to drop it on a server so I can sftp it.
|
This PR adds support for running the Project Gutenberg site in a Linux container using Docker and docker-compose (or any alternative OCI-compliant tooling, e.g. Podman) to make it easier to start working on the site.
Running the site is now as simple as
Then open http://localhost:4000 to see the site!
Motivation
I am not a Ruby native and do not have Ruby, Bundler or Jekyll installed on my local machine. To start working on Project Gutenberg I need to install tools directly onto my host machine and may end up using a different version of Ruby, Bundler, libssl (OpenSSL), libxml, etc. All of these differences could result in different behavior when I try to build the site.
The solution to this is to support running the site in a Linux container and version controlling the Linux container build config as a Dockerfile. This allows us to describe the environment used to build the site in source control and gives us stronger guarantees of reproducible builds.
I also think that Docker is much more common than Ruby, and it is more likely someone looking to contribute to Project Gutenberg will have Docker installed than Ruby.
In the 2024 Stack Overflow Developer Survey only 5.2% of respondents reported doing extensive work in Ruby over the past year, while 53.9% of respondents reported using Docker for extensive development work in the past year. In fact, Docker was the most popular tooling listed.