Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(llm-finetuner): Migrate LLM finetuner image from kubernetes-cloud #21

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Eta0
Copy link
Collaborator

@Eta0 Eta0 commented May 18, 2023

LLM Finetuner Container

This re-homes the container for coreweave/kubernetes-cloud's LLM finetuner by copying over its Dockerfile and compiler wrapper as they appeared in commit 6c10019 under the directory finetuner-workflow/finetuner in that repository, with some updates for cross-repository downloading added to the build.

Neat Things About the Build

The coreweave/kubernetes-cloud repository is absolutely massive. At the time of writing, git clone https://github.com/coreweave/kubernetes-cloud thwacks you with a 607 MiB download, primarily comprising nearly 400 MiB of image files under /docs and an almost 200 MiB .git directory.
This is a bit over-the-top to download just a handful of files, so this container's build is configured to do sparse checkouts that reduce the download size 1000x, to a bit under 600 KiB, which is further reduced to just a few dozen kilobytes by deleting the .git directory at the end of the download step.

It's a nice improvement that could be integrated into this repository's sd-finetuner container build as well, which currently leaves that full 600+ MiB repository in its final image.

Weird Things About the Build

Building from a Branch

Branch names can but probably should not be used as commit identifiers for these builds, because Docker may cache the download by the branch's name, which isn't good if the branch has received updates and is expected to be re-downloaded in an updated state. The hash of the latest commit should be used instead.

Coupling

There is currently no default commit defined for the build, and accordingly, no rule to automatically rebuild the image on updates pushed here. The list of files copied during the build process is very specific and doesn't adapt very well between versions of the source.
This could be alleviated a bit by copying over the entire finetuner-workflow/finetuner directory into the final image, but I still see this potentially becoming very annoying to manage between many possible concurrent branches in kubernetes-cloud that could each require distinct build instructions over here, and tracking down corresponding historical changes across two the repositories seems painful.

To make that better, we could work on making the build instructions very generic, like including a version-controlled install.sh (or something) over in kubernetes-cloud and running most of the work in there. Alternatively, the LLM finetuner could have its own repository with this container published in it.

Alternatively, this entire Dockerfile could be left in kubernetes-cloud, versioned with the rest of the source, and we could dynamically download it and build it here in ml-containers from any given commit entirely through a workflow, without any corresponding directory here (or maybe one with only a README). This would cut down on the headache of managing the source in multiple disconnected places while still keeping the container in the central ml-containers repository.

I'd welcome some thoughts on this point.

This copies over the Dockerfile and compiler wrapper from the
LLM finetuner in `coreweave/kubernetes-cloud` as they appeared in commit
6c100190564a05abed79503171638e81e18bdc53 under the directory
`finetuner-workflow/finetuner`, with some updates for cross-repository
downloading added to the build.
@Eta0 Eta0 requested review from wbrown and harubaru May 18, 2023 02:28
Eta0 added a commit to Eta0/ml-containers that referenced this pull request Jun 29, 2023
Based off of coreweave/ml-containers PR coreweave#21, with application-specific
parts removed, and more precompiled DeepSpeed ops and flash-attn
components included.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant