feat(torch): Update PyTorch and CUDA versions #82
Additionally, this change switches to base images with updated NCCL (up to v2.23.4-1) and HPC-X (v2.20) versions for torch:nccl.
@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925191
@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925201
@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11302932838
Compatibility fixed in v0.0.28 in the following commit: facebookresearch/xformers@926f410
@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11308948373
@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
These builds are currently passing: The commit difference between the two was adding an extra step to support |
PyTorch 2.4.1, CUDA 12.6, & Base Image Updates
This change updates PyTorch (`torch`, `torchvision`, `torchaudio`) to the v2.4.1 family of releases. It also adds a build for CUDA 12.6.1, and uses updated base images for `torch:nccl` that include HPC-X v2.20 and updated NCCL versions up to v2.23.4-1 (from coreweave/nccl-tests#41).

The CUDA 12.3 build has also been removed, in line with the changes in coreweave/nccl-tests#41, and other older builds that had already been commented out or otherwise turned off have been snipped from the workflow configuration files for clarity.
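As a rough way to sanity-check a built image against the versions named above, the sketch below compares a reported version string with an expected release. The helper name `in_family` is hypothetical and is not part of this repository; it only assumes that a build may append a local suffix such as `+cu126` to the release number.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of the repository.
# Succeeds when the reported version equals the expected release, or is
# that release followed by a local "+suffix" (for example "2.4.1+cu126").
in_family() {
    reported="$1"
    family="$2"
    case "$reported" in
        "$family" | "$family"+*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Inside an image, this could be used as `in_family "$(python3 -c 'import torch; print(torch.__version__)')" 2.4.1`, and similarly with `torch.version.cuda` to check the CUDA version the build was compiled against.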
Compatibility & Fixes
Building PyTorch images against CUDA versions above 12.4 (the highest supported by their own distribution) presents a few compatibility issues. `torchaudio` has been missing support for CUDA 12.5+ for a few months, though unmerged PRs exist to patch it, and some CCCL library components have been rearranged, breaking the `xformers` build.

This adds a patch to the `torchaudio` build from pytorch/audio#3811 to compile successfully on newer CUDA versions. `xformers` addressed CCCL changes in an unlisted change between v0.0.27 and v0.0.28 in commit facebookresearch/xformers@926f410, so updating to the tag `v0.0.28.post1` resolves the issues with it.

Additionally, the `torch:base` flavour of images was missing a runtime installation of cuDNN to match the version installed for development. The install script was extracted to an external `install_cudnn.sh` script which is shared between `torch` and `torch-extras`. `torch-extras` requires that script to (re-)install the development version of cuDNN prior to compiling Apex, so that the `torch:base` image can properly make use of Apex's cuDNN-exclusive features (and so that it can compile correctly).

Finally, the `torch:nccl` flavour of images now avoids updating CUDA library packages via `apt-get` during the `torch-extras` builder setup, which ensures compatibility with the pinned versions used in the final image's runtime. More robust management of these package versions in the future could be aided by using the `runtime` edition of the `nvidia/cuda` base image instead of the `base` edition, or at least more extensive use of our `CHECK_VERSION` shell function to always install compatible versions when upgrading a package from a runtime version to a development version.
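To illustrate the idea of installing a development package at the version of its already-installed runtime counterpart, here is a minimal sketch. It is not the repository's `CHECK_VERSION` function, whose behaviour is not shown in this description; `pin_spec`, `installed_version`, and the `libfoo` package names are hypothetical, and the sketch assumes the runtime and development packages share the same version string.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not the repository's CHECK_VERSION.

# Print "<dev_pkg>=<version>", the form apt-get accepts for an exact pin.
pin_spec() {
    dev_pkg="$1"
    version="$2"
    # Refuse to emit an unpinned spec if the version lookup came back empty.
    [ -n "$version" ] || return 1
    printf '%s=%s\n' "$dev_pkg" "$version"
}

# Look up the installed version of a runtime package with dpkg-query.
installed_version() {
    dpkg-query -W -f='${Version}' "$1" 2>/dev/null
}

# Example (package names are placeholders):
#   spec="$(pin_spec libfoo-dev "$(installed_version libfoo)")" &&
#       apt-get install -y --no-install-recommends "$spec"
```

Failing when no runtime version is found, rather than falling back to an unpinned install, keeps the builder from silently pulling a newer library than the one pinned in the final image.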