feat(torch): Update PyTorch and CUDA versions #82

Merged: wbrown merged 8 commits into main from es/torch-updates on Oct 14, 2024

Eta0 (Collaborator) commented Oct 11, 2024

PyTorch 2.4.1, CUDA 12.6, & Base Image Updates

This change updates PyTorch (torch, torchvision, torchaudio) to the v2.4.1 family of releases. It also adds a build for CUDA 12.6.1, and uses updated base images for torch:nccl that include HPC-X v2.20 and updated NCCL versions up to v2.23.4-1 (from coreweave/nccl-tests#41).

The CUDA 12.3 build has also been removed, in line with the changes in coreweave/nccl-tests#41, and other older builds that had already been commented out or otherwise turned off have been snipped from the workflow configuration files for clarity.

Compatibility & Fixes

Building PyTorch images against CUDA versions above 12.4 (the highest supported by PyTorch's own distribution) presents a few compatibility issues. torchaudio has been missing support for CUDA 12.5+ for a few months, though unmerged PRs exist to patch it, and some CCCL library components have been rearranged, which breaks the xformers build.

This change applies the patch from pytorch/audio#3811 to the torchaudio build so that it compiles successfully on newer CUDA versions. xformers addressed the CCCL changes in an unlisted change between v0.0.27 and v0.0.28, in commit facebookresearch/xformers@926f410, so updating to the tag v0.0.28.post1 resolves the xformers build failure.
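
The CUDA version threshold described above can be expressed as a small check. This is only an illustrative sketch: the helper names `parse_version` and `needs_torchaudio_patch` are hypothetical, and this description does not say whether the build applies the patch conditionally or to every CUDA version.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '12.6.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_torchaudio_patch(cuda_version: str) -> bool:
    """Return True for CUDA 12.5 and newer, where torchaudio lacks support."""
    return parse_version(cuda_version)[:2] >= (12, 5)


# The CUDA versions that appear in the image tags of this PR.
for cuda in ("12.2.2", "12.4.1", "12.6.1"):
    print(cuda, needs_torchaudio_patch(cuda))
```

Comparing only the major and minor components keeps the check independent of the patch-level release.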

Additionally, the torch:base flavour of images was missing a runtime installation of cuDNN to match the version installed for development. The cuDNN installation steps were extracted into a standalone install_cudnn.sh script, which is shared between torch and torch-extras. torch-extras requires that script to (re-)install the development version of cuDNN prior to compiling Apex, so that the torch:base image can properly make use of Apex's cuDNN-exclusive features (and so that it can compile correctly).

Finally, the torch:nccl flavour of images now avoids updating CUDA library packages via apt-get during the torch-extras builder setup, which ensures compatibility with the pinned versions used in the final image's runtime. More robust management of these package versions in the future could be aided by using the runtime edition of the nvidia/cuda base image instead of the base edition, or at least more extensive use of our CHECK_VERSION shell function to always install compatible versions when upgrading a package from a runtime version to a development version.
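
The idea of always installing a development package at the version of its runtime counterpart can be sketched as follows. This is not the repository's CHECK_VERSION function, whose implementation is not shown here; the function `pinned_dev_specs` and the package names and versions are hypothetical. The only external fact relied on is that `apt-get install` accepts `name=version` to select a specific version.

```python
def pinned_dev_specs(installed: dict, dev_names: dict) -> list:
    """Build `name=version` specs that pin each development package to the
    version of its already-installed runtime counterpart.

    installed: runtime package name -> installed version string
    dev_names: runtime package name -> development package name
    """
    specs = []
    for runtime_name, dev_name in dev_names.items():
        version = installed.get(runtime_name)
        if version is None:
            raise KeyError(f"runtime package {runtime_name!r} is not installed")
        specs.append(f"{dev_name}={version}")
    return specs


# Illustrative names and versions only; they are not taken from the images.
installed = {"libexample-runtime": "1.2.3-1"}
dev_names = {"libexample-runtime": "libexample-dev"}
command = ["apt-get", "install", "-y", *pinned_dev_specs(installed, dev_names)]
print(" ".join(command))
```

Failing loudly when a runtime package is missing avoids silently installing whatever version the package index currently considers newest.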

Eta0 added 2 commits October 11, 2024 16:01
Additionally, this change switches to base images with updated
NCCL (up to v2.23.4-1) and HPC-X (v2.20) versions for torch:nccl.
Eta0 added the enhancement label Oct 11, 2024
Eta0 requested a review from wbrown October 11, 2024 21:35
Eta0 self-assigned this Oct 11, 2024
Eta0 marked this pull request as ready for review October 11, 2024 21:38

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925191
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-nccl-cuda12.4.1-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925191
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-nccl-cuda12.4.1-ubuntu20.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925191
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-nccl-cuda12.6.1-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925191
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-nccl-cuda12.6.1-ubuntu20.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925191
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-nccl-cuda12.2.2-ubuntu20.04-nccl2.21.5-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925191
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-nccl-cuda12.2.2-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925201
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-base-cuda12.4.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925201
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-base-cuda12.4.1-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925201
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-base-cuda12.2.2-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925201
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-base-cuda12.2.2-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925201
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-base-cuda12.6.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11300925201
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9d550f9-base-cuda12.6.1-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11302932838
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9362f03-base-cuda12.6.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11302932838
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9362f03-base-cuda12.4.1-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11302932838
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9362f03-base-cuda12.4.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11302932838
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9362f03-base-cuda12.2.2-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11302932838
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-9362f03-base-cuda12.2.2-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11308948373
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-c1bd46f-nccl-cuda12.4.1-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11308948373
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-c1bd46f-nccl-cuda12.6.1-ubuntu20.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11308948373
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-c1bd46f-nccl-cuda12.4.1-ubuntu20.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-8202be7-nccl-cuda12.4.1-ubuntu20.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-8202be7-nccl-cuda12.4.1-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-8202be7-nccl-cuda12.6.1-ubuntu20.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-8202be7-nccl-cuda12.6.1-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-8202be7-nccl-cuda12.2.2-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-8202be7-nccl-cuda12.2.2-ubuntu20.04-nccl2.21.5-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-8202be7-nccl-cuda12.4.1-ubuntu20.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-8202be7-nccl-cuda12.6.1-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-8202be7-nccl-cuda12.2.2-ubuntu22.04-nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11309714639
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-8202be7-nccl-cuda12.2.2-ubuntu20.04-nccl2.21.5-1-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-e32904e-base-cuda12.4.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-e32904e-base-cuda12.6.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-e32904e-base-cuda12.2.2-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-e32904e-base-cuda12.2.2-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-e32904e-base-cuda12.4.1-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-e32904e-base-cuda12.4.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-e32904e-base-cuda12.6.1-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-e32904e-base-cuda12.2.2-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-e32904e-base-cuda12.2.2-ubuntu20.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch:es-torch-updates-e32904e-base-cuda12.6.1-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-e32904e-base-cuda12.4.1-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1

@Eta0 Build complete, success: https://github.com/coreweave/ml-containers/actions/runs/11312731459
Image: ghcr.io/coreweave/ml-containers/torch-extras:es-torch-updates-e32904e-base-cuda12.6.1-ubuntu22.04-torch2.4.1-vision0.19.1-audio2.4.1
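
The image tags in the build comments above follow a visibly regular layout, so they can be split into components when comparing builds. The sketch below infers that layout from the tags listed in this thread only; it is not a documented naming scheme, and the pattern, field names, and `parse_image` helper are assumptions made for illustration.

```python
import re

# Layout inferred from the tags in this thread:
# <branch>-<commit>-<flavour>-cuda<X>-ubuntu<Y>[-nccl<Z>]-torch<A>-vision<B>-audio<C>
TAG_PATTERN = re.compile(
    r"^(?P<branch>.+)-(?P<commit>[0-9a-f]{7})-(?P<flavour>base|nccl)"
    r"-cuda(?P<cuda>[\d.]+)-ubuntu(?P<ubuntu>[\d.]+)"
    r"(?:-nccl(?P<nccl>[\d.]+-\d+))?"
    r"-torch(?P<torch>[\d.]+)-vision(?P<vision>[\d.]+)-audio(?P<audio>[\d.]+)$"
)


def parse_image(reference: str) -> dict:
    """Split 'repository:tag' into the repository and the tag's components."""
    repository, _, tag = reference.rpartition(":")
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag layout: {tag!r}")
    return {"repository": repository, **match.groupdict()}


info = parse_image(
    "ghcr.io/coreweave/ml-containers/torch:"
    "es-torch-updates-9d550f9-nccl-cuda12.4.1-ubuntu22.04-"
    "nccl2.23.4-1-torch2.4.1-vision0.19.1-audio2.4.1"
)
print(info["flavour"], info["cuda"], info["ubuntu"], info["nccl"], info["torch"])
```

For torch:base images the `nccl` field is `None`, since those tags carry no NCCL version.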

Eta0 (Collaborator, Author) commented Oct 14, 2024

These builds are currently passing:

The commit difference between the two was an extra step added to support torch:base, which should have no effect on torch:nccl. These were manually run workflows because the torch-extras CI pipeline would have started from the latest images on main, whereas it needed to start from scratch in this branch for accurate results.

Eta0 added the bug label Oct 14, 2024
wbrown merged commit d7253c0 into main Oct 14, 2024
26 of 28 checks passed
wbrown deleted the es/torch-updates branch October 14, 2024 19:53