
Add TransformerEngine to PT 2.0 training images #3315

Merged: 40 commits merged into aws:master on Sep 26, 2023
Conversation

arjkesh (Contributor) commented Sep 7, 2023:

GitHub Issue #, if available:

Note:

  • If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.

  • All PRs are checked weekly for staleness. This PR will be closed if not updated within 30 days.

Description

  • Add transformer engine and flash attention support to CU121 images
  • Add associated tests on heavy instance types
  • Add CUDNN (required dependency of transformer engine)
  • Add future test to match CUDNN versions in torch/dlc
  • Patch requirements in existing DLC
  • Add the NCCL_ASYNC_ERROR_HANDLING=1 environment variable

Tests run

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aws-deep-learning-containers-ci bot added labels on Sep 7, 2023: build, ec2, pytorch, Size:S, test
roywei (Contributor) left a comment:

Let's also set the env var NCCL_ASYNC_ERROR_HANDLING=1 per customer request; this will make sure PyTorch errors out properly during distributed training.
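In the DLC the variable is baked in via a Dockerfile ENV instruction, so user scripts do not need to set it themselves. A minimal sketch of how a training script would use it (the `init_distributed` helper is illustrative, not code from this PR) might look like:

```python
import os

# NCCL_ASYNC_ERROR_HANDLING=1 tells PyTorch to surface asynchronous NCCL
# errors and abort stuck collectives instead of hanging the whole job.
# It must be in the environment before init_process_group is called.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")

def init_distributed():
    # Hypothetical helper: torch is imported lazily so the env var above
    # is guaranteed to be set first.
    import torch.distributed as dist
    dist.init_process_group(backend="nccl")
```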

arjkesh (Contributor, Author) commented Sep 19, 2023:

/rerun

@arjkesh arjkesh marked this pull request as ready for review September 26, 2023 01:15
@arjkesh arjkesh requested a review from a team as a code owner September 26, 2023 01:15
roywei previously approved these changes Sep 26, 2023
# Install flash attn and NVIDIA transformer engine
RUN MAX_JOBS=4 pip install flash-attn==2.0.4 --no-build-isolation
RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@release_v0.12
ENV NCCL_ASYNC_ERROR_HANDLING=1
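A hypothetical smoke check (not part of the PR) for verifying that the packages installed by the Dockerfile lines above are importable inside the image could be sketched as follows; the module names are assumed from the pip commands:

```python
import importlib.util

def installed(module_name: str) -> bool:
    """Return True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Assumed import names for flash-attn and TransformerEngine.
for mod in ("flash_attn", "transformer_engine"):
    print(f"{mod}: {'ok' if installed(mod) else 'missing'}")
```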
Review comment (Contributor):

This is already defined on line 63.

arjkesh (Contributor, Author) replied:

Ack, this was added because of a different review comment; will remove.

pytorch_training, ec2_connection, region, gpu_only, ec2_instance_type, pt21_and_above_only
):
"""
PT 2.1 reintroduces a dependency on CUDNN to support NVDA TransformerEngine. This test is to ensure that torch CUDNN matches system CUDNN in the container.
Review comment (Contributor):

There is no PT 2.1 yet.

arjkesh (Contributor, Author) replied:

There is no PT 2.1 yet; this is an anticipatory test we are adding to ensure that the torch binaries are compiled with the same cuDNN version as exists in the container.

).stdout.split()[-1]

cudnn_from_torch = ec2_connection.run(
f"nvidia-docker exec --user root {container_name} python -c 'from torch.backends import cudnn; print(cudnn.version())'",
Review comment (Contributor):

This cuDNN version comes from PyTorch itself, not from the OS package install, right?

arjkesh (Contributor, Author) replied:

This cuDNN represents the version that torch is compiled with, not the DLC's cuDNN version; torch essentially links statically against cuDNN. While slightly different versions at compile time vs. in the system don't appear to be a big issue, this test is being added for future safety so that the versions don't go out of sync.
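The comparison the test performs can be sketched as below. `torch.backends.cudnn.version()` returns an integer encoded as major*1000 + minor*100 + patch (e.g. 8905 for 8.9.5), while a system query may yield a dotted string; the helper names here are illustrative, not the actual test code:

```python
def parse_cudnn_version(raw: str) -> tuple:
    """Normalize a cuDNN version ('8905' or '8.9.5') to a (major, minor, patch) tuple."""
    if "." in raw:
        return tuple(int(part) for part in raw.split("."))
    # Integer form from torch.backends.cudnn.version(): major*1000 + minor*100 + patch
    n = int(raw)
    return (n // 1000, (n % 1000) // 100, n % 100)

def versions_match(torch_cudnn: str, system_cudnn: str) -> bool:
    """True if torch's compile-time cuDNN matches the cuDNN installed in the image."""
    return parse_cudnn_version(torch_cudnn) == parse_cudnn_version(system_cudnn)
```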

test/dlc_tests/ec2/test_transformerengine.py (comment thread resolved)
@arjkesh arjkesh merged commit 5241309 into aws:master Sep 26, 2023
37 checks passed