This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix nightly CD for GPU builds #18205

Merged
merged 4 commits into from
Apr 30, 2020

Conversation

mseth10
Contributor

@mseth10 mseth10 commented Apr 30, 2020

Description

This PR fixes the nightly CD GPU tests by switching the build toolchain to the CMake static build and updating the stash location of the DNNL headers. It removes the 7.5 CUDA arch from the cu100, cu101, and cu102 builds to work around the oversized libmxnet.so binary produced by CMake builds.
It also fixes the packaging of DNNL headers into the nightly build artifacts. Fixes #18120

Here's the link to a broken CD pipeline: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/1024/pipeline/

Commands used to reproduce and test the fix:

alias python=python3

git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet
pip3 install -r ci/requirements.txt --user
# make changes to mxnet code

# install docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.25.5/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose

# build on c5.18xl
python ci/build.py --platform centos7_gpu_cu100 /work/runtime_functions.sh build_static_libmxnet cu100
# test on g3.8xl
python ci/build.py --docker-registry mxnetci --nvidiadocker --platform centos7_gpu_cu100 --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh cd_unittest_ubuntu cu100

@mxnet-bot

Hey @mseth10, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [edge, miscellaneous, unix-cpu, centos-gpu, website, sanity, windows-gpu, unix-gpu, clang, centos-cpu, windows-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 changed the title from "Fix nightly cd" to "Fix nightly CD for GPU builds" Apr 30, 2020
@szha szha requested a review from leezu April 30, 2020 16:37
@@ -33,4 +33,4 @@ set(USE_F16C OFF CACHE BOOL "Build with x86 F16C instruction support")
set(USE_LIBJPEG_TURBO ON CACHE BOOL "Build with libjpeg-turbo")

set(CUDACXX "/usr/local/cuda-10.0/bin/nvcc" CACHE STRING "Cuda compiler")
-set(MXNET_CUDA_ARCH "3.0;5.0;6.0;7.0;7.5" CACHE STRING "Cuda architectures")
+set(MXNET_CUDA_ARCH "3.0;5.0;6.0;7.0" CACHE STRING "Cuda architectures")
Contributor

Is dropping support for the 7.5 CUDA arch favorable?
For example, the Tesla T4 [G4 instances] has CUDA arch 7.5.
@leezu What do you think?

Contributor Author

It's a temporary fix to get the CD working. I checked the builds for cu100 and cu102, and both fail because of binary size issues. We should work on adding back the 7.5 arch after making sure the build works.

Contributor

Sidenote: 7.0 binaries also run on 7.5
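
The side note rests on CUDA's binary compatibility rule: device code compiled for compute capability X.y runs on any GPU with the same major version X and a minor version of at least y. A minimal sketch of that check against an arch list in the `MXNET_CUDA_ARCH` format follows. The function names are hypothetical, and the sketch considers only compiled binaries, not PTX that the driver could JIT-compile for newer architectures.

```python
def parse_arch_list(arch_string):
    """Parse a semicolon-separated list such as "3.0;5.0" into (major, minor) tuples."""
    archs = []
    for item in arch_string.split(";"):
        major, minor = item.strip().split(".")
        archs.append((int(major), int(minor)))
    return archs


def binary_runs_on(built, device):
    """Binary compatibility: same major version, device minor >= built minor."""
    return built[0] == device[0] and built[1] <= device[1]


def device_is_covered(arch_string, device):
    """Return True if any arch in the list yields a binary the device can run."""
    return any(binary_runs_on(built, device) for built in parse_arch_list(arch_string))


T4 = (7, 5)  # Tesla T4 compute capability, as mentioned in the review comment

print(device_is_covered("3.0;5.0;6.0;7.0", T4))  # True: the 7.0 binary runs on 7.5
print(device_is_covered("3.0;5.0;6.0", T4))      # False: no binary with major version 7
```

Under this rule the reduced arch list still covers the T4, although code built for 7.0 would not use any 7.5-specific features.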

-if platform.system() == 'Darwin':
-    shutil.copytree(os.path.join(CURRENT_DIR, 'mxnet-build/3rdparty/mkldnn/build/install/include'),
-                    os.path.join(CURRENT_DIR, 'mxnet/include/mkldnn'))
+shutil.copytree(os.path.join(CURRENT_DIR, 'mxnet-build/3rdparty/mkldnn/build/install/include'),
Contributor

@ChaiBapchya ChaiBapchya Apr 30, 2020

Curious about two things:

  1. Previously, when nightly CD was working, why was the mkldnn include copied only for Darwin?
  2. Now, to fix nightly CD, why does this need to be done for all OSes?

Contributor Author

This platform condition was added by mistake in some earlier commit. We are fixing that here. No reason to package dnnl header files for only Darwin.
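
To illustrate the effect of removing the condition, here is a minimal sketch that stages the headers on every platform. It uses a temporary directory in place of the real `CURRENT_DIR` tree. The helper name `stage_dnnl_headers` and the placeholder header file are assumptions for illustration, not the actual packaging code.

```python
import os
import shutil
import tempfile


def stage_dnnl_headers(current_dir):
    """Copy the installed DNNL headers into the package tree (hypothetical helper)."""
    src = os.path.join(current_dir, 'mxnet-build/3rdparty/mkldnn/build/install/include')
    dst = os.path.join(current_dir, 'mxnet/include/mkldnn')
    # No platform.system() check: the headers are copied on every OS.
    # copytree creates dst and any missing parent directories.
    shutil.copytree(src, dst)
    return dst


with tempfile.TemporaryDirectory() as current_dir:
    src = os.path.join(current_dir, 'mxnet-build/3rdparty/mkldnn/build/install/include')
    os.makedirs(src)
    # Placeholder file standing in for the real DNNL install output.
    with open(os.path.join(src, 'dnnl.h'), 'w') as header:
        header.write('/* placeholder */\n')

    dst = stage_dnnl_headers(current_dir)
    print(sorted(os.listdir(dst)))  # ['dnnl.h']
```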

@ChaiBapchya
Contributor

Also @mseth10, do you mind updating the description [removing extraneous content] and adding the Jenkins pipeline build where you tested it out, or the command you used to reproduce and test this fix? Thanks

@mseth10
Contributor Author

mseth10 commented Apr 30, 2020

@ChaiBapchya updated the description. Check it out.

Contributor

@leezu leezu left a comment

Thanks @mseth10

@leezu leezu merged commit 0b46d90 into apache:master Apr 30, 2020
@leezu
Contributor

leezu commented Apr 30, 2020

This was referenced May 18, 2020
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
…he#18205)

* use cmake for cd static build, skip running kvstore tests

* update dnnl headers stash location

* remove unnecessary platform condition

* remove 7.5 arch for cu100, cu101, cu102

Co-authored-by: Ubuntu <[email protected]>
Development

Successfully merging this pull request may close these issues.

MKLDNN header missing in recent mxnet nightly static builds
6 participants