Raise toolchain requirements for MXNet 2 #17984

leezu · 2020-04-06T18:37:47Z

Description

As per #17968, require C++17 compatible compiler.
For cuda code, use C++14 mode introduced in Cuda 9. In general there is no issue with linking together objects resulting from different standards as long as they are compiled by the same compiler.
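
A minimal sketch of what the C++17 requirement means for host code, assuming a translation unit compiled by the host compiler rather than in the CUDA C++14 mode. The `starts_with` helper is hypothetical and only serves to exercise a C++17 library feature; it is not part of this PR.

```cpp
#include <string_view>

// Fail the build early if the host compiler is not in C++17 mode or newer.
// (MSVC only reports the real value when /Zc:__cplusplus is passed.)
static_assert(__cplusplus >= 201703L, "a C++17 compatible compiler is required");

// Hypothetical helper: std::string_view is a C++17 library addition.
constexpr bool starts_with(std::string_view text, std::string_view prefix) {
  return text.substr(0, prefix.size()) == prefix;
}

// Evaluated at compile time, so it also verifies constexpr string_view support.
static_assert(starts_with("gcc7", "gcc"), "string_view must be usable in constexpr");
```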

This PR contains the following specific changes:

Switch CI pipeline to use gcc7 on Ubuntu and CentOS
Switch CD pipeline to CentOS 7 with https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/ This enables us to build with gcc7 C++17 compiler while keeping a relatively old glibc requirement for distribution.
Simplify ARM Edge builds
- Switch to standard Ubuntu / Debian cross-compilation toolchain for ARMv7, ARMv8
- Switch to https://toolchains.bootlin.com/ toolchain for ARMv6 (the Debian ARMv6 toolchain is for ARMv4 + ARMv5 + ARMv6, but we wish to only target ARMv6 and make use of ARMv6 features)
- Remove reliance on dockcross for cross compilation.
Simplify Jetson build
- Use standard Ubuntu / Debian cross-compilation toolchain for ARMv8
- Upgrade to Cuda 10 and Jetpack 4.3
- Simplify build setup
Simplify QEMU ARM virtualization test setup on CI
- Remove complex "Virtual Machine in Docker" logic and run a QEMU based Docker container instead based on arm32v7/ubuntu
Fix out of bounds vector accesses in
- SoftmaxGradOpType
- MKLDNNFCBackward
Fix use of non-standard rand_r function (which is no longer available on newer Android toolchains and shouldn't be used in any case).
Fix reproducibility of RNN with Dropout
Fix reproducibility of DGL Graph Sampling Operators
Update tests for Android Edge build to NDK19. The previously used standalone toolchain is obsolete.
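
As an illustration of the rand_r item above, one portable alternative is a caller-owned engine from the standard `<random>` header. This is a sketch under that assumption, not the code this PR actually introduces; `SampleBelow` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper: draws an integer in [0, upper) from a caller-owned engine.
// Unlike rand_r, std::mt19937 is part of the C++ standard library, so it is
// available on every conforming toolchain.
inline std::uint32_t SampleBelow(std::mt19937* engine, std::uint32_t upper) {
  assert(upper > 0);  // an empty range has nothing to sample from
  std::uniform_int_distribution<std::uint32_t> dist(0, upper - 1);
  return dist(*engine);
}
```

Seeding two engines with the same value yields the same sequence on a given standard library, which is what makes a sampler reproducible. The mapping done by `std::uniform_int_distribution` is implementation-defined, so results may still differ between standard libraries.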

Those Dockerfiles that required refactoring as part of the effort were refactored based on the following considerations:

Maximize the use of system dependencies provided by the distribution instead of manually installing dependencies from source or from third party vendors. This reduces the complexity of the installation process and essentially pins the dependency versions, increasing CI stability. Further, Dockerfile build speed is improved. To facilitate this, use recent distribution versions. We still ensure backwards compatibility via CentOS7 based build and test stages.
Minimize the number of layers in the Dockerfile. Don't have 5 different script files executed, each calling apt-get update and install, but just execute once. Speeds up the build and reduces image size. Keep each Dockerfile simple and tailored to a purpose, instead of running 20 scripts to install dependencies for every thinkable scenario, which is unmaintainable.

Some more small changes:

Remove outdated references to Cuda 7 and Cuda 8 in various files.
Remove C++03 support in mshadow
Disable broken tests (may fix as part of this PR if I find the time)
- NumpyBooleanAssignForwardCPU Boolean indexing accesses out of bound elements #17990
- test_init.test_rsp_const_init Row-sparse constant initializer accesses out of bound elements #17988
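
The out-of-bounds accesses listed above all involve indexing a std::vector past its end. A minimal sketch of the difference between unchecked and checked access follows; the helper names are hypothetical and this is not the code changed in the PR.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// values[index] with index >= values.size() is undefined behavior.
// at() performs the same lookup but throws std::out_of_range instead.
inline float CheckedGet(const std::vector<float>& values, std::size_t index) {
  return values.at(index);
}

// &values[0] is an out-of-bounds access when the vector is empty.
// data() may always be called; for an empty vector the pointer must not be
// dereferenced.
inline const float* SafeData(const std::vector<float>& values) {
  return values.data();
}
```

With libstdc++, defining `_GLIBCXX_ASSERTIONS` makes `operator[]` abort on such accesses. Whether the builds in this PR relied on that macro is not stated in the thread.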

mxnet-bot · 2020-04-06T18:37:49Z

Hey @leezu, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [website, centos-cpu, edge, centos-gpu, unix-cpu, windows-cpu, sanity, windows-gpu, unix-gpu, miscellaneous, clang]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

ptrendx · 2020-04-07T23:53:07Z

Note: With this we will be able to change shared_ptr here: https://github.com/apache/incubator-mxnet/blob/master/src/engine/threaded_engine.h#L440 to unique_ptr (as it was supposed to be but C++11 did not allow std::move to lambda).
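
The language feature referred to here is the C++14 init-capture, which lets a lambda take ownership of a move-only object. A minimal sketch, assuming a hypothetical `Payload` type and `MakeTask` helper that stand in for whatever the engine callback owns (this is not the code in threaded_engine.h):

```cpp
#include <memory>
#include <utility>

// Hypothetical payload type, used only for illustration.
struct Payload {
  int value;
};

// C++14 init-capture: the closure takes ownership of the unique_ptr by move.
// In C++11 a capture could only copy or reference, which is why a shared_ptr
// was a common workaround.
inline auto MakeTask(std::unique_ptr<Payload> payload) {
  return [owned = std::move(payload)]() { return owned->value; };
}
```

A closure like this is move-only, so it cannot be stored in a `std::function`, which requires a copyable callable. Any call site that keeps the lambda in a `std::function` would need a different holder.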

szha · 2020-04-13T19:36:07Z

src/operator/contrib/dgl_graph.cc

@@ -24,6 +24,9 @@
 #include <mxnet/operator_util.h>


@zheng-da could you help review this in #17984?

zheng-da · 2020-04-14

DGL no longer depends on the contrib graph operators. I would suggest deprecating this sampling operator, or even removing it in MXNet 2.0. Please see my comment in #16167.

szha

Thanks for the upgrade, @leezu. I went through the commits and overall the changes look good to me. It would have been nice if we could merge the individual commits in this change, though I realize that given the state of CI it may not be possible. In this case, this is an important change that I think we should accept and fix forward if there's any issue.

As per apache#17968, require C++17 compatible compiler. For cuda code, use C++14 mode introduced in Cuda 9. C++17 support for Cuda will be available in Cuda 11.

Switching to C++17 requires modernizing the toolchain, which exposed a number of technical debt issues in the codebase. All blocking issues are fixed as part of this PR. See the full list below.

This PR contains the following specific changes:

Switch CI pipeline to use gcc7 on Ubuntu and CentOS
Switch CD pipeline to CentOS 7 with https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/ This enables us to build with gcc7 C++17 compiler while keeping a relatively old glibc requirement for distribution.
Simplify ARM Edge builds
- Switch to standard Ubuntu / Debian cross-compilation toolchain for ARMv7, ARMv8
- Switch to https://toolchains.bootlin.com/ toolchain for ARMv6 (the Debian ARMv6 toolchain is for ARMv4 + ARMv5 + ARMv6, but we wish to only target ARMv6 and make use of ARMv6 features)
- Remove reliance on dockcross for cross compilation.
Simplify Jetson build
- Use standard Ubuntu / Debian cross-compilation toolchain for ARMv8
- Upgrade to Cuda 10 and Jetpack 4.3
- Simplify build setup
Simplify QEMU ARM virtualization test setup on CI
- Remove complex "Virtual Machine in Docker" logic and run a QEMU based Docker container instead based on arm32v7/ubuntu
Fix out of bounds vector accesses in
- SoftmaxGradOpType
- MKLDNNFCBackward
Fix use of non-standard rand_r function (which is no longer available on newer Android toolchains and shouldn't be used in any case).
Fix reproducibility of RNN with Dropout
Fix reproducibility of DGL Graph Sampling Operators
Update tests for Android Edge build to NDK19. The previously used standalone toolchain is obsolete.

Those Dockerfiles that required refactoring as part of the effort were refactored based on the following considerations:

Maximize the use of system dependencies provided by the distribution instead of manually installing dependencies from source or from third party vendors. This reduces the complexity of the installation process and essentially pins the dependency versions, increasing CI stability. Further, Dockerfile build speed is improved. To facilitate this, use recent distribution versions. We still ensure backwards compatibility via CentOS7 based build and test stages.
Minimize the number of layers in the Dockerfile. Don't have 5 different script files executed, each calling apt-get update and install, but just execute once. Speeds up the build and reduces image size. Keep each Dockerfile simple and tailored to a purpose, instead of running 20 scripts to install dependencies for every thinkable scenario, which is unmaintainable.

Some more small changes:

Remove outdated references to Cuda 7 and Cuda 8 in various files.
Remove C++03 support in mshadow
Disable broken tests
- NumpyBooleanAssignForwardCPU apache#17990
- test_init.test_rsp_const_init apache#17988
- quantized_elemwise_mul apache#18034

List of squashed commits

* cpp standard
* Remove leftover files of Cuda 7 and Cuda 8 support
* thrust 1.9.8 for clang10
* compiler warnings
* Disable broken test_init.test_rsp_const_init
* Disable tests invoking NumpyBooleanAssignForwardCPU
* Fix out of bounds access in SoftmaxGradOpType
* Use CentOS 7 for staticbuilds
  CentOS 7 fulfills the requirements for PEP 599 manylinux-2014 and provides a C++17 toolchain.
* Fix MKLDNNFCBackward
* Update edge toolchain
* Support platforms without rand_r
* Cleanup random.h
* Greatly simplify qemu setup
* Remove unused functions in Jenkins_steps.groovy
* Skip quantized_elemwise_mul due QuantizedElemwiseMulOpShape bug
* Fix R package installation
  apache#18042
* Fix centos ccache
* Fix GPU Makefile staticbuild on CentOS7
* CentOS7 NCCL
* CentOS7 staticbuild fix link with libculibos

leezu force-pushed the cpp17 branch 10 times, most recently from b8a65c5 to 29f8ed7 Compare April 7, 2020 02:14

leezu mentioned this pull request Apr 7, 2020

Dynamic subgraph accesses elements of empty vector #17987

Closed

leezu force-pushed the cpp17 branch from 29f8ed7 to 84c09e4 Compare April 7, 2020 03:52

leezu mentioned this pull request Apr 7, 2020

Row-sparse constant initializer accesses out of bound elements #17988

Open

leezu force-pushed the cpp17 branch 2 times, most recently from ea43436 to 6c41b7d Compare April 7, 2020 05:50

leezu mentioned this pull request Apr 7, 2020

Boolean indexing accesses out of bound elements #17990

Closed

leezu force-pushed the cpp17 branch 2 times, most recently from 468f73b to 8a585bf Compare April 7, 2020 08:08

leezu mentioned this pull request Apr 7, 2020

MKLDNNConvolutionBackward accesses out of bound elements #17992

Closed

leezu force-pushed the cpp17 branch 3 times, most recently from 0096d4c to 0bf829a Compare April 7, 2020 23:08

leezu force-pushed the cpp17 branch 2 times, most recently from e516025 to 3430b08 Compare April 10, 2020 01:23

leezu marked this pull request as ready for review April 10, 2020 01:35

leezu requested review from aaronmarkham, anirudh2290 and marcoabreu as code owners April 10, 2020 01:35

leezu added 3 commits April 13, 2020 16:07

Disable broken test_init.test_rsp_const_init

cc0966b

Disable tests invoking NumpyBooleanAssignForwardCPU

435f8b1

Fix out of bounds access in SoftmaxGradOpType

36986df

leezu force-pushed the cpp17 branch from 8a9ebbf to 33916d2 Compare April 13, 2020 16:27

leezu added the pr-awaiting-review PR is waiting for code review label Apr 13, 2020

leezu force-pushed the cpp17 branch from 875d0be to 696d86c Compare April 13, 2020 17:37

leezu added 9 commits April 13, 2020 19:11

Use CentOS 7 for staticbuilds

d9d6a29

CentOS 7 fulfills the requirements for PEP 599 manylinux-2014 and provides a C++17 toolchain.

Fix MKLDNNFCBackward

cc74ec4

Cleanup random.h

8f648c3

Support platforms without rand_r

104074a

Update edge toolchain

36c87c3

Greatly simplify qemu setup

b1dd13c

Remove unused functions in Jenkins_steps.groovy

80fadd1

Skip quantized_elemwise_mul due QuantizedElemwiseMulOpShape bug

0cc7580

Fix R package installation

941b7ab

apache#18042

leezu force-pushed the cpp17 branch from 696d86c to 941b7ab Compare April 13, 2020 19:27

szha reviewed Apr 13, 2020

szha approved these changes Apr 13, 2020

leezu added 4 commits April 13, 2020 21:44

Fix centos ccache

006ad0a

Fix GPU Makefile staticbuild on CentOS7

24344c5

CentOS7 NCCL

7aaae05

CentOS7 staticbuild fix link with libculibos

c33c85b

leezu merged commit fb73a17 into apache:master Apr 14, 2020

leezu deleted the cpp17 branch April 14, 2020 17:29

This was referenced Apr 14, 2020

CI: cran broken #18042

Closed

Fix ci/docker_cache.py #18056

Merged

nickguletskii mentioned this pull request May 22, 2020

[v1.x] Backport edge pipeline #18375

Merged

13 tasks

leezu mentioned this pull request May 26, 2020

Fix CD failure due to illegal instruction in OpenBLAS #18408

Merged
