This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

FusedOp Failing Static Linked Build #16765

Closed
zachgk opened this issue Nov 8, 2019 · 7 comments

@zachgk
Contributor

zachgk commented Nov 8, 2019

Description

The statically linked build used for Scala Maven publishing is currently failing. This is blocking the current nightly snapshot and must be fixed before the release jars can be built.

build/src/operator/fusion/fused_op_gpu.o: In function `void mxnet::FusedOp::Forward<mshadow::gpu>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)':

tmpxft_00008748_00000000-5_fused_op.compute_70.cudafe1.cpp:(.text._ZN5mxnet7FusedOp7ForwardIN7mshadow3gpuEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISC_EERKSB_INS_9OpReqTypeESaISH_EESG_+0x1287): undefined reference to `cuLaunchKernel'

tmpxft_00008748_00000000-5_fused_op.compute_70.cudafe1.cpp:(.text._ZN5mxnet7FusedOp7ForwardIN7mshadow3gpuEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKSt6vectorINS_5TBlobESaISC_EERKSB_INS_9OpReqTypeESaISH_EESG_+0x12a6): undefined reference to `cuGetErrorString'

collect2: error: ld returned 1 exit status

make: *** [bin/im2rec] Error 1

make: *** Waiting for unfinished jobs....

2019-11-01 20:27:09,794 - root - INFO - Waiting for status of container 00fc4568b4c9 for 600 s.

2019-11-01 20:27:10,060 - root - INFO - Container exit status: {'StatusCode': 2, 'Error': None}

2019-11-01 20:27:10,061 - root - ERROR - Container exited with an error 😞

2019-11-01 20:27:10,061 - root - INFO - Executed command for reproduction:

See full log at http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-publish-artifacts/detail/master/287/pipeline/
Main Scala nightly pipeline at http://jenkins.mxnet-ci.amazon-ml.com/job/restricted-publish-artifacts/job/master/

It seems to be a result of #15167. The pip build has also been failing since this date for what might be the same reason.

To Reproduce

This version of the build can be run by following the instructions at https://github.com/apache/incubator-mxnet/tree/master/tools/staticbuild. The Scala build uses the cu92mkl variant by default, but other CUDA builds should have the same problem.
The build is currently run in an Ubuntu 14.04 Docker container using https://github.com/apache/incubator-mxnet/blob/master/ci/docker/Dockerfile.publish.ubuntu1404_cpu.

@zachgk
Contributor Author

zachgk commented Nov 8, 2019

@ptrendx Can you take a look?

@ptrendx
Member

ptrendx commented Nov 8, 2019

@samskalicky
Contributor

@zachgk assign @ptrendx

@ptrendx
Member

ptrendx commented Nov 12, 2019

Trading my assignment with @DickJC123 who is looking into this issue.

@ptrendx ptrendx assigned DickJC123 and unassigned ptrendx Nov 12, 2019
@DickJC123
Contributor

I'm able to reproduce the link error in the docker container mentioned above with the command:

tools/staticbuild/build.sh cu92mkl maven

I'll continue investigating the root cause. FYI, the following command does not have a similar issue:

tools/staticbuild/build.sh cu92 pip

@DickJC123
Contributor

DickJC123 commented Nov 15, 2019

Finally figured out what's going on here. The link of bin/im2rec via ld (as driven by g++) is failing because LDFLAGS is missing '-lcuda -lnvrtc'. The Makefile adds these flags to LDFLAGS (and compiles with MXNET_ENABLE_CUDA_RTC=1) only if it sees ENABLE_CUDA_RTC set in config.mk. The maven builds are still setting the flag USE_NVRTC, which has no effect, while the pip builds were converted to ENABLE_CUDA_RTC via PR #14250. Not sure why that PR stopped short of converting all the builds.
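For anyone who wants to check which flag a given build config sets, here is a minimal sketch. It assumes the config is a config.mk-style file with `NAME = value` lines; the helper name `check_rtc_flag` and its three output labels are made up for illustration and are not part of the MXNet tooling.

```shell
# Hypothetical helper (not part of MXNet): classify a config.mk-style file
# by which RTC-related flag it sets.
check_rtc_flag() {
  config_file="$1"
  if grep -Eq '^[[:space:]]*ENABLE_CUDA_RTC[[:space:]]*=[[:space:]]*1' "$config_file"; then
    # The flag the Makefile looks for, per the analysis above.
    echo "rtc-enabled"
  elif grep -Eq '^[[:space:]]*USE_NVRTC[[:space:]]*=' "$config_file"; then
    # The flag name described above as having no effect.
    echo "stale-flag"
  else
    echo "rtc-disabled"
  fi
}
```

Called as `check_rtc_flag path/to/config.mk`, a result of `stale-flag` would point at the situation described above: the config asks for NVRTC under the old name, so '-lcuda -lnvrtc' never reach LDFLAGS.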

The functionality of ./src/common/rtc.cc is guarded by MXNET_ENABLE_CUDA_RTC. So the question to fusion PR author @ptrendx becomes, do you think the pointwise fusion should be similarly guarded by ENABLE_CUDA_RTC (or a different flag)? Should MXNet warn when the user is running on a build that lacks the rtc capability, and under what circumstances (e.g. only when MXNET_USE_FUSION=1 is set explicitly in the environment and on a gpu context)? Should the user expect to run the unittest suite on the no-rtc builds, and how do we detect that?

@samskalicky
Contributor

@zachgk is this fixed now that #16838 is merged?

@zachgk zachgk closed this as completed Nov 25, 2019