This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add CMake flag CMAKE_BUILD_TYPE=Release #16294

Merged

1 commit merged into apache:master on Sep 30, 2019

Conversation

hgt312
Contributor

@hgt312 hgt312 commented Sep 27, 2019

The default value of CMAKE_BUILD_TYPE is a debug build, which causes problems when using CUDA (some operators with broadcast cannot run backward).

This PR adds the flag to the docs.

@marcoabreu
Contributor

Thanks for your contribution, but shouldn't we rather fix the bug instead of working around it?

Contributor

@marcoabreu marcoabreu left a comment


.

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

Thanks for your contribution, but shouldn't we rather fix the bug instead of working around it?

This flag is passed to nvcc, whose behavior is quite different between debug and non-debug mode. I checked the code that fails without the compiler flag, but I couldn't find the bug.

@marcoabreu
Contributor

@DickJC123 could you maybe assist on this matter?

Also, could you provide detailed instructions (ideally a docker container) to reproduce this?

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

Simply use mx.np.divide to do a broadcast computation and then run backward; the error message ends with:

too many resources requested for launch

@marcoabreu
Contributor

Could you also provide the versions of your toolchain? nvcc, CUDA, cuDNN, gcc, etc.

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

EC2 Deeplearning AMI 18.1

OS: Ubuntu 16.04
CMake: 3.13.3
CUDA: 10.0
CUDNN: 7.4.1
nvcc: 10.0.130
gcc: 5.4.0

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

#15747

@apeforest
Contributor

@marcoabreu If CMAKE_BUILD_TYPE is not specified, the default cmake build does not enable compiler optimizations. We need to explicitly specify CMAKE_BUILD_TYPE=Release. I suggest we handle this in CMakeLists.txt along the lines of https://blog.kitware.com/cmake-and-the-default-build-type/

@marcoabreu
Contributor

I know and that's fine. Unless you explicitly compile with release, you don't get release.

But this PR is trying to hide an error and I think we should rather address the error it revealed.

@hgt312
Contributor Author

hgt312 commented Sep 28, 2019

I find that all install scripts and CI use this flag, and the Windows build tutorial also uses Release mode. We should make this consistent.
Since Ubuntu and other Unix-like OSes are the most widely used, the tutorial should include the flag to make sure users get a bug-free MXNet.

@marcoabreu
Contributor

There are in fact totally valid use cases to not use release compilation. So I wouldn't dismiss the other options.

@hgt312
Contributor Author

hgt312 commented Sep 28, 2019

There are in fact totally valid use cases to not use release compilation. So I wouldn't dismiss the other options.

In my testing, a CMake build without the 'Release' flag cannot pass all the unit tests in test_operator_gpu.py.

@marcoabreu
Contributor

Interesting, which ones are failing? Can we fix them?

@xidulu
Contributor

xidulu commented Sep 29, 2019

Just FYI, CMAKE_BUILD_TYPE=Release is specified when running CI.

https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L858

@reminisce
Contributor

@marcoabreu There are two problems reflected here. IMHO, we should not conflate them.

  1. The Release build type is not set as the default in the cmake build config. Considering that most users just adopt the default settings we provide, this should be corrected so that the performance of the binaries produced by make and cmake is comparable. This PR addresses that problem.

  2. The library built using either the current cmake (without the Release build type) or make with DEBUG=1 fails on reduce-like operators on GPUs, sum for example. This problem has been observed many times by various users. That part of the code is owned by the Nvidia folks, and the algorithm is sophisticated enough that it requires their expertise to dive in. We should track this issue in a separate thread instead of asking the contributors fixing the build-type problem to look into an issue outside their domain knowledge.

@haojin2
Contributor

haojin2 commented Sep 29, 2019

@marcoabreu IMHO the default settings for building from source should be as close as possible to the ones used for the release versions. DEBUG is more of a choice for our developers, isn't it?
Also, considering that our CI does not even test the DEBUG build mode and that most other builds do not use DEBUG as the default, I would consider this PR a step toward more consistent settings across the different builds.
I agree that we need to fix "bugs", so let's take a look at this "bug" now. I've personally encountered and fixed several instances of this "bug" myself. From my previous experience, the error @hgt312 encountered could be caused by using too-large blocks, too many registers, or too much shared memory, according to this thread on the CUDA user forum. Let's examine each of them:

  1. Could we have a different number of GPU thread blocks because we changed the build? NO, the block count depends on the input size, which did not change between the two test runs with the same test code. Also, we mostly use very small tensors for testing, so this cannot be the cause.
  2. Could we use a different number of registers because we changed the build? YES! Without proper optimization, a vanilla compilation of a complicated GPU kernel can lead to excessive register usage.
  3. Could we use too much shared memory because we changed the build? NO, the amount of shared memory needed depends only on the input data's shape and type.

So the most likely cause is the second one: we may have some complicated GPU kernels that require rewrites (maybe splitting them into two or more smaller kernels with multiple launches, or optimizing them to lower register usage). Technically, though, this is not a "bug" so much as us hitting limits imposed by the tools and hardware we are using. So either we make everything work for everyone (especially the few developers who use DEBUG and run on older GPUs), even when compiler optimizations are poorly done or the code runs on ancient hardware (at a cost), or we aim for working, performant code compiled with proper optimizations for nearly all of our users (who may not even be aware of the build settings). Which one would you prefer? Or do you think this needs community consensus?
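
For what it's worth, the register hypothesis can be checked directly by comparing kernel attributes between the Debug and Release builds. Below is a minimal standalone CUDA sketch, using a hypothetical dummy kernel rather than MXNet's actual reduce kernel, that prints the per-thread register count and the largest block size the runtime will accept for it; if the Debug build reports a maxThreadsPerBlock smaller than the block size the operator launches with, the launch fails with exactly this "too many resources requested for launch" error.

// Diagnostic sketch (not MXNet code): query a kernel's resource usage.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a reduction kernel such as reduce_kernel_M1.
__global__ void dummy_reduce_kernel(const float* in, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) atomicAdd(out, in[idx]);
}

int main() {
  cudaFuncAttributes attr;
  cudaError_t err = cudaFuncGetAttributes(&attr, dummy_reduce_kernel);
  if (err != cudaSuccess) {
    std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // numRegs grows in unoptimized (Debug) builds; maxThreadsPerBlock shrinks
  // accordingly, because the SM register file is shared by all threads.
  std::printf("registers per thread : %d\n", attr.numRegs);
  std::printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
  return 0;
}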
@hgt312 Would you please share the failures so that we know which operators we need to look at? Then, once the decision on the tradeoff has been made, we can take further action accordingly.

@hgt312
Contributor Author

hgt312 commented Sep 29, 2019

A sample error message:

(base) ubuntu@ip-172-31-16-49:~/incubator-mxnet$ nosetests -s --verbose tests/python/gpu/test_operator_gpu.py:test_np_sum
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=342263604 to reproduce.
test_operator_gpu.test_np_sum ... [08:06:44] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7401, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1216105730 to reproduce.
ERROR

======================================================================
ERROR: test_operator_gpu.test_np_sum
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/home/ubuntu/incubator-mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/util.py", line 307, in _with_np_shape
    return func(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/util.py", line 491, in _with_np_array
    return func(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/tests/python/gpu/../unittest/test_numpy_op.py", line 264, in test_np_sum
    assert_almost_equal(y.asnumpy(), expected_ret, rtol=1e-3 if dtype == 'float16' else 1e-3,
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2504, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [08:06:54] /home/ubuntu/incubator-mxnet/src/operator/nn/././../tensor/./broadcast_reduce-inl.cuh:528: Check failed: err == cudaSuccess (7 vs. 0) : Name: reduce_kernel_M1 ErrStr:too many resources requested for launch
Stack trace:
  [bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f81a9b7fb82]
  [bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::broadcast::ReduceImpl<mxnet::op::mshadow_op::sum, 2, float, mshadow::half::half_t, mshadow::half::half_t, mxnet::op::mshadow_op::identity>(CUstream_st*, mxnet::TBlob const&, mxnet::OpReqType, mxnet::TBlob const&, mshadow::Tensor<mshadow::gpu, 1, char> const&, mxnet::op::broadcast::ReduceImplConfig<2> const&)+0x820) [0x7f81aa184e10]
  [bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::broadcast::Reduce<mxnet::op::mshadow_op::sum, 2, mshadow::half::half_t, mxnet::op::mshadow_op::identity, true>(mshadow::Stream<mshadow::gpu>*, mxnet::TBlob const&, mxnet::OpReqType, mshadow::Tensor<mshadow::gpu, 1, char> const&, mxnet::TBlob const&)+0x539) [0x7f81aa187eb9]
  [bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::ReduceAxesComputeImpl<mshadow::gpu, mxnet::op::mshadow_op::sum, true, false, mxnet::op::mshadow_op::identity>(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, mxnet::TShape const&)+0x13e9) [0x7f81aa868649]
  [bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::NumpyReduceAxesCompute<mshadow::gpu, mxnet::op::mshadow_op::sum, true, false, mxnet::op::mshadow_op::identity>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x4ac) [0x7f81aa97a26c]
  [bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x2a6) [0x7f81ac1cdc16]
  [bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f81ac1cde67]
  [bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(+0x39d3f4e) [0x7f81ac127f4e]
  [bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x5cf) [0x7f81ac13418f]


-------------------- >> begin captured logging << --------------------
root: INFO: NumPy-shape semantics has been activated in your code. This is required for creating and manipulating scalar and zero-size tensors, which were not supported in MXNet before, as in the official NumPy library. Please DO NOT manually deactivate this semantics while using `mxnet.numpy` and `mxnet.numpy_extension` modules.
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=342263604 to reproduce.
root: INFO: NumPy-shape semantics has been activated in your code. This is required for creating and manipulating scalar and zero-size tensors, which were not supported in MXNet before, as in the official NumPy library. Please DO NOT manually deactivate this semantics while using `mxnet.numpy` and `mxnet.numpy_extension` modules.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1216105730 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 9.612s

FAILED (errors=1)

@reminisce
Contributor

@marcoabreu As you can see from the discussion, sufficient technical justification has been presented for the change in this PR. I will move forward if there are no more questions or concerns within 24 hours.

@marcoabreu
Contributor

I'd like to get an opinion from some Nvidia engineers first. So far, we are assuming that these kernels differ between builds so significantly that we run into hardware limits. Since we have the advantage of having people with detailed knowledge of the ins and outs of GPUs, I'd like to consult them first.

@marcoabreu
Contributor

@ptrendx @DickJC123 could you chime in please?

@szha
Member

szha commented Sep 30, 2019

It makes sense to use release build by default, though we should still document the usage of debug build in developer guide.

The debug build issue should definitely be investigated. However, given that the debug build has issues regardless of the changes in this PR, the investigation of the bug should not be a blocker for this change.

#9516 seems to be an appropriate issue for investigation on nvcc debug mode. @hgt312 @reminisce @haojin2 it would be great if you could document what you've seen in that issue. @ptrendx @DickJC123 feel free to comment there with any insight you have.

@marcoabreu does that sound good? Do you have other technical reasons that make you believe the PR should be stopped?

@marcoabreu
Contributor

Yeah, that sounds like a good way to move forward. Could you add the issue as a release requirement for the 1.6 release, please? After that, we're good to move forward with this PR.

@ptrendx
Member

ptrendx commented Sep 30, 2019

The "too many resources requested for launch" error most often happens because the number of registers required by the kernel exceeds the number of registers available. The register file on the GPU has a capacity that is shared by all threads in an SM (streaming multiprocessor), so the more registers each thread uses, the fewer threads can actually be launched. The problem comes from the fact that the number of threads launched is known only at runtime, not at compile time, so the compiler cannot do the analysis needed to limit the number of registers used or spill some to global memory. The debug build uses more registers than the release build, which is why you hit the error in that particular kernel. This problem can be solved by telling the compiler, via launch bounds, the maximum number of threads per block (and how many blocks per SM) the kernel will be launched with. Inserting proper launch bounds should be a very easy change; if you have any problem applying it, just tell us which exact kernel is giving this error and we can make a PR for it as well.
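
To make this concrete, here is a minimal CUDA sketch of a launch-bounds annotation on a hypothetical kernel (not the actual reduce_kernel_M1); the values 512 and 2 are illustrative assumptions and would have to match the real host-side launch configuration.

// Sketch of a launch-bounds annotation on a hypothetical reduction kernel.
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells nvcc
// the largest block size the kernel will ever be launched with (and how many
// blocks per SM it should try to keep resident), so the compiler caps register
// usage or spills instead of producing a kernel that fails to launch.
__global__ void __launch_bounds__(512, 2)
bounded_reduce_kernel(const float* in, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) atomicAdd(out, in[idx]);
}

// The host-side launch must respect the bound: launching with more than 512
// threads per block would be an error.
// bounded_reduce_kernel<<<num_blocks, 512>>>(d_in, d_out, n);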

@reminisce
Contributor

@ptrendx Thanks for the detailed analysis; that's very helpful. The kernel that currently throws the error is reduce_kernel_M1. It would be best if you could apply the fix, since you are the experts here.

@marcoabreu I have created an issue to track the progress of fixing this problem. Please consider unblocking this PR so it can be merged.

@ptrendx
Member

ptrendx commented Sep 30, 2019

@reminisce Ok, I assigned myself to that issue.

@marcoabreu marcoabreu merged commit 097deff into apache:master Sep 30, 2019
@marcoabreu
Contributor

Thanks everyone!

@reminisce
Contributor

@ptrendx Thank you for helping resolve this issue.
