This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add CMake flag CMAKE_BUILD_TYPE=Release #16294

Merged

1 commit merged into apache:master on Sep 30, 2019

Conversation

hgt312
Contributor

@hgt312 hgt312 commented Sep 27, 2019

The default value of CMAKE_BUILD_TYPE is a debug build, which causes problems when using CUDA (some operators with broadcast cannot run backward).

This PR adds the flag to the docs.

@marcoabreu
Contributor

Thanks for your contribution, but shouldn't we rather fix the bug instead of working around it?

Contributor

@marcoabreu marcoabreu left a comment


.

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

Thanks for your contribution, but shouldn't we rather fix the bug instead of working around it?

This flag is passed to nvcc, whose behavior is quite different between debug and non-debug mode. I checked the code that fails without the compiler flag, but I couldn't find the bug.

@marcoabreu
Contributor

@DickJC123 could you maybe assist on this matter?

Also, could you provide detailed instructions (ideally a docker container) to reproduce this?

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

Simply use mx.np.divide to do a broadcast computation and then run backward; the error message ends with:

too many resources requested for launch

@marcoabreu
Contributor

Could you also provide the versions of your toolchain? nvcc, CUDA, cuDNN, gcc, etc.

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

EC2 Deeplearning AMI 18.1

OS: Ubuntu 16.04
CMake: 3.13.3
CUDA: 10.0
CUDNN: 7.4.1
nvcc: 10.0.130
gcc: 5.4.0

@hgt312
Contributor Author

hgt312 commented Sep 27, 2019

#15747

@apeforest
Contributor

@marcoabreu If CMAKE_BUILD_TYPE is not specified, the default cmake build does not enable compiler optimizations. We need to explicitly specify CMAKE_BUILD_TYPE=Release. I suggest we handle this in CMakeLists.txt along the lines of https://blog.kitware.com/cmake-and-the-default-build-type/

@marcoabreu
Contributor

I know and that's fine. Unless you explicitly compile with release, you don't get release.

But this PR is trying to hide an error and I think we should rather address the error it revealed.

@hgt312
Contributor Author

hgt312 commented Sep 28, 2019

I find that all install scripts and CI use this flag, and the Windows build tutorial also uses Release mode. We should make this consistent.
Since Ubuntu and other Unix-like OSes are the most widely used, the tutorial should include the flag to make sure users get a bug-free MXNet.

@marcoabreu
Contributor

There are in fact totally valid use cases to not use release compilation. So I wouldn't dismiss the other options.

@hgt312
Contributor Author

hgt312 commented Sep 28, 2019

There are in fact totally valid use cases to not use release compilation. So I wouldn't dismiss the other options.

In my testing, a CMake build without the 'Release' flag cannot pass all the unit tests in test_operator_gpu.py.

@marcoabreu
Contributor

Interesting, which ones are failing? Can we fix them?

@xidulu
Contributor

xidulu commented Sep 29, 2019

Just FYI, CMAKE_BUILD_TYPE=Release is specified when running CI.

https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh#L858

@reminisce
Contributor

@marcoabreu There are two problems reflected here. IMHO, we should not conflate them.

  1. The Release build type is not set as the default in the cmake build config. Considering that most users just adopt the default settings we provide, this should be corrected so that the performance of the binaries produced by make and cmake is comparable. This PR addresses that problem.

  2. The library built using either the current cmake (without the Release build type) or make with DEBUG=1 fails on reduce-like operators on GPUs, sum for example. This problem has been observed many times by various users. That part of the code is owned by the Nvidia folks, and the algorithm is sophisticated enough that it requires their expertise to dive in. We should track this issue in a separate thread instead of asking the contributors fixing the build-type problem to look into an issue outside their domain knowledge.

@haojin2
Contributor

haojin2 commented Sep 29, 2019

@marcoabreu IMHO the default settings for building from source should be as close as possible to the ones used for the release versions. DEBUG is more of a choice for our developers, isn't it?
Also, considering that our CI does not even test the DEBUG build mode and that most other builds do not use DEBUG as the default, I would consider this PR a step toward more consistent settings across the different builds.
I agree that we need to fix "bugs", so let's take a look at this "bug" now. I've personally encountered and fixed several instances of this "bug" myself. From my previous experience, the error @hgt312 encountered could be caused by using too-large blocks, too many registers, or too much shared memory, according to this thread on the CUDA user forum. Let's examine each of them:

  1. Could we have a different number of GPU thread blocks because we changed the build? NO, the block count depends on the input size, which did not change between the two test runs with the same test code. Also, we mostly use very small tensors for testing, so this cannot be the cause.
  2. Could we use a different number of registers because we changed the build? YES! Without proper optimization, a vanilla compilation of a complicated GPU kernel can lead to excessive register usage.
  3. Could we use too much shared memory because we changed the build? NO, the amount of shared memory needed depends only on the input data's shape and type.

So the most likely cause is the second one: we may have some complicated GPU kernels that require rewrites (maybe splitting them into two or more smaller kernels with multiple launches, or optimizing them to lower register usage). Technically, though, this is not a "bug" so much as us hitting limits imposed by the tools and hardware we are using. So either we make everything work for everyone (especially the few developers who use DEBUG and run on older GPUs), even when compiler optimizations are poorly done or the code runs on ancient hardware (at a cost), or we aim for working, performant code compiled with proper optimizations for nearly all of our users (who may not even be aware of the build settings). Which one would you prefer? Or do you think this needs community consensus?
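
For what it's worth, the register hypothesis can be checked directly by comparing kernel attributes between the Debug and Release builds. Below is a minimal standalone CUDA sketch, using a hypothetical dummy kernel rather than MXNet's actual reduce kernel, that prints the per-thread register count and the largest block size the runtime will accept for it; if the Debug build reports a maxThreadsPerBlock smaller than the block size the operator launches with, the launch fails with exactly this "too many resources requested for launch" error.

// Diagnostic sketch (not MXNet code): query a kernel's resource usage.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a reduction kernel such as reduce_kernel_M1.
__global__ void dummy_reduce_kernel(const float* in, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) atomicAdd(out, in[idx]);
}

int main() {
  cudaFuncAttributes attr;
  cudaError_t err = cudaFuncGetAttributes(&attr, dummy_reduce_kernel);
  if (err != cudaSuccess) {
    std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  // numRegs grows in unoptimized (Debug) builds; maxThreadsPerBlock shrinks
  // accordingly, because the SM register file is shared by all threads.
  std::printf("registers per thread : %d\n", attr.numRegs);
  std::printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
  return 0;
}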
@hgt312 Would you please share the failures so that we know which operators we need to look at? Then, once the decision on the tradeoff has been made, we can take further action accordingly.

@hgt312
Contributor Author

hgt312 commented Sep 29, 2019

A sample error message:

(base) ubuntu@ip-172-31-16-49:~/incubator-mxnet$ nosetests -s --verbose tests/python/gpu/test_operator_gpu.py:test_np_sum
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=342263604 to reproduce.
test_operator_gpu.test_np_sum ... [08:06:44] ../src/base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7401, which is older than the oldest version tested by CI (7600).  Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1216105730 to reproduce.
ERROR

======================================================================
ERROR: test_operator_gpu.test_np_sum
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/home/ubuntu/incubator-mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/util.py", line 307, in _with_np_shape
    return func(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/python/mxnet/util.py", line 491, in _with_np_array
    return func(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/tests/python/gpu/../unittest/test_numpy_op.py", line 264, in test_np_sum
    assert_almost_equal(y.asnumpy(), expected_ret, rtol=1e-3 if dtype == 'float16' else 1e-3,
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2504, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 254, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [08:06:54] /home/ubuntu/incubator-mxnet/src/operator/nn/././../tensor/./broadcast_reduce-inl.cuh:528: Check failed: err == cudaSuccess (7 vs. 0) : Name: reduce_kernel_M1 ErrStr:too many resources requested for launch
Stack trace:
  [bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x32) [0x7f81a9b7fb82]
  [bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::broadcast::ReduceImpl<mxnet::op::mshadow_op::sum, 2, float, mshadow::half::half_t, mshadow::half::half_t, mxnet::op::mshadow_op::identity>(CUstream_st*, mxnet::TBlob const&, mxnet::OpReqType, mxnet::TBlob const&, mshadow::Tensor<mshadow::gpu, 1, char> const&, mxnet::op::broadcast::ReduceImplConfig<2> const&)+0x820) [0x7f81aa184e10]
  [bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::broadcast::Reduce<mxnet::op::mshadow_op::sum, 2, mshadow::half::half_t, mxnet::op::mshadow_op::identity, true>(mshadow::Stream<mshadow::gpu>*, mxnet::TBlob const&, mxnet::OpReqType, mshadow::Tensor<mshadow::gpu, 1, char> const&, mxnet::TBlob const&)+0x539) [0x7f81aa187eb9]
  [bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::ReduceAxesComputeImpl<mshadow::gpu, mxnet::op::mshadow_op::sum, true, false, mxnet::op::mshadow_op::identity>(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, mxnet::TShape const&)+0x13e9) [0x7f81aa868649]
  [bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(void mxnet::op::NumpyReduceAxesCompute<mshadow::gpu, mxnet::op::mshadow_op::sum, true, false, mxnet::op::mshadow_op::identity>(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)+0x4ac) [0x7f81aa97a26c]
  [bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x2a6) [0x7f81ac1cdc16]
  [bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f81ac1cde67]
  [bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(+0x39d3f4e) [0x7f81ac127f4e]
  [bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x5cf) [0x7f81ac13418f]


-------------------- >> begin captured logging << --------------------
root: INFO: NumPy-shape semantics has been activated in your code. This is required for creating and manipulating scalar and zero-size tensors, which were not supported in MXNet before, as in the official NumPy library. Please DO NOT manually deactivate this semantics while using `mxnet.numpy` and `mxnet.numpy_extension` modules.
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=342263604 to reproduce.
root: INFO: NumPy-shape semantics has been activated in your code. This is required for creating and manipulating scalar and zero-size tensors, which were not supported in MXNet before, as in the official NumPy library. Please DO NOT manually deactivate this semantics while using `mxnet.numpy` and `mxnet.numpy_extension` modules.
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1216105730 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 9.612s

FAILED (errors=1)

@reminisce
Contributor

@marcoabreu As you can see from the discussion, sufficient technical justification has been presented for the change in this PR. I will move forward if there are no more questions or concerns within 24 hours.

@marcoabreu
Contributor

I'd like to get an opinion from some Nvidia engineers first. So far, we are assuming that these kernels differ between builds so significantly that we run into hardware limits. Since we have the advantage of having people with detailed knowledge of the ins and outs of GPUs, I'd like to consult them first.

@marcoabreu
Contributor

@ptrendx @DickJC123 could you chime in please?

@szha
Member

szha commented Sep 30, 2019

It makes sense to use release build by default, though we should still document the usage of debug build in developer guide.

The debug build issue should definitely be investigated. However, given that the debug build has issues regardless of the changes in this PR, the investigation of the bug should not be a blocker for this change.

#9516 seems to be an appropriate issue for investigation on nvcc debug mode. @hgt312 @reminisce @haojin2 it would be great if you could document what you've seen in that issue. @ptrendx @DickJC123 feel free to comment there with any insight you have.

@marcoabreu does that sound good? Do you have other technical reasons that make you believe the PR should be stopped?

@marcoabreu
Contributor

Yeah, that sounds like a good way to move forward. Could you add the issue as a release requirement for the 1.6 release, please? After that, we're good to move forward with this PR.

@ptrendx
Member

ptrendx commented Sep 30, 2019

The "too many resources requested for launch" error most often happens because the number of registers required by the kernel exceeds the number of registers available. The register file on the GPU has a capacity that is shared by all threads in an SM (streaming multiprocessor), so the more registers each thread uses, the fewer threads can actually be launched. The problem comes from the fact that the number of threads launched is known only at runtime, not at compile time, so the compiler cannot do the analysis needed to limit the number of registers used or spill some to global memory. The debug build uses more registers than the release build, which is why you hit the error in that particular kernel. This problem can be solved by telling the compiler, via launch bounds, the maximum number of threads per block (and how many blocks per SM) the kernel will be launched with. Inserting proper launch bounds should be a very easy change; if you have any problem applying it, just tell us which exact kernel is giving this error and we can make a PR for it as well.
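
To make this concrete, here is a minimal CUDA sketch of a launch-bounds annotation on a hypothetical kernel (not the actual reduce_kernel_M1); the values 512 and 2 are illustrative assumptions and would have to match the real host-side launch configuration.

// Sketch of a launch-bounds annotation on a hypothetical reduction kernel.
#include <cuda_runtime.h>

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells nvcc
// the largest block size the kernel will ever be launched with (and how many
// blocks per SM it should try to keep resident), so the compiler caps register
// usage or spills instead of producing a kernel that fails to launch.
__global__ void __launch_bounds__(512, 2)
bounded_reduce_kernel(const float* in, float* out, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) atomicAdd(out, in[idx]);
}

// The host-side launch must respect the bound: launching with more than 512
// threads per block would be an error.
// bounded_reduce_kernel<<<num_blocks, 512>>>(d_in, d_out, n);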

@reminisce
Contributor

@ptrendx Thanks for the detailed analysis; that's very helpful. The kernel that currently throws the error is reduce_kernel_M1. It would be best if you could apply the fix, since you are the experts here.

@marcoabreu I have created an issue to track the progress of fixing this problem. Please consider unblocking this PR so it can be merged.

@ptrendx
Member

ptrendx commented Sep 30, 2019

@reminisce Ok, I assigned myself to that issue.

@marcoabreu marcoabreu merged commit 097deff into apache:master Sep 30, 2019
@marcoabreu
Contributor

Thanks everyone!

@reminisce
Contributor

@ptrendx Thank you for helping resolve this issue.
