-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Add CMake flag CMAKE_BUILD_TYPE=Release
#16294
Conversation
Thanks for your contribution, but shouldn't we rather fix the bug instead of working around it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
This flag will be passed to nvcc, whose behavior is quiet different between debug and no-debug mode. I check the code that cannot work well without the compiler flag, but I can't find the bug. |
@DickJC123 could you maybe assist on this matter? Also, could you provide detailed instructions (ideally a docker container) to reproduce this? |
Simply use
|
Could you also provide the version of your toolchains? Nvcc, cuda, cudnn, gcc, etc |
EC2 Deeplearning AMI 18.1 OS: Ubuntu 16.04 |
@marcoabreu The default cmake build flag if not specified will not use the compiler optimization. We need to explicitly specify CMAKE_BULD_TYPE=Release. I suggest we do something in CMakeList.txt as the following: https://blog.kitware.com/cmake-and-the-default-build-type/ |
I know and that's fine. Unless you explicitly compile with release, you don't get release. But this PR is trying to hide an error and I think we should rather address the error it revealed. |
I find that all install script and ci use this flag, as well as windows build tutorial use release mode. We should make it consistent. |
There are in fact totally valid use cases to not use release compilation. So I wouldn't dismiss the other options. |
CMake build without 'release' flag cannot pass all unit test in |
Interesting, which ones are failing? Can we fix them? |
Just FYI, CMAKE_BULD_TYPE=Release is specified when running CI. |
@marcoabreu There are two problems reflected here. IMHO, we should not tangle them.
|
@marcoabreu IMHO the default settings for building from source should be as close to the ones used for the release versions as possible.
So now we can see that the most possible cause of this is the second one: we may have some complicated GPU kernels that require rewrites (maybe splitting into 2 or more smaller kernels and do multiple launches, or some optimizations to make register usages lower). However, technically this is not "bug" rather than us hitting limitations put on us by the tools and the hardwares we are using. So either we make everything work for everyone (especially the few developers who use DEBUG and run on older GPUs) even when compiler optimizations are poorly done or when the code is run on ancient hardware (at a cost), or, we aim for a working and performant code compiled with proper optimizations for nearly all of our users (who may not even be aware of the build settings), which one would you prefer? Or do you think that this needs a community consensus on it? |
A sample error message:
|
@marcoabreu As you can see from the discussion, sufficient technical justification has been presented for the change in this PR. I will move forward if no more questions/concerns in 24 hours. |
I'd like to get an opinion from some Nvidia engineers first. So far, we are making assumptions about these kernels differing in such a significant matter that we run into limits. Since we have the advantage of having people with detailed knowledge about the ins and outs of GPUs, I'd like to have them consulted first. |
@ptrendx @DickJC123 could you chime in please? |
It makes sense to use release build by default, though we should still document the usage of debug build in developer guide. The debug build issue should definitely be investigated. However, given that the debug build has issues regardless of the changes in this PR, the investigation of the bug should not be a blocker for this change. #9516 seems to be an appropriate issue for investigation on nvcc debug mode. @hgt312 @reminisce @haojin2 it would be great if you could document what you've seen in that issue. @ptrendx @DickJC123 feel free to comment there with any insight you have. @marcoabreu does that sound good? Do you have other technical reasons that make you believe the PR should be stopped? |
Yeah that sounds like a good way to move forward. Could you add the issue as release requirement for the 1.6 release please? After that, we're good to move forward with this PR. |
The |
@ptrendx Thanks for the detailed analysis. That's very helpful. The kernel currently throws the error is @marcoabreu I have created an issue for tracking the progress of fixing this problem. Please kindly consider unblocking this PR from being merged. |
@reminisce Ok, I assigned myself to that issue. |
Thanks everyone! |
@ptrendx Thank you for helping resolve this issue. |
The default value of
CMAKE_BUILD_TYPE
isdebug
, it will cause some problem when using CUDA (some operators with broadcast cannot do backward).Now add it to docs.