-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Add option to choose between OMP implementations #15808
Conversation
@mxnet-label-bot add [Build, CMake] |
@mxnet-label-bot add [Breaking] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. But I am not an expert in CMake so I can not really decide whether this is the most proper way. @marcoabreu @lebeg for review.
I think what you were proposing would be a better way to handle this. Instead of For a case like |
@lebeg you might be right, but getting this working is urgent as we can't debug. I would propose doing it incrementally or getting your contribution in this direction. |
CI is blocked now, and it's going to be blocked for a few days. There's problems with Windows and other things. If I make changes I'm afraid this won't be merged for a while. Up to you guys to decide. I think this is urgent enough to have it merged and then refine it. I would say this is a critical fix for being able to debug, as other people need to debug issues related to other tasks. (Make in debug requires updating the linker which is something I would prefer to avoid in my workstation). The effort to implement your suggestion should be around 30 minutes. |
I'd also prefer the multi selection. |
Can you guys review again? I made the suggested changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@apeforest I think what you propose is not homogeneous with what we have. I think is better to leave an intitial boolean switch and a fine grained selector. It also keeps backwards compatibility. Hence my proposal in this PR. |
Yes, this is more or less what I had in mind. For the sake of backwards compatibility (and maybe simplicity) keeping the option |
e293bdb
to
c378374
Compare
9dff7d8
to
1f68252
Compare
There seem to be some unintended changes in this PR, otherwise it looks good |
They are for CI, a change in dmlc is needed first: |
@lebeg please approve then |
@@ -23,7 +23,7 @@ mxnet_option(USE_CUDA "Build with CUDA support" ON) | |||
mxnet_option(USE_OLDCMAKECUDA "Build with old cmake cuda" OFF) | |||
mxnet_option(USE_NCCL "Use NVidia NCCL with CUDA" OFF) | |||
mxnet_option(USE_OPENCV "Build with OpenCV support" ON) | |||
mxnet_option(USE_OPENMP "Build with Openmp support" ON) | |||
mxnet_option(USE_OPENMP "Build with Openmp support" ON) # OFF | ON | PLATFORM | BUNDLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between ON and the rest? Can you add some comments here?
if(OPENMP_FOUND) | ||
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}") | ||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}") | ||
set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}") | ||
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} ${OpenMP_EXE_LINKER_FLAGS}") | ||
add_definitions(-DMXNET_USE_OPENMP=1) | ||
endif() | ||
else() | ||
message(FATAL_ERROR "USE_OPENMP takes values [PLATFORM, BUNDLED, OFF]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"USE_OPENMP takes values [PLATFORM, BUNDLED, OFF, ON]")?
Why is it needed for this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review comments. Also is the submodule update necessary
AND NOT MSVC | ||
AND NOT CMAKE_CROSSCOMPILING) | ||
|
||
if(USE_OPENMP STREQUAL "BUNDLED" AND EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/3rdparty/openmp/CMakeLists.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first check and the others should be separate to throw appropriate error messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review. Can you explain what you mean exactly? A concrete request would help, as this is a followup on previous review requests that might be conflicting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If someone uses the option "bundled" but the "exists" check fails, they'd receive an error message that they selected an invalid mode.
@@ -452,14 +451,18 @@ if(USE_OPENMP) | |||
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}") | |||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}") | |||
add_definitions(-DMXNET_USE_OPENMP=1) | |||
else() | |||
elseif(USE_OPENMP STREQUAL "PLATFORM" OR USE_OPENMP STREQUAL "ON") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between on and platform
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are the same, for backwards compatibility, meaning if someone had USE_OPENMP=ON will continue working with the OMP implementation which has been shown to be most stable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dangerous
@cjolivier01 if you'd like your change request to be considered a veto, please provide a technical reason. You may refer to the Apache Voting Process for details. Reopening given that @larroy didn't indicate that he'd like to abandon this change. |
Submodule update is necessary as build was failing, there's a fix in the submodule that should be merged soon. You can see the error message by pulling the PR in your local and trying to build. It's related to throw specification, for details see the line changed in the submodule. |
this is a repeat of a previously vetoed proposal by Pedro and another by someone he tried to sneak a similar change in, just under a different guise. I really don’t have the energy to fight openly discuss this anymore after two years. |
in addition, on a fundamental level, flippantly switching between omp libraries while taking nothing else into considerations is extraordinarily hazardous (ie what’s mkl bringing in?). |
This has repeatedly caused crashes reported by several users which didn't happen when we used the platform OMP which is also the one shipped with binary releases. You are doing a disservice to the community and the project with this attitude, the users are free to choose the OMP implementation that they want. We need to reproduce the builds which use the builtin omp with the CMake build, and this change enables exactly this behaviour, You are free to compile with whatever OMP version you like, see the github issues for reports of crashes and problems with other OMP implementations. Intel already said they will remove the bundled iomp from MKL as it causes some crashes using multiprocessing which doesn't see to be caused by our code. |
@cjolivier01 I didn't realize that. If this is indeed a repetition of a change proposal already vetoed, would you mind linking to it? I only recall a case where @lebeg attempted to remove the bundled omp, which is different from what @larroy is proposing here. If this is what you're referring to, given the difference, would you mind providing a technical reason for rejecting this change? |
There is plenty of info out there already on this issue including discussions on dev. I will not be forever forced to re-argue this thing every time Pedro wants to sneak basically the exact same change in every few months for two years+ while never actually root-causing the issue (he’s been told what the issue was — not the OMP library’s fault) — it’s a big in mxnet — see the issue he is locked out of), but ignores repeatedly. |
@cjolivier01 stop holding this project hostage. This will be reported to the relevant decision makers in Apache. I hope your powers are revoked due to your track record of destructive actions. |
If the submodule update fixes the build, please put it in a separate PR as already stated. |
While I try to not engage in the discussion above whether this is a match to the previously vetoed PR, I agree that I see some similarities. With regards to #15790: This PR doesn't actually solve the problem (which appears to be the same as previously) but gives users a different OMP library choice. That effectively means that the default method we use is still broken and users would have to know that they have to use a different one. I might be no expert in that field, but I agree that to this day we only read something about some assertion being hit and nothing further. From what I can see, there has been no root causing (to the actual root) but a lot based on assumptions. This change seems quite massive (introducing a new OMP Backend choice) while not being based on real data. So i understand the frustration from both sides here: Chris' request for root causing is constantly being denied, while the root cause for pedro seems apparent. But since we are a community and we would like to make a well educated decision as a group, I'd agree with the general request for more root causing in order to create transparency. Thinking about it, I agree that this is the same change in a different costume that already had been vetoed with proper technical reasons - but I understand that Chris is getting frustrated by constantly having to repeat his request. So yeah, I agree with his veto that I'd like to see a deep dive. That should be collected somewhere and if we are at a crossroads, we should bring the topic up for a broader vote on dev@. |
Another proposal to avoid the difficult personal situation is to hand this issues over to somebody else. Maybe another engineer at Amazon or somebody at Intel or Nvidia would like to be of assistance here. I'm certain that they are also quite capable of debugging concurrency issues. Wdyt? |
Description
Adds a build switch to choose OpenMP implementations:
Workaround for #15790
@karan6181