-
Notifications
You must be signed in to change notification settings - Fork 6.8k
Refactor for windows CI 'out of heap space' errors #15922
Conversation
I reviewed the original laop_6 failure in one of my other PR CI runs and found it was not related to tolerances. Hence reverting the test tolerance change. The test_shuffle speed up commit does not alter tolerances or the functionality of the test. |
As mentioned to Dick offline, I do not like this PR to be seen as an alternative to the #15882 PR. I believe we should do both of them - refactor the biggest files to speed up the compilation, but also guard against problems with single operators being too big. |
@larroy @marcoabreu Do you understand why the CI doesn't show here as completely passing? The 'miscellaneous' part also passed if you look at the details. |
Since October last year our CI system is unable to reliably communicate with GitHub. This results in lost status updates. |
Hey, can you weigh in on this PR? With the bundled speed improvement of test_shuffle, it's a major step in stabilizing the CI. |
I don't feel comfortable reviewing such an amount of c++ code and also a bit time constrained due to my business trip. So it would be great if somebody else could weigh in here |
Understandable. It's really only a lot of cut-and-paste, but I'll reach out to someone else. |
@@ -1580,6 +1581,20 @@ void PickOpBackward(const nnvm::NodeAttrs& attrs, | |||
}); | |||
} | |||
|
|||
inline std::string get_reduce_axes_description(const std::string& op_name, int line) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you want to inline this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short answer is that the function was formerly in broadcast_reduce_op_value.cc (with the inline keyword) and I copied it intact to broadcast_reduce_op.h.
Also, removing inline
breaks the compile with a 'multiple definitions' error. One shouldn't put the definition of a stand-alone function in a header file that's part of multiple translation units without the inline keyword.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course I know that. The question is why isn't declared on the header and defined in the implementation file as usual. Seems like a big function to inline. Which adds bloat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks. small comment regarding inline.
* Speed up test_random.py:test_shuffle to get past CI timeouts. * Fix flakey test_operator.py:test_laop_6. * Break up broadcast_reduce_op_value.{cc,cu} * Revert "Fix flakey test_operator.py:test_laop_6." This reverts commit 479ba38. * Break up elemwise_unary_op_basic.{cc,cu}
Description
This PR can be thought of as an alternate proposal to PR #15882 as far as how to handle the 'out of heap space' errors of the windows CI. This PR follows the suggestion of @marcoabreu to break up the problematic large-compiles into smaller pieces, rather than moving the Windows compiler toolchain to 64-bit.
The PR breaks apart ./src/operator/tensor/broadcast_reduce_op_value.{cu,cc}, a top cause of the 'out of heap space' errors. Also, because of its mention in issue #13958, the PR breaks apart ./src/operator/tensor/elemwise_unary_op_basic.{cu,cc}. The PR also cherry-picks a test_shuffle speed-up to help get the PR through CI without timing out. The PR no longer modifies test_laop_6, which was seen to fail sporadically with access violations on a different PR.
The pitch: smaller compiler tasks are more easily parallelized over the CPU cores in a 'make -j N' scenario, though it's not clear if total compile time is reduced. It may decrease the time needed for a failing compile to complete- a help for those that like to isolate a current compile problem with a follow-up simple 'make' command.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments