Support populating errors back to MXNet engine in callback #13922

yuxihu · 2019-01-17T22:34:31Z

This PR adds an optional dmlc::Error* argument to MXNet engine callback functions. Callers can use this argument to pass errors back to the MXNet engine through the callback so that the engine can handle them properly. One use case is propagating errors detected in Horovod back to the MXNet engine for the Horovod and MXNet integration. This change does not affect existing use cases.
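To illustrate the idea, here is a minimal sketch of a completion callback with an optional error argument. It does not use the MXNet or dmlc headers: `Error` is a stand-in for dmlc::Error (assumed only to be an exception type that carries a message), and `OnComplete` and `RunOp` are hypothetical names invented for this example, not the engine's actual classes.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Stand-in for dmlc::Error (assumption: an exception type carrying a message).
struct Error : public std::runtime_error {
  explicit Error(const std::string& msg) : std::runtime_error(msg) {}
};

// Hypothetical, simplified completion callback. The error argument defaults
// to nullptr, so call sites that invoke on_complete() without arguments
// keep compiling and keep their old behavior.
class OnComplete {
 public:
  explicit OnComplete(std::function<void(const Error*)> fn)
      : fn_(std::move(fn)) {
    assert(fn_);  // a callback target must be provided
  }
  void operator()(const Error* error = nullptr) const { fn_(error); }

 private:
  std::function<void(const Error*)> fn_;
};

// Hypothetical operation that reports its outcome through the callback.
// Returns what the receiving side recorded: "ok" or the error message.
std::string RunOp(bool fail) {
  std::string recorded = "ok";
  OnComplete on_complete([&recorded](const Error* e) {
    if (e != nullptr) recorded = e->what();
  });
  if (fail) {
    Error err("allreduce failed");
    on_complete(&err);  // new style: pass the error back
  } else {
    on_complete();      // old style: no argument means success
  }
  return recorded;
}
```

In this sketch, `RunOp(false)` takes the unchanged no-argument path, while `RunOp(true)` hands an error to the receiver, which is the behavior the PR description refers to when it says existing use cases are unaffected.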

apeforest · 2019-01-17T22:39:36Z

include/mxnet/engine.h

@@ -74,15 +74,15 @@ class CallbackOnComplete {
 public:
  // use implicit copy and assign
  /*! \brief involve the callback */
-  inline void operator()() const {
-    (*callback_)(engine_, param_);
+  inline void operator()(const char* error_msg = nullptr) const {


Instead of passing in a const string containing the error message, would it make sense to pass in an error struct that has the error message and an error code as fields? This leaves more room for proper error handling based on the error code.

Error codes are not universally defined across different libraries. In the Horovod case, the error types are mostly Horovod-specific. We convert all of those to dmlc::Error, which MXNet can catch.

Can we define an error data structure and export it? So all libraries that consume libmxnet will use this error data structure to pass data back.

This sounds like a good idea to make it extendable in the future. We can use the error message for now.

Use dmlc::Error instead of char*. We can change dmlc::Error if more information needs to be passed.
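The extensibility argument in this thread can be sketched as follows. This is an illustration only: `Error` is a stand-in for dmlc::Error, and both the `code` field and the `Describe` function are hypothetical and not part of dmlc or MXNet. The point is that a callback taking a pointer to an error struct keeps the same signature if fields are added to the struct later.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for dmlc::Error. The `code` field is a HYPOTHETICAL extension
// used only to show how more information could be carried later.
struct Error : public std::runtime_error {
  explicit Error(const std::string& msg, int error_code = 0)
      : std::runtime_error(msg), code(error_code) {
    // Hypothetical convention for this sketch: 0 means "unspecified",
    // positive values are library-specific.
    assert(error_code >= 0);
  }
  int code;
};

// Hypothetical receiver. Its signature is the same whether or not
// `code` exists, which a `const char*` parameter could not offer.
std::string Describe(const Error* error = nullptr) {
  if (error == nullptr) return "success";
  return std::string(error->what()) + " (code " +
         std::to_string(error->code) + ")";
}
```

A caller that only has a message constructs `Error("timeout")` and gets code 0; a caller with richer information passes a code as well, and neither case requires changing the callback type.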

stu1130 · 2019-01-17T23:58:36Z

Nice work @yuxihu
@mxnet-label-bot add [pr-awaiting-review]

yuxihu · 2019-01-18T00:18:10Z

@eric-haibin-lin please take a look.

apeforest

LGTM

larroy · 2019-01-18T15:10:47Z

src/engine/threaded_engine.cc

  OprBlock *opr_block = static_cast<OprBlock*>(opr_block_);
  ThreadedOpr *threaded_opr = opr_block->opr;
+  if (error != nullptr) {
+    auto ex_p = std::make_exception_ptr(*error);
+    threaded_opr->opr_exception = std::make_shared<std::exception_ptr>(ex_p);


Isn't exception_ptr already reference counted? Do we need to wrap it again in make_shared?

Good catch. I will change related sites by removing the redundant shared_ptr in a separate PR.
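For context on the question above: the C++ standard specifies std::exception_ptr as a shared-ownership handle, so copies refer to the same exception object and keep it alive. The sketch below stores the std::exception_ptr directly, without a std::shared_ptr wrapper. `Error` is a stand-in for dmlc::Error, and `OprRecord`, `OnComplete`, and `Rethrow` are hypothetical names for this example, not the threaded engine's actual types.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Stand-in for dmlc::Error (assumption: an exception type carrying a message).
struct Error : public std::runtime_error {
  explicit Error(const std::string& msg) : std::runtime_error(msg) {}
};

// Hypothetical operator record holding the exception_ptr directly.
struct OprRecord {
  std::exception_ptr opr_exception;  // null when no error was reported
};

// Mirrors the shape of the diff: capture the error only if one was passed.
void OnComplete(OprRecord* opr, const Error* error) {
  assert(opr != nullptr);
  if (error != nullptr) {
    opr->opr_exception = std::make_exception_ptr(*error);
  }
}

// Rethrows the stored exception, if any, and returns its message.
std::string Rethrow(const OprRecord& opr) {
  if (!opr.opr_exception) return "";
  try {
    std::rethrow_exception(opr.opr_exception);
  } catch (const Error& e) {
    return e.what();
  }
  return "";  // not reached for Error; other exception types propagate
}
```

Copying an `OprRecord` copies the handle, not the exception, and both copies compare equal. Whether the extra std::shared_ptr layer in the engine is needed for other reasons is left to the follow-up PR mentioned in the reply.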

larroy

See previous question. Is there a test for this?

larroy

As discussed, LGTM. It would be nice to remove the shared_ptr in a separate PR if it's redundant.

* add an optional error_msg in engine on_complete callback
* use dmlc::Error struct to make error population extendable

add an optional error_msg in engine on_complete callback

5246b1f

yuxihu requested a review from anirudh2290 as a code owner January 17, 2019 22:34

apeforest reviewed Jan 17, 2019

yuxihu mentioned this pull request Jan 17, 2019

Handle horovod errors ctcyang/horovod#24

Merged

marcoabreu added the pr-awaiting-review label (PR is waiting for code review) Jan 17, 2019

use dmlc::Error struct to make error population extendable

9498a74

apeforest approved these changes Jan 18, 2019

larroy reviewed Jan 18, 2019

larroy suggested changes Jan 18, 2019

larroy approved these changes Jan 18, 2019

marcoabreu merged commit 0c85665 into apache:master Jan 18, 2019

yuxihu deleted the hvd_mx_error branch January 18, 2019 18:20
