This repository has been archived by the owner on Nov 17, 2023. It is now read-only.
fix custom exception handling #14575
Closed
Showing changes from 5 of 7 commits.
Commits:
da74a63 fix custom exception handling (arcadiaphy)
2b1da9e add test (arcadiaphy)
e43f13d simplify catch exception (arcadiaphy)
5bce27e add comment (arcadiaphy)
23aee98 remove custom from engine (arcadiaphy)
963be55 add sync func (arcadiaphy)
931aca9 unlimited custom thread (arcadiaphy)
    @@ -29,6 +29,7 @@
     from mxnet.test_utils import *
     from mxnet.base import py_str, MXNetError, _as_list
     from common import setup_module, with_seed, teardown, assert_raises_cudnn_not_satisfied, assertRaises
    +from nose.tools import assert_raises
     import unittest
     import os

    @@ -5200,29 +5201,29 @@ def create_operator(self, ctx, shapes, dtypes):

         # test custom operator fork
         # see https://github.com/apache/incubator-mxnet/issues/14396
    -    if not sys.platform.startswith('win'): # no fork in windows
    -        class AdditionOP(mx.operator.CustomOp):
    -            def __init__(self):
    -                super(AdditionOP, self).__init__()
    -            def forward(self, is_train, req, in_data, out_data, aux):
    -                out_data[0][:] = in_data[0] + in_data[1]
    -            def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
    -                in_grad[0][:] = out_grad[0]
    -                in_grad[1][:] = out_grad[0]
    -
    -        @mx.operator.register("AdditionOP")
    -        class AdditionOPProp(mx.operator.CustomOpProp):
    -            def __init__(self):
    -                super(AdditionOPProp, self).__init__()
    -            def list_arguments(self):
    -                return ['a', 'b']
    -            def list_outputs(self):
    -                return ['output']
    -            def infer_shape(self, in_shape):
    -                return in_shape, [in_shape[0]]
    -            def create_operator(self, ctx, shapes, dtypes):
    -                return AdditionOP()
    +    class AdditionOP(mx.operator.CustomOp):
    +        def __init__(self):
    +            super(AdditionOP, self).__init__()
    +        def forward(self, is_train, req, in_data, out_data, aux):
    +            out_data[0][:] = in_data[0] + in_data[1]
    +        def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
    +            in_grad[0][:] = out_grad[0]
    +            in_grad[1][:] = out_grad[0]
    +
    +    @mx.operator.register("AdditionOP")
    +    class AdditionOPProp(mx.operator.CustomOpProp):
    +        def __init__(self):
    +            super(AdditionOPProp, self).__init__()
    +        def list_arguments(self):
    +            return ['a', 'b']
    +        def list_outputs(self):
    +            return ['output']
    +        def infer_shape(self, in_shape):
    +            return in_shape, [in_shape[0]]
    +        def create_operator(self, ctx, shapes, dtypes):
    +            return AdditionOP()

    +    if not sys.platform.startswith('win'): # no fork in windows
             def custom_add():
                 a = mx.nd.array([1, 2, 3])
                 b = mx.nd.array([4, 5, 6])

    @@ -5237,6 +5238,18 @@ def custom_add():
             p.join(5)
             assert not p.is_alive(), "deadlock may exist in custom operator"

    +    # test except handling
    +    # see https://github.com/apache/incubator-mxnet/pull/14575
    +    def custom_add_exc():
    +        a = mx.nd.array([1, 2, 3])
    +        b = mx.nd.array([4, 5])
    +        # trigger exception by providing unmatched operand shapes
    +        c = mx.nd.Custom(a, b, op_type='AdditionOP')
    +        c.wait_to_read()
    +
    +    assert_raises(MXNetError, custom_add_exc)
    +
    +
     @with_seed()
     def test_psroipooling():
         for num_rois in [1, 2]:

Review comments on `def custom_add_exc():`

Can we add a comment that an exception is expected, due to the shapes I assume?

Sure, I'll add it.

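The new test calls `wait_to_read()` inside the function passed to `assert_raises`, presumably because the operator call only queues work and the failure becomes visible at the synchronization point. The sketch below is a pure-Python analogy built on `concurrent.futures`, not MXNet code; `checked_add` and `lazy_add` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def checked_add(a, b):
    """Element-wise add that rejects operands of different lengths."""
    if len(a) != len(b):
        raise ValueError(f"shape mismatch: {len(a)} vs {len(b)}")
    return [x + y for x, y in zip(a, b)]


def lazy_add(pool, a, b):
    """Queue the addition and return at once, like an asynchronous operator call."""
    return pool.submit(checked_add, a, b)


with ThreadPoolExecutor(max_workers=1) as pool:
    ok = lazy_add(pool, [1, 2, 3], [4, 5, 6])
    bad = lazy_add(pool, [1, 2, 3], [4, 5])  # nothing is raised at call time

    ok_result = ok.result()
    try:
        bad.result()  # the error surfaces only when we wait for the result
        surfaced = None
    except ValueError as exc:
        surfaced = str(exc)
```

A test that only issued the call and never waited would pass even when the operator fails, which is why the wait belongs inside the function under test.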
I think we can solve both 1 and 2 this way: after func is called, do wait_to_read on all elements in arrs, then catch and save any exception. Remove lines 104 and 105. In PushSync, check whether an exception is set and rethrow it; also catch it in PushSync, call async_on_complete, and return.
Something like the following:
Thanks to the support added for Horovod in #13932, we may be able to leverage this to call async_on_complete with the error.
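The catch, save, and rethrow idea in this comment can be illustrated with a small Python sketch. It is only an analogy under stated assumptions, not the engine's C++ code: `DeferredResult`, `push`, and `add` are hypothetical names, and setting the event in `finally` stands in for always signalling completion, even when the function failed.

```python
import threading


class DeferredResult:
    """Holds a value or an exception produced on a worker thread."""

    def __init__(self):
        self._done = threading.Event()
        self._value = None
        self._error = None

    def run(self, func, *args):
        """Worker side: catch and save the exception instead of losing it."""
        try:
            self._value = func(*args)
        except Exception as exc:
            self._error = exc
        finally:
            self._done.set()  # always signal completion, success or failure

    def wait_to_read(self):
        """Caller side: block until done, then rethrow any saved exception."""
        self._done.wait()
        if self._error is not None:
            raise self._error
        return self._value


def push(func, *args):
    """Start func on a separate thread and return a handle right away."""
    result = DeferredResult()
    threading.Thread(target=result.run, args=(func, *args)).start()
    return result


def add(a, b):
    if len(a) != len(b):
        raise ValueError("operand shapes do not match")
    return [x + y for x, y in zip(a, b)]


good = push(add, [1, 2, 3], [4, 5, 6]).wait_to_read()

try:
    push(add, [1, 2, 3], [4, 5]).wait_to_read()
    rethrown = False
except ValueError:
    rethrown = True
```

Without the saved exception the worker thread would die quietly and the caller would either hang or read stale data, which is the failure mode the test above guards against.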

Adding wait_to_read in the custom op can solve 1 and 2, and the op can then be treated as a normal op without using ExecType::kAsync.

We probably still need PushSync for the sparse NDArray updates.

We still need ExecType::kAsync. The custom operator is still async: when push is called, it just pushes the work into its custom op worker queue for execution later. Marking it async ensures that threaded_engine_pooled and threaded_engine_per_device treat it as a special case and execute it immediately instead of pushing the work again to one of the engine worker thread queues. Pushing to an engine worker thread queue is unnecessary for a custom op.
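A toy model of the dispatch rule described in this comment, as a rough sketch: work marked async runs immediately on the pushing thread, and everything else is queued for an engine worker. `ToyEngine`, `is_async`, and `custom_op_queue` are hypothetical names; the real engines are C++ and considerably more involved.

```python
import queue


class ToyEngine:
    """Toy dispatcher for illustration only; not the MXNet engine."""

    def __init__(self):
        self.engine_queue = queue.Queue()  # work waiting for an engine worker
        self.trace = []

    def push(self, name, fn, is_async=False):
        if is_async:
            # Special case: run on the pushing thread right away.
            self.trace.append((name, "ran immediately"))
            fn()
        else:
            # Normal case: hand the work to an engine worker thread.
            self.trace.append((name, "queued for engine worker"))
            self.engine_queue.put(fn)


custom_op_queue = queue.Queue()  # stands in for the custom op worker queue
engine = ToyEngine()

# The custom op's engine function only enqueues the real work elsewhere,
# so queueing it a second time on an engine worker would add nothing.
engine.push("custom_op", lambda: custom_op_queue.put("forward"), is_async=True)
engine.push("dense_add", lambda: None)
```

In this model the custom op's work ends up on its own queue after a single hop, while the ordinary op waits for an engine worker.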

After testing, ExecType::kAsync really is needed: adding wait_to_read in the engine worker thread causes a deadlock.
PushSync, however, can be removed, and everything still works well.
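One plausible way such a deadlock arises is a worker thread blocking on work that can only run on that same worker. The sketch below reproduces the pattern with a single-worker `ThreadPoolExecutor`; it is an analogy rather than the engine's actual code path, and the timeout exists only so the demonstration terminates.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_demo():
    pool = ThreadPoolExecutor(max_workers=1)  # a single worker thread

    def inner():
        return "inner done"

    def outer():
        # Queued behind `outer` itself, on the same and only worker.
        dependent = pool.submit(inner)
        try:
            # Waiting here blocks the only worker, so `inner` cannot start.
            return dependent.result(timeout=0.2)
        except FutureTimeout:
            return "would deadlock without the timeout"

    outcome = pool.submit(outer).result()
    pool.shutdown(wait=True)  # `inner` runs once `outer` has returned
    return outcome


outcome = run_demo()
```

Running the wait on a thread outside the pool avoids the cycle, which matches the observation that the custom op has to stay off the engine worker threads.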

We probably still need it for sparse. For sparse we are updating the chunk, which is a write operation, so WaitToRead may not be enough.

Yes, I also added WaitToWrite to make sure no exceptions are left out.