[DO NOT MERGE] Added error handling in MXNet #19
base: mxnet_feature_fp16
Conversation
horovod/mxnet/handle_manager.cc (outdated):

```cpp
}

void HandleManager::AttachCallback(int handle, Callback cb) {
  std::unique_lock<std::mutex> lock(mutex_);
```
We can use `lock_guard` here.
updated
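The `lock_guard` suggestion above can be sketched as follows. This is a minimal standalone sketch with assumed members, not the real class in horovod/mxnet/handle_manager.cc: `std::lock_guard` suffices because the lock is held for the full scope and never unlocked early, which is the only extra capability `std::unique_lock` would add here.

```cpp
#include <functional>
#include <mutex>
#include <unordered_map>

using Callback = std::function<void()>;

// Simplified sketch of a handle manager (hypothetical members).
class HandleManagerSketch {
 public:
  void AttachCallback(int handle, Callback cb) {
    // RAII lock for the whole scope; no early unlock needed.
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_[handle] = std::move(cb);
  }

  bool HasCallback(int handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    return callbacks_.count(handle) > 0;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<int, Callback> callbacks_;
};
```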
horovod/mxnet/mpi_ops.cc (outdated):

```cpp
  auto device = TensorUtil::GetDevice(tensor);
  auto hvd_tensor = std::make_shared<MXTensor<NDArray>>(tensor);
  auto hvd_context = std::make_shared<MXOpContext<NDArray>>(device, output);
  auto hvd_output = std::make_shared<MXTensor<NDArray>>(output);
  handle_manager.AttachCallback(handle, cb);
```
nit: let's keep the formatting of this line consistent across the different functions. How about having one empty line before L52 and removing the empty line at L53?
horovod/mxnet/mpi_ops.cc (outdated):

```diff
@@ -72,37 +73,41 @@ void DoAllreduceCudaOnCPU(NDArray* tensor, NDArray* output, std::string& name,

   auto hvd_context = std::make_shared<MXOpContext<NDArray>>(
       CPU_DEVICE_ID, hvd_cpu_buffer->tensor());
   handle_manager.AttachCallback(handle, cb);
```
same as above
horovod/mxnet/mpi_ops.cc (outdated):

```diff
@@ -142,50 +149,57 @@ void DoBroadcast(NDArray* tensor, NDArray* output, int root_rank,
     hvd_output = std::make_shared<MXTensor<NDArray>>(output);
   }

   handle_manager.AttachCallback(handle, cb);
```
nit: remove L153?
horovod/mxnet/mpi_ops.cc (outdated):

```cpp
  // Make async copy of input tensor to CPU tensor and record completion event.
  auto hvd_context = std::make_shared<MXOpContext<NDArray>>(
      CPU_DEVICE_ID, hvd_cpu_buffer->tensor());
  auto ready_event =
      std::make_shared<MXReadyEvent<NDArray>>(hvd_cpu_buffer->tensor());

  handle_manager.AttachCallback(handle, cb);
```
nit: remove L175?
test/test_mxnet.py (outdated):

```python
    hvd.broadcast(tensor, 0)
    assert False, 'hvd.broadcast did not throw error'
except (MXNetError, RuntimeError) as e:
    print(e)
```
remove print or keep it?
test/test_mxnet.py (outdated):

```python
    hvd.broadcast(tensor, 0)
    assert False, 'hvd.broadcast did not throw error'
except (MXNetError, RuntimeError) as e:
    print(e)
```
remove print or keep it?
```python
    check_call(MPI_MXNET_LIB_CTYPES.horovod_mxnet_wait_and_clear(handle))
    output = _handle_map.pop(handle)
    return output
```
Is this output useful for the users? If not, we may not need to store the output in the `_handle_map`; we could even use a set to store the handles.
The output is useful because `allreduce()` and `broadcast()` need to return the tensor, and they return it by calling `synchronize()`.
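The point above can be illustrated with a hypothetical sketch (names like `Synchronize` and `handle_map` are stand-ins, not the real Horovod code): a plain set of handles would lose the handle-to-output association that `synchronize()` needs in order to return the tensor.

```cpp
#include <stdexcept>
#include <unordered_map>

// Stand-in for an output tensor.
struct Tensor { int id; };

// Map, not set: each handle must stay associated with its output.
std::unordered_map<int, Tensor*> handle_map;

Tensor* Synchronize(int handle) {
  auto it = handle_map.find(handle);
  if (it == handle_map.end()) {
    throw std::invalid_argument("unknown handle");
  }
  Tensor* output = it->second;  // the tensor allreduce()/broadcast() returns
  handle_map.erase(it);         // "pop": the entry is cleared after waiting
  return output;
}
```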
horovod/mxnet/mpi_ops.py (outdated):

```python
    handle = MPI_MXNET_LIB_CTYPES.horovod_mxnet_allreduce_async(
        c_in, c_out, name, ctypes.c_bool(average))

    _handle_map[handle] = (tensor, tensor)
```
_handle_map[handle] = tensor?
Good catch, this is a bug.
Let's discuss the PR in more detail.
```cpp
extern "C" int horovod_mxnet_wait_and_clear(int handle) {
  API_BEGIN();
  while (!handle_manager.PollHandle(handle)) {
```
I have concerns about this since it will introduce contention on the mutex.
Updated the sequence of `callback` and `markdone` as we discussed, so this should no longer introduce a race condition. Please review again.
Thanks. I will review after you push your changes.
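The ordering fix discussed above can be sketched as follows (assumed semantics, not Horovod's exact implementation): the engine runs the completion callback first and marks the handle done last, so once `PollHandle()` observes `true`, the waiter in `wait_and_clear()` knows the callback has fully finished and the handle can be cleared safely.

```cpp
#include <mutex>
#include <unordered_map>

// Tracks which handles have completed (hypothetical simplified class).
class DoneTracker {
 public:
  void MarkDone(int handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    done_[handle] = true;
  }
  bool PollHandle(int handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = done_.find(handle);
    return it != done_.end() && it->second;
  }
 private:
  std::mutex mutex_;
  std::unordered_map<int, bool> done_;
};

// Engine-side completion: callback first, MarkDone last.
template <typename Fn>
void Complete(DoneTracker& tracker, int handle, Fn callback) {
  callback();                // 1. deliver the result/status to the framework
  tracker.MarkDone(handle);  // 2. only now let the polling waiter proceed
}
```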
```python
        ctypes.byref(mx_handle)))

    _handle_map[mx_handle.value] = output
    return synchronize(mx_handle.value)
```
Is the plan to introduce true async functions later that could be used in `DistributedOptimizer` to improve performance? Would the current formulation still perform better than the reference parameter server?
Referring to these numbers:

| # GPUs | Without HA  | With HA     |
|-------:|------------:|------------:|
| 8      | 3072 (N/A)  | 3078 (N/A)  |
| 16     | 6027 (98%)  | 5859 (95%)  |
| 32     | 12030 (98%) | 11675 (95%) |
| 64     | 22346 (83%) | 23166 (94%) |
| 128    | 40938 (84%) | 45972 (93%) |
| 256    | 64998 (66%) | 89858 (91%) |
Yes, it is possible to remove the `synchronize()` call and rely on the MXNet engine to handle the dependencies between tasks. We plan to further improve performance after we merge the current stable PR into Horovod. Does that sound good to you?
Yep, sounds good. Do you have a sense of the scaling efficiency we'll see with the current version? I hope to include MXNet support in the upcoming 0.16.0 release next week, and I wanted to see if we can publish good scaling numbers with it.
@alsrgv We found that throughput is affected when we use `synchronize()`. I am reverting to the original implementation now. However, we need a better mechanism to catch the error status returned by Horovod. I am currently trying to introduce a `context` variable to store the `Status` inside the callback, just like what TensorFlow does. If you have a better suggestion, it would be greatly appreciated.
I think if MXNet has a mechanism to notify the framework about an op failure, which it then propagates to the user, that would be the best option to use.
@alsrgv I have been trying to leverage MXNet to catch the exception and propagate it to the user at the Python level for the past two days. However, I always hit `libc++abi.dylib: terminating with uncaught exception of type dmlc::Error`. I suspect there is a bug on the MXNet side in handling exceptions thrown from the engine callback.
In the meantime, do you think it would be okay to just log the error in Horovod for now? We will continue to improve this after MXNet support is merged into Horovod. Please let us know your thoughts. Thanks!
```diff
-                                        char* name, bool average) {
-  auto handle = handle_manager.AllocateHandle();
+                                        const char* name, bool average,
+                                        int* handle) {
```
Why do we use `int* handle` here? Can we not return the handle to the caller? If you are not returning it, the function's return type should be changed to `void`. The same applies to the other functions.
This function returns the status through `MX_API_END()` instead of the handle. The returned status is needed at the Python level when you call `check_call()`.
Also added unit tests.
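The out-parameter convention explained above can be sketched like this (an assumed shape, not the real Horovod signature): the C function's `int` return value is the status produced by MXNet's `MX_API_BEGIN()`/`MX_API_END()` macros, which Python's `check_call()` inspects, so the allocated handle must travel back through the `int*` parameter instead.

```cpp
// Hypothetical simplified version of the C API function discussed above.
extern "C" int horovod_mxnet_allreduce_async_sketch(int* handle) {
  // MX_API_BEGIN();  // in the real code this opens a try block
  *handle = 42;       // stand-in for handle_manager.AllocateHandle()
  return 0;           // MX_API_END() yields 0 on success, nonzero on error
}
```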