[MXNET-1111] Horovod support for MXNet #12666
Changes from 9 commits: 3d63950, ab2225f, 64c6e1f, 772969b, f84ced1, d6b02e8, 5694aed, 67b14ce, e50d25f, bd25f65, f6f9cf3
@@ -161,7 +161,7 @@ def output_shapes(self):
        assert self.binded
        return self._curr_module.output_shapes

-    def get_params(self):
+    def get_params(self, copy_to_cpu=True):

Review comment: Why is this copy_to_cpu argument needed here, since it is not used? It is not declared in BaseModule either.

Review comment: Please check line 174 in this file: params = self._curr_module.get_params(). Do we need to change it to params = self._curr_module.get_params(copy_to_cpu)?

        """Gets current parameters.

        Returns
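To make the reviewer's question concrete, the following is a minimal, hypothetical sketch (editor's illustration, not code from this PR) of how a wrapper module could thread the new flag through to the module that actually owns the executors; the _curr_module attribute mirrors the snippet quoted in the comment above.

    # Hypothetical forwarding sketch, not code from this PR: a wrapper module's
    # get_params() passes copy_to_cpu through to the underlying module so the
    # flag is not silently ignored.
    def get_params(self, copy_to_cpu=True):
        assert self.binded and self.params_initialized
        return self._curr_module.get_params(copy_to_cpu=copy_to_cpu)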
@@ -412,7 +412,7 @@ def set_params(self, arg_params, aux_params, allow_extra=False):
        for exec_ in self.execs:
            exec_.copy_params_from(arg_params, aux_params, allow_extra_params=allow_extra)

-    def get_params(self, arg_params, aux_params):
+    def get_params(self, arg_params, aux_params, copy_to_cpu):
        """ Copy data from each executor to `arg_params` and `aux_params`.

        Parameters

@@ -421,17 +421,29 @@ def get_params(self, arg_params, aux_params):
            Target parameter arrays.
        aux_params : list of NDArray
            Target aux arrays.
+       copy_to_cpu : boolean
+           Whether or not to copy parameters to CPU (default is True).

        Notes
        -----
        - This function will inplace update the NDArrays in arg_params and aux_params.
        """
        for name, block in zip(self.param_names, self.param_arrays):
-           weight = sum(w.copyto(ctx.cpu()) for w in block) / len(block)
+           if copy_to_cpu:
+               context = ctx.cpu()
+           else:
+               context = block[0].context
+           weight = sum(w.copyto(context) for w in block) / len(block)
            weight.astype(arg_params[name].dtype).copyto(arg_params[name])
+           arg_params[name] = arg_params[name].as_in_context(context)
        for name, block in zip(self.aux_names, self.aux_arrays):
-           weight = sum(w.copyto(ctx.cpu()) for w in block) / len(block)
+           if copy_to_cpu:
+               context = ctx.cpu()
+           else:
+               context = block[0].context
+           weight = sum(w.copyto(context) for w in block) / len(block)
            weight.astype(aux_params[name].dtype).copyto(aux_params[name])
+           aux_params[name] = aux_params[name].as_in_context(context)

Review comment: It seems this
Reply: Fixed

        def forward(self, data_batch, is_train=None):
            """Split `data_batch` according to workload and run forward on each devices.
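For readers unfamiliar with the Module API, here is a short sketch (editor's illustration, not part of the PR) of the caller-side difference the flag is meant to expose: by default get_params() returns host-side copies of the averaged weights, whereas the copy_to_cpu=False path proposed here would leave them on the executors' own device so a GPU-aware reduction such as Horovod/NCCL can consume them without an extra host round trip.

    import mxnet as mx

    # A tiny network so the sketch is self-contained.
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data, num_hidden=10, name='fc1')

    mod = mx.mod.Module(symbol=net, data_names=['data'], label_names=None,
                        context=mx.cpu())          # use mx.gpu(0) on a GPU machine
    mod.bind(data_shapes=[('data', (32, 100))], for_training=False)
    mod.init_params()

    # Existing behaviour: parameters are averaged across devices and returned
    # as CPU-resident NDArray dicts.
    arg_cpu, aux_cpu = mod.get_params()

    # Behaviour proposed in this PR (hypothetical until merged): keep the
    # averaged parameters on the device context instead of copying to CPU.
    # arg_dev, aux_dev = mod.get_params(copy_to_cpu=False)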
@@ -93,7 +93,7 @@ def output_shapes(self):
    ################################################################################
    # Parameters of a module
    ################################################################################
-    def get_params(self):
+    def get_params(self, copy_to_cpu=True):

Review comment: Why is this argument needed here?

        """Gets parameters, those are potentially copies of the the actual parameters used
        to do computation on the device. Subclass should override this method if contains
        parameters.
@@ -149,7 +149,7 @@ def output_shapes(self):
        assert self.binded
        return self._modules[-1].output_shapes

-    def get_params(self):
+    def get_params(self, copy_to_cpu=True):

Review comment: Why is this argument needed here?

Review comment: Please check line 167 in this file: arg, aux = module.get_params(). Do we need to pass in the copy_to_cpu value when calling get_params, i.e. arg, aux = module.get_params(copy_to_cpu)?

        """Gets current parameters.

        Returns
@@ -285,9 +285,9 @@ inline bool ImageRecordIOParser2<DType>::ParseNext(DataBatch *out) {
      shape_vec.push_back(param_.label_width);
      TShape label_shape(shape_vec.begin(), shape_vec.end());

-      out->data.at(0) = NDArray(data_shape, Context::CPUPinned(0), false,
+      out->data.at(0) = NDArray(data_shape, Context::CPU(0), false,
                                 mshadow::DataType<DType>::kFlag);
-      out->data.at(1) = NDArray(label_shape, Context::CPUPinned(0), false,
+      out->data.at(1) = NDArray(label_shape, Context::CPU(0), false,
                                 mshadow::DataType<real_t>::kFlag);
      unit_size_[0] = param_.data_shape.Size();
      unit_size_[1] = param_.label_width;

Review comment: For my own understanding, why do we change from CPUPinned to CPU in this PR? Is it going to cause issues if we stick to CPUPinned when conducting distributed training with Horovod?

Reply: In my benchmark, it doesn't affect performance. You may recall we can't use

Review comment: @ctcyang what did you benchmark? Did you also run kvstore('local') and kvstore('nccl') to verify the perf impact?

Reply: In my benchmark, I ran kvstore('nccl') and kvstore('device') and there is no perf impact before and after the change. I did not test kvstore('local').

Review comment: I'm afraid that the data copy will more likely become the bottleneck as GPUs become faster in the next generation. What about adding a ctx option like https://mxnet.incubator.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=logis#mxnet.ndarray.zeros which defaults to cpu_pinned(0)? Horovod users who have memory issues could then pass either ctx=mx.cpu() or mx.cpu_pinned(local_rank()).
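The ctx option suggested in the last comment does not exist in this PR; the sketch below (editor's illustration) only shows the two host contexts being discussed and, commented out, what such an iterator option might look like. Note that mx.cpu_pinned is only available in sufficiently recent MXNet builds.

    import mxnet as mx

    # Page-locked (pinned) host memory allows faster, asynchronous host-to-GPU
    # copies; ordinary pageable host memory avoids the pinned-memory footprint.
    pinned = mx.nd.zeros((3, 224, 224), ctx=mx.cpu_pinned(0))
    pageable = mx.nd.zeros((3, 224, 224), ctx=mx.cpu(0))
    print(pinned.context, pageable.context)

    # Hypothetical iterator option (not in this PR), defaulting to pinned memory;
    # Horovod users under memory pressure could pass ctx=mx.cpu() or
    # ctx=mx.cpu_pinned(hvd.local_rank()) instead.
    # train_iter = mx.io.ImageRecordIter(path_imgrec='train.rec',
    #                                    data_shape=(3, 224, 224), batch_size=128,
    #                                    ctx=mx.cpu_pinned(0))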
@@ -64,7 +64,7 @@ class KVStoreNCCL : public KVStoreLocal {
  KVStoreNCCL() : KVStoreLocal() {
    // Due to aggregation, we do not use the Comm interface
    comm_ = nullptr;
-   pinned_ctx_ = Context::CPUPinned(0);
+   pinned_ctx_ = Context::CPU(0);
    inited_ = false;
  }

Review comment: cc @ptrendx @DickJC123, is this OK for NCCL?

Reply: The changes in
Review comment: For all the places where you added this new param, could you please update the function comments to explain what copy_to_cpu is for, like what you did in executor_group.py? It may also be good to add some explanation of why it is not used here and where it is going to be used. Readers might otherwise share the confusion about why the param is in the function signature but not used in the implementation.

Review comment: As we discussed offline, based on our tests of distributed training on 1-node and 2-node p3.16xlarge instances, the training accuracy looks normal. We also saved and compared the module parameters from one rank on each node in the 2-node scenario, and they look identical. I think we can remove this copy_to_cpu argument change.
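For context on how these pieces fit together, here is a condensed, hedged sketch of Module-based training with Horovod. It follows the pattern later published in Horovod's MXNet examples and assumes a Horovod build with MXNet support (horovod.mxnet), which shipped after this PR; the synthetic network and data are placeholders, not part of the change.

    import mxnet as mx
    import horovod.mxnet as hvd

    hvd.init()
    ctx = mx.gpu(hvd.local_rank())    # one GPU per Horovod process; mx.cpu() also works

    # Tiny synthetic problem so the sketch runs end to end.
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    net = mx.sym.SoftmaxOutput(mx.sym.FullyConnected(data, num_hidden=10), label=label)
    train_iter = mx.io.NDArrayIter(mx.nd.random.uniform(shape=(1000, 100)),
                                   mx.nd.zeros((1000,)), batch_size=50,
                                   label_name='softmax_label')

    model = mx.mod.Module(symbol=net, context=ctx)
    model.bind(data_shapes=train_iter.provide_data,
               label_shapes=train_iter.provide_label)
    model.init_params(mx.init.Xavier())

    # Scale the learning rate by the worker count and wrap the optimizer so
    # gradients are allreduced across processes.
    opt = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)

    # Broadcast rank 0's initial parameters so every worker starts identically;
    # get_params() returns the arg/aux dicts discussed in the diffs above.
    arg_params, aux_params = model.get_params()
    if arg_params is not None:
        hvd.broadcast_parameters(arg_params, root_rank=0)
    if aux_params is not None:
        hvd.broadcast_parameters(aux_params, root_rank=0)
    model.set_params(arg_params, aux_params)

    model.fit(train_iter, optimizer=opt, num_epoch=1, kvstore=None)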