This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-1111] Horovod support for MXNet #12666

Merged: 11 commits, Oct 29, 2018
4 changes: 2 additions & 2 deletions src/io/iter_image_recordio_2.cc
@@ -285,9 +285,9 @@ inline bool ImageRecordIOParser2<DType>::ParseNext(DataBatch *out) {
shape_vec.push_back(param_.label_width);
TShape label_shape(shape_vec.begin(), shape_vec.end());

- out->data.at(0) = NDArray(data_shape, Context::CPUPinned(0), false,
+ out->data.at(0) = NDArray(data_shape, Context::CPU(0), false,
Member
For my own understanding, why do we change from CPUPinned to CPU in this PR? Is it going to cause issues if we stick to CPUPinned when conducting distributed training with Horovod?

@ctcyang (Contributor Author), Oct 23, 2018

In my benchmark, it doesn't affect performance.

You may recall we can't use horovod.local_rank() to determine the GPU id, because MXNet cannot have a compile-time dependency on Horovod. But if all 8 worker processes on GPUs 0-7 use CPUPinned(0), you get 8 processes each starting a CUDA driver context on GPU 0, which wastes memory and prevents you from using the largest batch size.
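The failure mode described above can be sketched in plain Python (no MXNet or Horovod imports; `pinned_device_for` is a hypothetical helper for illustration, not part of the PR): with a hard-coded `CPUPinned(0)`, every local rank's pinned-memory allocation initializes a CUDA context on GPU 0, so an 8-GPU machine ends up with 8 driver contexts on one card.

```python
# Illustrative sketch only: which GPU ends up hosting each worker's CUDA
# context when pinned memory is requested. hardcoded=True models the old
# Context::CPUPinned(0) behaviour; False models pinning per local rank.
def pinned_device_for(local_rank, hardcoded=True):
    return 0 if hardcoded else local_rank

# 8 single-GPU worker processes on one machine, all using CPUPinned(0):
devices = [pinned_device_for(r) for r in range(8)]
assert devices == [0] * 8  # all 8 CUDA contexts pile up on GPU 0
```

Switching to `Context::CPU(0)` sidesteps this, because ordinary (non-pinned) CPU memory never touches the CUDA driver.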

Member

@ctcyang what did you benchmark? Did you also run kvstore('local') and kvstore('nccl') to verify the perf impact?

@ctcyang (Contributor Author), Oct 27, 2018

In my benchmark, I ran kvstore('nccl') and kvstore('device'), and there was no perf impact before or after the change. I did not test kvstore('local').

Member

I'm afraid the data copy will more likely become the bottleneck as GPUs get faster in the next generation. What about adding a ctx option like https://mxnet.incubator.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=logis#mxnet.ndarray.zeros that defaults to cpu_pinned(0)? Then Horovod users who hit memory issues can pass either ctx=mx.cpu() or ctx=mx.cpu_pinned(local_rank()).
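The suggestion above can be sketched as follows (plain Python; the option name `ctx`, the helper function, and the string-valued contexts are assumptions for illustration, not the merged API):

```python
# Hypothetical resolution logic for an iterator-level ctx option that
# defaults to pinned memory on device 0 but lets users override it.
def iterator_output_ctx(ctx=None, local_rank=None):
    if ctx is not None:
        return ctx                          # explicit override, e.g. "cpu(0)"
    if local_rank is not None:
        return f"cpu_pinned({local_rank})"  # pin against this rank's GPU
    return "cpu_pinned(0)"                  # proposed default
```

In real MXNet/Horovod code the override would be `ctx=mx.cpu()` or `ctx=mx.cpu_pinned(hvd.local_rank())`, as the comment suggests.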

  mshadow::DataType<DType>::kFlag);
- out->data.at(1) = NDArray(label_shape, Context::CPUPinned(0), false,
+ out->data.at(1) = NDArray(label_shape, Context::CPU(0), false,
  mshadow::DataType<real_t>::kFlag);
unit_size_[0] = param_.data_shape.Size();
unit_size_[1] = param_.label_width;