Balance Allocation for Multiple Cards On Device KVStore #252
Comments
This could be due to the memory allocation policy used by the KVStore under kvstore_type = 'device'; I suspect it should not happen with kvstore_type = 'local'. With the device-type KVStore we need to allocate temporary reduction memory on each device, and we currently do this by random assignment (here: https://github.com/dmlc/mxnet/blob/master/src/kvstore/kvstore_device.h#L36) to balance the temporary weight memory across devices. If the weights are not uniformly sized (e.g. one weight is a particularly big chunk), this can cause the imbalance.
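A minimal sketch of the random placement described above (illustrative only, not the actual code in kvstore_device.h; the function name and signature are made up for this example): each key's reduction buffer is placed on a device chosen at random, so the per-device totals depend on how the large keys happen to land.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Returns, for each key, the index of the device that holds its temporary
// reduction buffer. `sizes[i]` is the size of key i; placement is random,
// so large keys can pile up on a single device.
std::vector<size_t> RandomAssign(const std::vector<size_t>& sizes,
                                 size_t num_devices, unsigned seed = 0) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<size_t> pick(0, num_devices - 1);
  std::vector<size_t> device_of_key(sizes.size());
  for (size_t i = 0; i < sizes.size(); ++i) {
    device_of_key[i] = pick(rng);
  }
  return device_of_key;
}
```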
Yes, I am using kvstore=device. Could a more deterministic allocation strategy be added? For example, go through the weights greedily from largest to smallest and always place the next one on the device that currently holds the least weight. I think many people like to fill GPU memory as fully as possible, and with random assignment there is the worry of occasionally running out of memory. Even if the assignment stays random, it would be good to at least have a default seed for the memory allocation, so memory usage does not change from run to run :)
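A minimal sketch of the deterministic greedy strategy proposed above (largest-first bin packing): sort keys by size in descending order and always place the next key on the device with the smallest total so far. This is an illustration under assumed names, not a patch to kvstore_device.h.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Deterministic assignment: largest keys first, each to the currently
// least-loaded device. `sizes[i]` is the size of key i.
std::vector<size_t> GreedyAssign(const std::vector<size_t>& sizes,
                                 size_t num_devices) {
  std::vector<size_t> order(sizes.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](size_t a, size_t b) { return sizes[a] > sizes[b]; });

  std::vector<size_t> load(num_devices, 0);        // total size per device
  std::vector<size_t> device_of_key(sizes.size());
  for (size_t key : order) {
    size_t dev = std::min_element(load.begin(), load.end()) - load.begin();
    device_of_key[key] = dev;
    load[dev] += sizes[key];
  }
  return device_of_key;                            // same result every run
}
```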
This seems like a good idea. The allocation strategy code is here: https://github.com/dmlc/mxnet/blob/master/src/kvstore/kvstore_device.h#L36 Maybe you could hack it a bit and contribute the change back :)?
Yes, I will.
Another possible way: do random assignment if the size < bigarray_bound_; otherwise, evenly split the array into num_dev parts and assign one part to each device.
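A minimal sketch of the splitting half of that idea, assuming a key whose size is at or above bigarray_bound_: the array is cut into num_devices contiguous parts, one per device. The `Slice` struct and function name are made up for illustration and are not MXNet API.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Slice { size_t begin, end, device; };

// Evenly split a big key of `size` elements into one contiguous slice per
// device; the last slice may be shorter when size is not divisible.
std::vector<Slice> SplitBigKey(size_t size, size_t num_devices) {
  std::vector<Slice> slices;
  size_t step = (size + num_devices - 1) / num_devices;  // ceil division
  for (size_t d = 0; d < num_devices; ++d) {
    size_t b = d * step;
    size_t e = std::min(size, b + step);
    if (b < e) slices.push_back({b, e, d});
  }
  return slices;
}
```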
This dep checking of slice might block other slices, as we do not
I am training an 11-layer VGGNet with a total batch size of 36 on three GTX 780 GPUs, conv workspace set to 256, CUDA 7.0, and without cuDNN (which I felt would avoid some nondeterminism in memory allocation). I see GPU memory usage of 2795 MB, 2661 MB, and 2383 MB. Where does this decreasing gap of 100+ MB between cards come from? I would like the difference to be smaller so GPU memory is used more efficiently: since the cards are identical, usable memory is limited by the GPU with the highest usage, and when the batch size is too small my data sometimes does not converge :(