resnet cpp-package test is broken #14406

Closed
anirudh2290 opened this issue Mar 13, 2019 · 17 comments
Labels: Bug, C++, Example

Comments

@anirudh2290
Member

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14397/5/pipeline

After adding WaitAll support, the resnet example is failing with a cudaMalloc out-of-memory error.

@anirudh2290 added the Bug and C++ labels on Mar 13, 2019
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Example

@wkcn
Member

wkcn commented Mar 13, 2019

I wonder how much GPU memory the CI machine has.
The input shape is (50, 3, 224, 224), which may trigger OOM.
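
For a rough sense of scale (back-of-the-envelope, not taken from the example code), the input batch itself is tiny; the memory pressure comes from the intermediate activations ResNet keeps around at batch size 50:

```cpp
#include <cstdio>

int main() {
  // Size of a (50, 3, 224, 224) float32 input batch.
  const long long elems = 50LL * 3 * 224 * 224;
  const double mb = elems * 4 / (1024.0 * 1024.0);
  std::printf("input batch: %.1f MB\n", mb);  // ~28.7 MB
  // The input alone is small; the OOM comes from the per-layer activations
  // and workspace that ResNet allocates at this batch size.
  return 0;
}
```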

@wkcn
Member

wkcn commented Mar 13, 2019

In addition, the model in cpp-package does not seem to converge.

@anirudh2290
Member Author

I think it's running on a p3.8xlarge, which should be sufficient to run this test. @marcoabreu can you confirm?

@anirudh2290
Member Author

In addition, the model in cpp-package does not seem to converge.

Yes, I observed that too.

@wkcn
Member

wkcn commented Mar 13, 2019

Since the input shape of ResNet is (3, 224, 224), I resized the MNIST images from (1, 28, 28) to (3, 224, 224).
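
Roughly what that preprocessing amounts to (a minimal sketch in plain C++ using nearest-neighbour upsampling and channel replication; the example's actual resize path may differ):

```cpp
#include <vector>

// Upsample a single-channel 28x28 MNIST image to 3x224x224 by
// nearest-neighbour sampling, replicating the grayscale channel three
// times (sketch only; not the code used in resnet.cpp).
std::vector<float> ResizeMnistTo224(const std::vector<float>& src) {
  const int kSrc = 28, kDst = 224, kChannels = 3;
  std::vector<float> dst(kChannels * kDst * kDst);
  for (int c = 0; c < kChannels; ++c) {
    for (int y = 0; y < kDst; ++y) {
      for (int x = 0; x < kDst; ++x) {
        const int sy = y * kSrc / kDst;
        const int sx = x * kSrc / kDst;
        dst[(c * kDst + y) * kDst + x] = src[sy * kSrc + sx];
      }
    }
  }
  return dst;
}
```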

@marcoabreu
Contributor

We run on a g3.8xlarge.

@wkcn
Member

wkcn commented Mar 13, 2019

Changing the batch size to a smaller value will address the OOM issue.
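
For example (hypothetical values; the paths and parameter names follow the usual cpp-package MXDataIter/MNISTIter setup and may not match resnet.cpp exactly):

```cpp
#include "mxnet-cpp/MxNetCpp.h"
using namespace mxnet::cpp;

int main() {
  // Hypothetical: drop the batch size from 50 to something the 8 GB
  // Tesla M60 in a g3.8xlarge can hold.
  const int batch_size = 8;
  auto train_iter = MXDataIter("MNISTIter")
      .SetParam("image", "./data/mnist_data/train-images-idx3-ubyte")
      .SetParam("label", "./data/mnist_data/train-labels-idx1-ubyte")
      .SetParam("batch_size", batch_size)
      .SetParam("shuffle", 1)
      .CreateDataIter();
  // ... build the ResNet symbol and run the training loop as before ...
  MXNotifyShutdown();
  return 0;
}
```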

@leleamol
Contributor

leleamol commented Mar 13, 2019

@marcoabreu There have been no recent changes to alexnet.cpp, resnet.cpp, or the cpp-package.
Were there any changes to the underlying CUDA or MXNet implementation?

These tests were part of the CI suite and have been passing before. We could change the examples so that they pass on lower-capacity instances, but in my opinion that would not be the right solution.

@ddavydenko
Contributor

Has the infrastructure these tests run on changed recently? It seems the test would run fine on a p3.8xl but fails on a g3.8x (legacy hardware)... @marcoabreu

@anirudh2290
Member Author

As I said, this started with the WaitAll change. WaitAll used to hide exceptions, but with PR #14397 they are now thrown. These problems would have existed before; they are only surfacing now.

@leleamol
Contributor

I tried these examples with the recent "WaitAll()" code change on p2.16x and c5.18x instances and did not see the crash.

However, we still need to add the missing exception handling in the example so that we can prevent crashes due to unhandled exceptions.
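
Something along these lines would do it (sketch only; cpp-package failures are thrown as dmlc::Error, which derives from std::exception, so a catch around the training code covers the WaitAll path as well, though the example's actual cleanup may differ):

```cpp
#include <cstdio>
#include <exception>
#include "mxnet-cpp/MxNetCpp.h"

int main() {
  try {
    // ... build the symbol, bind the executor, run the training loop ...
    mxnet::cpp::NDArray::WaitAll();  // errors such as the cudaMalloc OOM now surface here
  } catch (const std::exception& e) {
    // Report the failure instead of letting the unhandled exception
    // terminate the process with an abort.
    std::fprintf(stderr, "resnet example failed: %s\n", e.what());
    MXNotifyShutdown();
    return 1;
  }
  MXNotifyShutdown();
  return 0;
}
```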

@anirudh2290
Member Author

Hi @leleamol, to reproduce you will have to use a g3.8xlarge. I was able to reproduce it on a g3.8xlarge.

@wkcn
Member

wkcn commented Mar 13, 2019

Could someone please look at the GPU memory used by the model?

@anirudh2290
Member Author

The last time I observed it, it was around 11 GB. For now I am going to use a smaller batch_size for the tests, and later @leleamol will revisit and improve the cpp tests.

@leleamol
Contributor

@anirudh2290
I could reproduce this issue on p2.8 as well when I changed the batch size to 100.
The example uses only one GPU. With batch size = 50, the GPU memory usage reaches 11 GB.

@leleamol
Contributor

This issue can be closed since the PR is merged. @lanking520
