resnet cpp-package test is broken #14406

Closed
anirudh2290 opened this issue Mar 13, 2019 · 17 comments
Labels: Bug, C++, Example

Comments

@anirudh2290
Member

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14397/5/pipeline

After adding WaitAll support, the resnet example is failing with a cudaMalloc out-of-memory error.

@anirudh2290 added the Bug and C++ labels on Mar 13, 2019
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Example

@wkcn
Member

wkcn commented Mar 13, 2019

I wonder how much GPU memory the CI machine has.
The input shape is (50, 3, 224, 224), which may trigger OOM.
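
For a rough sense of scale (back-of-the-envelope, not taken from the example code), the input batch itself is tiny; the memory pressure comes from the intermediate activations ResNet keeps around at batch size 50:

```cpp
#include <cstdio>

int main() {
  // Size of a (50, 3, 224, 224) float32 input batch.
  const long long elems = 50LL * 3 * 224 * 224;
  const double mb = elems * 4 / (1024.0 * 1024.0);
  std::printf("input batch: %.1f MB\n", mb);  // ~28.7 MB
  // The input alone is small; the OOM comes from the per-layer activations
  // and workspace that ResNet allocates at this batch size.
  return 0;
}
```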

@wkcn
Member

wkcn commented Mar 13, 2019

In addition, the model in cpp-package does not seem to converge.

@anirudh2290
Member Author

I think it's running on a p3.8xlarge, which should be sufficient to run this test. @marcoabreu can you confirm?

@anirudh2290
Member Author

In addition, the model in cpp-package does not seem to converge.

Yes, I observed that too.

@wkcn
Member

wkcn commented Mar 13, 2019

Since the input shape of ResNet is (3, 224, 224), I resized the MNIST images from (1, 28, 28) to (3, 224, 224).
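
Roughly what that preprocessing amounts to (a minimal sketch in plain C++ using nearest-neighbour upsampling and channel replication; the example's actual resize path may differ):

```cpp
#include <vector>

// Upsample a single-channel 28x28 MNIST image to 3x224x224 by
// nearest-neighbour sampling, replicating the grayscale channel three
// times (sketch only; not the code used in resnet.cpp).
std::vector<float> ResizeMnistTo224(const std::vector<float>& src) {
  const int kSrc = 28, kDst = 224, kChannels = 3;
  std::vector<float> dst(kChannels * kDst * kDst);
  for (int c = 0; c < kChannels; ++c) {
    for (int y = 0; y < kDst; ++y) {
      for (int x = 0; x < kDst; ++x) {
        const int sy = y * kSrc / kDst;
        const int sx = x * kSrc / kDst;
        dst[(c * kDst + y) * kDst + x] = src[sy * kSrc + sx];
      }
    }
  }
  return dst;
}
```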

@marcoabreu
Contributor

We run on a g3.8xlarge.

@wkcn
Member

wkcn commented Mar 13, 2019

Changing the batch size to a smaller value will address the OOM issue.
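
For example (hypothetical values; the paths and parameter names follow the usual cpp-package MXDataIter/MNISTIter setup and may not match resnet.cpp exactly):

```cpp
#include "mxnet-cpp/MxNetCpp.h"
using namespace mxnet::cpp;

int main() {
  // Hypothetical: drop the batch size from 50 to something the 8 GB
  // Tesla M60 in a g3.8xlarge can hold.
  const int batch_size = 8;
  auto train_iter = MXDataIter("MNISTIter")
      .SetParam("image", "./data/mnist_data/train-images-idx3-ubyte")
      .SetParam("label", "./data/mnist_data/train-labels-idx1-ubyte")
      .SetParam("batch_size", batch_size)
      .SetParam("shuffle", 1)
      .CreateDataIter();
  // ... build the ResNet symbol and run the training loop as before ...
  MXNotifyShutdown();
  return 0;
}
```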

@leleamol
Contributor

leleamol commented Mar 13, 2019

@marcoabreu There have been no recent changes to alexnet.cpp, resnet.cpp, or the cpp-package.
Were there any changes to the underlying CUDA or MXNet implementation?

These tests were part of the CI suite and have been passing before. We could change the examples so that they pass on lower-capacity instances, but in my opinion that would not be the right solution.

@ddavydenko
Contributor

Has the infrastructure these tests run on changed recently? It seems the test would run fine on a p3.8xl but fails on a g3.8x (legacy hardware)... @marcoabreu

@anirudh2290
Member Author

As I said, this started with the WaitAll change. WaitAll used to hide exceptions, but with PR #14397 they are now thrown. These problems would have existed before; they are only surfacing now.

@leleamol
Contributor

I tried these examples with the recent "WaitAll()" code change on p2.16x and c5.18x instances and did not see the crash.

However, we still need to add the missing exception handling in the example so that we can prevent crashes due to unhandled exceptions.
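
Something along these lines would do it (sketch only; cpp-package failures are thrown as dmlc::Error, which derives from std::exception, so a catch around the training code covers the WaitAll path as well, though the example's actual cleanup may differ):

```cpp
#include <cstdio>
#include <exception>
#include "mxnet-cpp/MxNetCpp.h"

int main() {
  try {
    // ... build the symbol, bind the executor, run the training loop ...
    mxnet::cpp::NDArray::WaitAll();  // errors such as the cudaMalloc OOM now surface here
  } catch (const std::exception& e) {
    // Report the failure instead of letting the unhandled exception
    // terminate the process with an abort.
    std::fprintf(stderr, "resnet example failed: %s\n", e.what());
    MXNotifyShutdown();
    return 1;
  }
  MXNotifyShutdown();
  return 0;
}
```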

@anirudh2290
Member Author

Hi @leleamol, to reproduce you will have to use a g3.8xlarge. I was able to reproduce it on a g3.8xlarge.

@wkcn
Member

wkcn commented Mar 13, 2019

Could someone please look at the GPU memory used by the model?

@anirudh2290
Member Author

The last time I observed it, it was around 11 GB. For now I am going to use a smaller batch_size for the tests, and later @leleamol will revisit and improve the cpp tests.

@leleamol
Contributor

@anirudh2290
I could reproduce this issue on p2.8 as well when I changed the batch size to 100.
The example uses only one GPU. With batch size = 50, the GPU memory usage reaches 11 GB.

@leleamol
Contributor

This issue can be closed since the PR is merged. @lanking520
