This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] test_gluon_data unit tests failing #19877

Closed
josephevans opened this issue Feb 10, 2021 · 7 comments

@josephevans
Contributor

Description

On the v1.x pipeline, the following tests in tests/python/unittest/test_gluon_data.py are failing consistently:

test_multi_worker_dataloader_release_pool
test_multi_worker_forked_data_loader

Occurrences

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/7/pipeline/293/#step-776-log-1725
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/4/pipeline/296

Test failure logs:

[2021-02-10T01:39:46.205Z] test_gluon_data.test_multi_worker_dataloader_release_pool ... terminate called after throwing an instance of 'dmlc::Error'
[2021-02-10T01:39:46.205Z]   what():  [01:39:41] src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) : 
[2021-02-10T01:39:46.205Z] Stack trace:
[2021-02-10T01:39:46.205Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f191fc63b61]
[2021-02-10T01:39:46.205Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xd3) [0x7f192522fdf3]
[2021-02-10T01:39:46.205Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f1925237348]
[2021-02-10T01:39:46.205Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f1925232ce9]
[2021-02-10T01:39:46.205Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5ade409) [0x7f1924b21409]
[2021-02-10T01:39:46.205Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x61d3c50) [0x7f1925216c50]
[2021-02-10T01:39:46.205Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f1925210440]
[2021-02-10T01:39:46.205Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f192522c9d9]
[2021-02-10T01:39:46.205Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f1925219f5b]
[2021-02-10T01:39:46.205Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f1925216948]
[2021-02-10T01:39:46.461Z] /work/runtime_functions.sh: line 1008:     6 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_unittest.xml --verbose 
[2021-02-09T22:11:59.574Z] ======================================================================
[2021-02-09T22:11:59.574Z] ERROR: test_gluon_data.test_multi_worker_forked_data_loader
[2021-02-09T22:11:59.574Z] ----------------------------------------------------------------------
[2021-02-09T22:11:59.574Z] Traceback (most recent call last):
[2021-02-09T22:11:59.574Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
[2021-02-09T22:11:59.574Z]     self.test(*self.arg)
[2021-02-09T22:11:59.574Z]   File "/work/mxnet/tests/python/unittest/common.py", line 226, in test_new
[2021-02-09T22:11:59.574Z]     mx.nd.waitall()
[2021-02-09T22:11:59.574Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
[2021-02-09T22:11:59.574Z]     check_call(_LIB.MXNDArrayWaitAll())
[2021-02-09T22:11:59.574Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
[2021-02-09T22:11:59.574Z]     raise get_last_ffi_error()
[2021-02-09T22:11:59.574Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-02-09T22:11:59.574Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f0df6da1c48]
[2021-02-09T22:11:59.574Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f0df6da525b]
[2021-02-09T22:11:59.574Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f0df6db7e69]
[2021-02-09T22:11:59.574Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f0df6d9b740]
[2021-02-09T22:11:59.574Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x63dbf50) [0x7f0df6da1f50]
[2021-02-09T22:11:59.574Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5cde545) [0x7f0df66a4545]
[2021-02-09T22:11:59.574Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f0df6dbe0b9]
[2021-02-09T22:11:59.574Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f0df6dc2718]
[2021-02-09T22:11:59.574Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xcf) [0x7f0df6dbb27f]
[2021-02-09T22:11:59.574Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f0df16c59e1]
[2021-02-09T22:11:59.574Z]   File "src/storage/./cpu_shared_storage_manager.h", line 218
[2021-02-09T22:11:59.574Z] MXNetError: Check failed: count >= 0 (-1 vs. 0) : 
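
Both logs fail on the same check, `count >= 0`, inside `CPUSharedStorageManager::FreeImpl`. A minimal sketch of how such a check can trip, assuming `count` is a reference count on a shared-memory segment that is decremented on every free: if the same handle is freed more often than it was acquired, the count goes negative. `SharedRefCount` and `free` below are hypothetical stand-ins for illustration, not MXNet APIs.

```python
class SharedRefCount:
    """Hypothetical stand-in for a reference count on a shared-memory segment."""

    def __init__(self, initial=1):
        self.count = initial

    def decrement(self):
        self.count -= 1
        return self.count


def free(handle):
    """Release one reference and mirror the `count >= 0` check from the log."""
    count = handle.decrement()
    if count < 0:
        raise RuntimeError("Check failed: count >= 0 (%d vs. 0)" % count)
    return count == 0  # True when the last reference was released


handle = SharedRefCount(initial=1)
released = free(handle)  # the first free releases the segment

try:
    free(handle)  # a second free of the same handle underflows the count
    message = None
except RuntimeError as err:
    message = str(err)
```

In this sketch one extra free yields -1, and two extra frees would yield -2, which matches the shape of the two values seen in the logs. Whether the CI failures actually come from a double free is not established at this point in the thread.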
josephevans pushed a commit to josephevans/mxnet that referenced this issue Feb 10, 2021
@josephevans
Contributor Author

Since we're trying to unblock the v1.x CI pipeline, I am disabling these 2 tests for now in #19872.
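
The thread does not show how the tests were disabled. One common way, sketched here with the standard library's `unittest.skip` decorator, is to mark the test as skipped and point at the tracking issue in the reason string. The class name and test bodies are placeholders, not the real tests.

```python
import unittest


class DataLoaderTests(unittest.TestCase):
    # Placeholder body; the real test lives in test_gluon_data.py.
    @unittest.skip("Flaky on the v1.x CI, tracked in issue #19877")
    def test_multi_worker_forked_data_loader(self):
        raise AssertionError("not reached while the test is skipped")

    def test_placeholder(self):
        self.assertTrue(True)


# Run the suite in-process and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DataLoaderTests)
result = unittest.TestResult()
suite.run(result)
skip_reasons = [reason for _test, reason in result.skipped]
```

Keeping the issue number in the skip reason makes it easier to find and re-enable the test once the root cause is fixed.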

sandeep-krishnamurthy pushed a commit that referenced this issue Feb 10, 2021
* Attempt to fix v1.x CI issues.

* Re-pin scipy.

* Add numpy with pinned version so other package installs don't overwrite our required version.

* Use python3 (from /usr/local/bin) for tensorrt gpu tests, so it can find all required python modules.

* Fix onnx tests; need to pass scalar value (not np.array) to create_const_scalar_node.

* Fix pylint

* Set values using np.dtype(dtype) instead of using float32 and then casting to desired type.

* Skip 2 tests that are flaky, reported in issue #19877.

Co-authored-by: Joe Evans <[email protected]>
@access2rohit
Contributor

PR #19748 made changes to the Gluon data loader, but the issue is not reproducible on a local machine. CI is unblocked for now. Finding the root cause requires further investigation.

@access2rohit
Contributor

Raised PR #19879 to revert #19748 and re-enable the Gluon data loader tests, to see whether the test failures are caused by that PR. I am unable to reproduce the failure on a local instance using master without skipping the tests.

@josephevans
Contributor Author

We are getting another test failure that looks related to the Gluon data loader. We really need to dig in and find the root cause of this issue.

[2021-02-11T06:56:52.649Z] test_gluon_data.test_list_dataset ... terminate called after throwing an instance of 'dmlc::Error'
[2021-02-11T06:56:52.649Z]   what():  [06:56:48] /work/mxnet/src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-1 vs. 0) : 
[2021-02-11T06:56:52.649Z] Stack trace:
[2021-02-11T06:56:52.649Z]   [bt] (0) /work/mxnet/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f882bad4491]
[2021-02-11T06:56:52.649Z]   [bt] (1) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xd3) [0x7f882e523ea3]
[2021-02-11T06:56:52.649Z]   [bt] (2) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f882e527c08]
[2021-02-11T06:56:52.649Z]   [bt] (3) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f882e526d39]
[2021-02-11T06:56:52.649Z]   [bt] (4) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0xc747fa) [0x7f882bdc47fa]
[2021-02-11T06:56:52.649Z]   [bt] (5) /work/mxnet/python/mxnet/../../build/libmxnet.so(+0xac46bf) [0x7f882bc146bf]
[2021-02-11T06:56:52.649Z]   [bt] (6) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x5c4) [0x7f882bc20234]
[2021-02-11T06:56:52.649Z]   [bt] (7) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x33e) [0x7f882bc2694e]
[2021-02-11T06:56:52.649Z]   [bt] (8) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x190) [0x7f882bc16ba0]
[2021-02-11T06:56:52.649Z]   [bt] (9) /work/mxnet/python/mxnet/../../build/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd7) [0x7f882bc143b7]
[2021-02-11T06:56:52.649Z] 
[2021-02-11T06:56:52.649Z] 
[2021-02-11T06:56:52.649Z] /work/runtime_functions.sh: line 1008:     6 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_unittest.xml --verbose 

@josephevans
Contributor Author

I don't think it's related, but I created #19886 to test whether upgrading Python (from 3.6 to 3.7) caused these unit test failures.

@ptrendx
Member

ptrendx commented Feb 19, 2021

See my comment in the other issue about this: #19918 (comment)

@josephevans
Contributor Author

Thanks @ptrendx for the fix! Closing this issue.
