This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

with_seed() broken when running GPU unit test #18198

Open
xidulu opened this issue Apr 30, 2020 · 2 comments

xidulu commented Apr 30, 2020

Description

I ran the GPU unit tests with the following command:
DMLC_LOG_STACK_TRACE_DEPTH=10 MXNET_MODULE_SEED=781106105 MXNET_ENGINE_TYPE=NaiveEngine pytest tests/python/gpu/test_operator_gpu.py

Error Message

tests/python/gpu/test_operator_gpu.py .........s.s...................... [  5%]
.........................FFFFFsF.FFFFFFFFFFFFFFFFFFFFFFFFFF.FFFFFFFFFFFF [ 16%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.FFF.FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 28%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFsFssFFFFFF [ 39%]
FFFFFFFFFFFFFFFFsFFFFFFFFFFFFFFFFFFFsFFFFFFFFFFFsFFFFFFFsFFFsFFFFFFFFFFF [ 50%]
FFFFF.FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFsFFFFFFFFFFFFFFFFFFFFFFFFF [ 62%]
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFsFFFFFFFFFsFFFFFF.FFFFFFFFFFFF [ 73%]
FFFFFFFFFFFFFFFFFFF....F.FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF....FFFFFFFFFFFFF [ 84%]
FFFxxxFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF [ 96%]
FFFFFFFFFFFFFFFFFFFFFFFF                                                 [100%]

=================================== FAILURES ===================================
_______________________ test_batchnorm_backwards_notrain _______________________

args = (), kwargs = {}, test_count = 1, env_seed_str = None, i = 0
this_test_seed = 1871614074, log_level = 10
post_test_state = ('MT19937', array([ 793462385, 4162567913, 2690816661, 3146259572, 1379942102,
        894119658,  364406528, 36749442..., 3314795127, 3420630909, 2538379262,
       3698999054, 2822638424,  471751221, 3037373484], dtype=uint32), 1, 0, 0.0)

    @functools.wraps(orig_test)
    def test_new(*args, **kwargs):
        test_count = int(os.getenv('MXNET_TEST_COUNT', '1'))
        env_seed_str = os.getenv('MXNET_TEST_SEED')
        for i in range(test_count):
            if seed is not None:
                this_test_seed = seed
                log_level = logging.INFO
            elif env_seed_str is not None:
                this_test_seed = int(env_seed_str)
                log_level = logging.INFO
            else:
                this_test_seed = np.random.randint(0, np.iinfo(np.int32).max)
                log_level = logging.DEBUG
            post_test_state = np.random.get_state()
            np.random.seed(this_test_seed)
>           mx.random.seed(this_test_seed)

tests/python/unittest/common.py:206: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
python/mxnet/random.py:96: in seed
    check_call(_LIB.MXRandomSeed(seed_state))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ret = -1

    def check_call(ret):
        """Check the return value of C API call.
    
        This function will raise an exception when an error occurs.
        Wrap every API call with this function.
    
        Parameters
        ----------
        ret : int
            return value from API calls.
        """
        if ret != 0:
>           raise get_last_ffi_error()
E           mxnet.base.MXNetError: Traceback (most recent call last):
E             [bt] (5) /home/ubuntu/mxnet_master_develop/python/mxnet/../../build/libmxnet.so(MXRandomSeed+0x1a) [0x7f89ce0a0c0a]
E             [bt] (4) /home/ubuntu/mxnet_master_develop/python/mxnet/../../build/libmxnet.so(mxnet::resource::ResourceManagerImpl::SeedRandom(unsigned int)+0x30b) [0x7f89d11b081b]
E             [bt] (3) /home/ubuntu/mxnet_master_develop/python/mxnet/../../build/libmxnet.so(mxnet::engine::NaiveEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0x43b) [0x7f89ce1d08eb]
E             [bt] (2) /home/ubuntu/mxnet_master_develop/python/mxnet/../../build/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::resource::ResourceManagerImpl::ResourceParallelRandom<mshadow::gpu>::SeedOne(unsigned long, unsigned int)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x1e) [0x7f89d11ac6ce]
E             [bt] (1) /home/ubuntu/mxnet_master_develop/python/mxnet/../../build/libmxnet.so(mxnet::common::random::RandGenerator<mshadow::gpu, float>::Seed(mshadow::Stream<mshadow::gpu>*, unsigned int)+0x1e9) [0x7f89d1240f55]
E             [bt] (0) /home/ubuntu/mxnet_master_develop/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x7f) [0x7f89cdf9b24f]
E             File "../src/common/random_generator.cu", line 58
E           Name: Check failed: err == cudaSuccess (10 vs. 0) : rand_generator_seed_kernel ErrStr:invalid device ordinal

python/mxnet/base.py:246: MXNetError
---------------------------- Captured stderr setup -----------------------------
WARNING:root:Unable to import numpy/mxnet. Skipping seeding for numpy/mxnet.
------------------------------ Captured log setup ------------------------------
WARNING  root:conftest.py:177 Unable to import numpy/mxnet. Skipping seeding for numpy/mxnet.
____________________ test_create_sparse_ndarray_gpu_to_cpu _____________________
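
For readers following the traceback, here is a minimal sketch of the seed-selection order that `test_new` appears to use, based only on the lines shown above. The helper name `pick_test_seed` is hypothetical, and the standard-library `random` module stands in for `np.random` so the snippet runs without NumPy or MXNet. In the log, `env_seed_str = None` and `log_level = 10` (DEBUG), so the random branch was taken; the failure happens afterwards, inside `mx.random.seed`.

```python
import os
import random

# Same upper bound as np.iinfo(np.int32).max in the traceback.
INT32_MAX = 2**31 - 1


def pick_test_seed(fixed_seed=None, environ=None):
    """Return (seed, source) using the precedence visible in the traceback.

    Hypothetical helper for illustration; not part of tests/python/unittest/common.py.
    """
    environ = os.environ if environ is None else environ
    env_seed_str = environ.get("MXNET_TEST_SEED")
    if fixed_seed is not None:
        # A seed passed to the decorator wins over everything else.
        return fixed_seed, "decorator argument"
    if env_seed_str is not None:
        # Otherwise fall back to the MXNET_TEST_SEED environment variable.
        return int(env_seed_str), "MXNET_TEST_SEED"
    # Otherwise draw a fresh seed; randrange excludes the upper bound,
    # like np.random.randint(0, max).
    return random.randrange(0, INT32_MAX), "random"


if __name__ == "__main__":
    print(pick_test_seed())
```

Setting `MXNET_TEST_SEED=1871614074` should make a rerun reuse the seed from the failing test, assuming the decorator behaves as the traceback shows.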

To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)
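
A possible short script, offered as an untested sketch rather than a confirmed reproduction: it assumes the failure can be triggered by calling `mx.random.seed` on its own under `NaiveEngine`, which is what the traceback suggests. The helper name `try_seed` is hypothetical, and the seed value is the one from the failing test above.

```python
import os

# Assumption: mirror the engine setting from the report. It has to be set
# before mxnet is imported to take effect.
os.environ.setdefault("MXNET_ENGINE_TYPE", "NaiveEngine")


def try_seed(seed):
    """Seed MXNet's RNGs and report the outcome as a short string."""
    try:
        import mxnet as mx
    except ImportError:
        return "mxnet not installed"
    try:
        # With the default ctx="all", this seeds the generators on every device.
        mx.random.seed(seed)
    except mx.base.MXNetError as err:
        return "seeding failed: " + str(err)
    return "seeding succeeded"


if __name__ == "__main__":
    print(try_seed(1871614074))
```

If this prints "seeding succeeded" on an affected build, the failure probably depends on something the test harness sets up before the decorator runs.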

Steps to reproduce

(Paste the commands you ran that produced the error.)

What have you tried to solve it?

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

xidulu added the Bug label Apr 30, 2020

haojin2 commented Apr 30, 2020

Is this possibly caused by #18025? @szha @leezu


leezu commented May 1, 2020

Yes, it's very likely
