This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Nightly Test Failure] Tutorial test_tutorials.test_gluon_end_to_end Test Failure #14026

Closed
Chancebair opened this issue Jan 30, 2019 · 12 comments · Fixed by #14036 or #14134

Comments

@Chancebair
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/206/pipeline

This appears to have been failing since Jan 24th as a result of #13411

======================================================================
FAIL: test_tutorials.test_gluon_end_to_end
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 155, in test_gluon_end_to_end
    assert _test_tutorial_nb('gluon/gluon_from_experiment_to_deployment')
AssertionError

[fail] 22.49% test_tutorials.test_gluon_end_to_end: 656.8260s
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test, Gluon

@Chancebair
Contributor Author

@mxnet-label-bot add [Test, Gluon]

@mseth10
Contributor

mseth10 commented Jan 30, 2019

@roywei can you please take a look at this failure? It is probably caused by your PR.

@ThomasDelteil
Contributor

@roywei it times out; either too many epochs or too long a dataset download.

@roywei
Member

roywei commented Jan 30, 2019

@mseth10 @ThomasDelteil I'm looking into it.

@vishaalkapoor
Contributor

@roywei This appears to still be broken from the logs.

http://jenkins.mxnet-ci.amazon-ml.com/job/NightlyTestsForBinaries/job/master/220/console

Did you double-check the changes by running the NightlyTestsForBinaries locally? If it is failing in Jenkins but not locally, you should use docker containers to simulate the exact environment.

Please try to fix urgently as the Nightly tests have been broken for 15 days. Thanks!

@roywei
Member

roywei commented Feb 8, 2019

@vishaalkapoor I'm trying to fix it; it passes in local tests in 120s, well below the timeout limit. But I'm not able to run the Docker container setup according to cwiki step 2.

I'm using the Deep Learning Base AMI (Ubuntu) on a g3.8xlarge instance. The following is the error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 229, in _raise_for_status
    response.raise_for_status()
  File "/home/ubuntu/.local/lib/python3.5/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.35/containers/3b354aef9437fc2f703f5994dc6bf15694774877e8bec184a3a50d43a97daa23/start

@vishaalkapoor
Contributor

vishaalkapoor commented Feb 8, 2019

There's a connection issue in the logs. Perhaps it has to do with running a Docker image and being sandboxed in some manner.

As per http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/221/pipeline

ERROR:root:An error occurred while executing the following cell:

------------------
def test(net, val_data, ctx):
    metric = mx.metric.Accuracy()
    for i, (data, label) in enumerate(val_data):
        data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
        label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
        outputs = [net(x) for x in data]
        metric.update(label, outputs)
    return metric.get()

trainer = gluon.Trainer(finetune_net.collect_params(), optimizer=sgd_optimizer)

# start with epoch 1 for easier learning rate calculation
for epoch in range(1, epochs + 1):

    tic = time.time()
    train_loss = 0
    metric.reset()

    for i, (data, label) in enumerate(train_data):
        # get the images and labels
        data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
        label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
        with autograd.record():
            outputs = [finetune_net(x) for x in data]
            loss = [softmax_cross_entropy(yhat, y) for yhat, y in zip(outputs, label)]
        for l in loss:
            l.backward()

        trainer.step(batch_size)
        train_loss += sum([l.mean().asscalar() for l in loss]) / len(loss)
        metric.update(label, outputs)

    _, train_acc = metric.get()
    train_loss /= num_batch
    _, val_acc = test(finetune_net, val_data, ctx)

    print('[Epoch %d] Train-acc: %.3f, loss: %.3f | Val-acc: %.3f | learning-rate: %.3E | time: %.1f' %
          (epoch, train_acc, train_loss, val_acc, trainer.learning_rate, time.time() - tic))

_, test_acc = test(finetune_net, test_data, ctx)
print('[Finished] Test-acc: %.3f' % (test_acc))
------------------
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
<ipython-input-6-cfd10a99e63e> in <module>
     17     metric.reset()
     18 
---> 19     for i, (data, label) in enumerate(train_data):
     20         # get the images and labels
     21         data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)

/work/mxnet/python/mxnet/gluon/data/dataloader.py in __next__(self)
    441         assert self._rcvd_idx in self._data_buffer, "fatal error with _push_next, rcvd_idx missing"
    442         ret = self._data_buffer.pop(self._rcvd_idx)
--> 443         batch = pickle.loads(ret.get()) if self._dataset is None else ret.get()
    444         if self._pin_memory:
    445             batch = _as_in_context(batch, context.cpu_pinned())

/work/mxnet/python/mxnet/gluon/data/dataloader.py in rebuild_ndarray(pid, fd, shape, dtype)
     55             fd = multiprocessing.reduction.rebuild_handle(fd)
     56         else:
---> 57             fd = fd.detach()
     58         return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
     59 

/usr/lib/python3.5/multiprocessing/resource_sharer.py in detach(self)
     55         def detach(self):
     56             '''Get the fd.  This should only be called once.'''
---> 57             with _resource_sharer.get_connection(self._id) as conn:
     58                 return reduction.recv_handle(conn)
     59 

/usr/lib/python3.5/multiprocessing/resource_sharer.py in get_connection(ident)
     85         from .connection import Client
     86         address, key = ident
---> 87         c = Client(address, authkey=process.current_process().authkey)
     88         c.send((key, os.getpid()))
     89         return c

/usr/lib/python3.5/multiprocessing/connection.py in Client(address, family, authkey)
    485         c = PipeClient(address)
    486     else:
--> 487         c = SocketClient(address)
    488 
    489     if authkey is not None and not isinstance(authkey, bytes):

/usr/lib/python3.5/multiprocessing/connection.py in SocketClient(address)
    612     with socket.socket( getattr(socket, family) ) as s:
    613         s.setblocking(True)
--> 614         s.connect(address)
    615         return Connection(s.detach())
    616 

ConnectionRefusedError: [Errno 111] Connection refused

ConnectionRefusedError: [Errno 111] Connection refused

Re: Docker
One possible issue: make sure you're using the right platform and runtime. See the first line of http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/NightlyTestsForBinaries/branches/master/runs/221/nodes/76/steps/248/log/?start=0

+ ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu

(no need for docker registry above!)

Additionally, try a different region and/or use a higher verbosity with docker.
Vishaal

@roywei
Member

roywei commented Feb 9, 2019

@marcoabreu @Chancebair could you reopen this issue?

On the test failure
I think the root cause is that the shared memory allocated to the Docker container is too small, which makes the Gluon DataLoader hang when using multiple workers, according to issue #11872.

That's why this test passes locally and only fails on Docker.
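As an illustration of the knob involved (not necessarily what the fix PR does), here is a minimal sketch of running the data loading with multiprocessing workers turned off so that batches are not passed through /dev/shm at all; the dataset below is a placeholder, not the tutorial's actual dataset or transforms:

import mxnet as mx
from mxnet import gluon

# Placeholder dataset for illustration only; the tutorial uses its own
# image dataset and transforms.
dataset = gluon.data.ArrayDataset(mx.nd.random.uniform(shape=(64, 3, 224, 224)),
                                  mx.nd.zeros((64,)))

# With num_workers > 0 the DataLoader forks worker processes that hand
# batches back through shared memory (/dev/shm). Inside a container
# started with a small --shm-size this can hang or fail; num_workers=0
# keeps loading in the main process and avoids shared memory entirely.
train_data = gluon.data.DataLoader(dataset, batch_size=8, shuffle=True,
                                   num_workers=0)

for data, label in train_data:
    pass  # training loop as in the tutorial cell above

The other direction is to keep the workers and give the container more shared memory, which is what the --shm-size 500m flag on the nightly command controls.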

On the Docker reproduction failure
I'm still not able to build the dependencies according to step 2.2 here; I get the same error after changing regions. Without building the dependencies, running the test fails with an MXNet lib not found error.

Building the dependencies on g3.8xlarge with CUDA 9.1, cuDNN 7, and nvidia-docker2 using the following command:

ci/build.py --docker-registry mxnetci --nvidiadocker  -p ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_cuda91_cudnn7   

gives the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.35/containers/bf3bde2e3acca965512b321d1fd0df150cbf904a7dc0dfa04c207758d980e56a/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ci/build.py", line 582, in <module>
    sys.exit(main())
  File "ci/build.py", line 506, in main
    local_ccache_dir=args.ccache_dir, cleanup=cleanup, environment=environment)
  File "ci/build.py", line 307, in container_run
    environment=environment)
  File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 791, in run
    container.start()
  File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 392, in start
    return self.client.api.start(self.id, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/api/container.py", line 1091, in start
    self._raise_for_status(res)
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: 
container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.1 
--pid=21215 /var/lib/docker/overlay2/514b187cccda325ae75d5a526a1b060aa2d301708c6fc7c712529289ab2179ca/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown")

Running the test without the dependencies built:

ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu

gives an MXNet not found error.

@vishaalkapoor
Contributor

Hi @roywei,

Cool - do you have a log message that points to #14119?

If this is at the root-cause stage, or even at the stage of pending a fix for #14119, I think it would be best to comment out the test (see the sketch below), try to repro with Docker using the same hardware instance type as the test runner, fix, and then re-enable. The Berlin office hours are tomorrow morning if they're still on Tuesday mornings. It might be helpful to debug the Docker issues so that you can more easily repro.
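For reference, a minimal sketch of how the test from the traceback could be temporarily skipped, assuming the standard unittest skip mechanism (which nose honours) is acceptable here; the reason string is illustrative only:

# tests/tutorials/test_tutorials.py (sketch)
import unittest

@unittest.skip("Disabled pending #14026: tutorial hangs in the nightly Docker environment")
def test_gluon_end_to_end():
    assert _test_tutorial_nb('gluon/gluon_from_experiment_to_deployment')

Nose then reports the test as skipped rather than failed, so the nightly run stays green while the DataLoader issue is investigated.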

Vishaal

@roywei
Member

roywei commented Feb 11, 2019

Hi @vishaalkapoor, I have created another PR to disable it and marked the fix as WIP; I will try to reproduce and test the fix before merging.
Thanks

@vishaalkapoor
Contributor

great, thank you @roywei :)
