This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Nightly Test Failure] Tutorial test_tutorials.test_gluon_end_to_end Test Failure #14026

Closed
Chancebair opened this issue Jan 30, 2019 · 12 comments · Fixed by #14036 or #14134

Comments

@Chancebair
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/206/pipeline

This appears to have been failing since Jan 24th as a result of #13411

======================================================================
FAIL: test_tutorials.test_gluon_end_to_end
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 155, in test_gluon_end_to_end
    assert _test_tutorial_nb('gluon/gluon_from_experiment_to_deployment')
AssertionError

[fail] 22.49% test_tutorials.test_gluon_end_to_end: 656.8260s
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test, Gluon

@Chancebair
Contributor Author

@mxnet-label-bot add [Test, Gluon]

@mseth10
Contributor

mseth10 commented Jan 30, 2019

@roywei can you please take a look at this failure? It is probably caused by your PR.

@ThomasDelteil
Contributor

@roywei it times out; either too many epochs or too long a dataset download.

@roywei
Member

roywei commented Jan 30, 2019

@mseth10 @ThomasDelteil I'm looking into it.

@vishaalkapoor
Contributor

@roywei This appears to still be broken from the logs.

http://jenkins.mxnet-ci.amazon-ml.com/job/NightlyTestsForBinaries/job/master/220/console

Did you double-check the changes by running the NightlyTestsForBinaries locally? If it is failing in Jenkins but not locally, you should use docker containers to simulate the exact environment.

Please try to fix urgently as the Nightly tests have been broken for 15 days. Thanks!

@roywei
Member

roywei commented Feb 8, 2019

@vishaalkapoor I'm trying to fix it; it passes in local tests in 120s, well below the timeout limit. But I'm not able to run the Docker container setup according to cwiki step 2.

I'm using the Deep Learning Base AMI (Ubuntu) on a g3.8xlarge instance. The following is the error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 229, in _raise_for_status
    response.raise_for_status()
  File "/home/ubuntu/.local/lib/python3.5/site-packages/requests/models.py", line 935, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.35/containers/3b354aef9437fc2f703f5994dc6bf15694774877e8bec184a3a50d43a97daa23/start

@vishaalkapoor
Contributor

vishaalkapoor commented Feb 8, 2019

There's a connection issue in the logs. Perhaps it has to do with running a Docker image and being sandboxed in some manner.

As per http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/NightlyTestsForBinaries/detail/master/221/pipeline

ERROR:root:An error occurred while executing the following cell:

------------------
def test(net, val_data, ctx):
    metric = mx.metric.Accuracy()
    for i, (data, label) in enumerate(val_data):
        data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
        label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
        outputs = [net(x) for x in data]
        metric.update(label, outputs)
    return metric.get()

trainer = gluon.Trainer(finetune_net.collect_params(), optimizer=sgd_optimizer)

# start with epoch 1 for easier learning rate calculation
for epoch in range(1, epochs + 1):

    tic = time.time()
    train_loss = 0
    metric.reset()

    for i, (data, label) in enumerate(train_data):
        # get the images and labels
        data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)
        label = gluon.utils.split_and_load(label, ctx_list=ctx, even_split=False)
        with autograd.record():
            outputs = [finetune_net(x) for x in data]
            loss = [softmax_cross_entropy(yhat, y) for yhat, y in zip(outputs, label)]
        for l in loss:
            l.backward()

        trainer.step(batch_size)
        train_loss += sum([l.mean().asscalar() for l in loss]) / len(loss)
        metric.update(label, outputs)

    _, train_acc = metric.get()
    train_loss /= num_batch
    _, val_acc = test(finetune_net, val_data, ctx)

    print('[Epoch %d] Train-acc: %.3f, loss: %.3f | Val-acc: %.3f | learning-rate: %.3E | time: %.1f' %
          (epoch, train_acc, train_loss, val_acc, trainer.learning_rate, time.time() - tic))

_, test_acc = test(finetune_net, test_data, ctx)
print('[Finished] Test-acc: %.3f' % (test_acc))
------------------
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
<ipython-input-6-cfd10a99e63e> in <module>
     17     metric.reset()
     18 
---> 19     for i, (data, label) in enumerate(train_data):
     20         # get the images and labels
     21         data = gluon.utils.split_and_load(data, ctx_list=ctx, even_split=False)

/work/mxnet/python/mxnet/gluon/data/dataloader.py in __next__(self)
    441         assert self._rcvd_idx in self._data_buffer, "fatal error with _push_next, rcvd_idx missing"
    442         ret = self._data_buffer.pop(self._rcvd_idx)
--> 443         batch = pickle.loads(ret.get()) if self._dataset is None else ret.get()
    444         if self._pin_memory:
    445             batch = _as_in_context(batch, context.cpu_pinned())

/work/mxnet/python/mxnet/gluon/data/dataloader.py in rebuild_ndarray(pid, fd, shape, dtype)
     55             fd = multiprocessing.reduction.rebuild_handle(fd)
     56         else:
---> 57             fd = fd.detach()
     58         return nd.NDArray(nd.ndarray._new_from_shared_mem(pid, fd, shape, dtype))
     59 

/usr/lib/python3.5/multiprocessing/resource_sharer.py in detach(self)
     55         def detach(self):
     56             '''Get the fd.  This should only be called once.'''
---> 57             with _resource_sharer.get_connection(self._id) as conn:
     58                 return reduction.recv_handle(conn)
     59 

/usr/lib/python3.5/multiprocessing/resource_sharer.py in get_connection(ident)
     85         from .connection import Client
     86         address, key = ident
---> 87         c = Client(address, authkey=process.current_process().authkey)
     88         c.send((key, os.getpid()))
     89         return c

/usr/lib/python3.5/multiprocessing/connection.py in Client(address, family, authkey)
    485         c = PipeClient(address)
    486     else:
--> 487         c = SocketClient(address)
    488 
    489     if authkey is not None and not isinstance(authkey, bytes):

/usr/lib/python3.5/multiprocessing/connection.py in SocketClient(address)
    612     with socket.socket( getattr(socket, family) ) as s:
    613         s.setblocking(True)
--> 614         s.connect(address)
    615         return Connection(s.detach())
    616 

ConnectionRefusedError: [Errno 111] Connection refused

ConnectionRefusedError: [Errno 111] Connection refused

Re: Docker
One possible issue: make sure you're using the right platform and runtime. See the first line of http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/NightlyTestsForBinaries/branches/master/runs/221/nodes/76/steps/248/log/?start=0

+ ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu

(no need for docker registry above!)

Additionally, try a different region and/or use a higher verbosity with docker.
Vishaal

@roywei
Member

roywei commented Feb 9, 2019

@marcoabreu @Chancebair could you reopen this issue?

On the test failure
I think the root cause is that the shared memory allocated to the Docker container is too small, which makes the Gluon DataLoader hang when using multiple workers, according to issue #11872.

That's why this test passes locally and only fails on Docker.
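As an illustration of the knob involved (not necessarily what the fix PR does), here is a minimal sketch of running the data loading with multiprocessing workers turned off so that batches are not passed through /dev/shm at all; the dataset below is a placeholder, not the tutorial's actual dataset or transforms:

import mxnet as mx
from mxnet import gluon

# Placeholder dataset for illustration only; the tutorial uses its own
# image dataset and transforms.
dataset = gluon.data.ArrayDataset(mx.nd.random.uniform(shape=(64, 3, 224, 224)),
                                  mx.nd.zeros((64,)))

# With num_workers > 0 the DataLoader forks worker processes that hand
# batches back through shared memory (/dev/shm). Inside a container
# started with a small --shm-size this can hang or fail; num_workers=0
# keeps loading in the main process and avoids shared memory entirely.
train_data = gluon.data.DataLoader(dataset, batch_size=8, shuffle=True,
                                   num_workers=0)

for data, label in train_data:
    pass  # training loop as in the tutorial cell above

The other direction is to keep the workers and give the container more shared memory, which is what the --shm-size 500m flag on the nightly command controls.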

On the Docker reproduction failure
I'm still not able to build the dependencies according to step 2.2 here; I get the same error after changing regions. Without building the dependencies, running the test fails with an MXNet lib not found error.

Building the dependencies on g3.8xlarge with CUDA 9.1, cuDNN 7, and nvidia-docker2 using the following command:

ci/build.py --docker-registry mxnetci --nvidiadocker  -p ubuntu_build_cuda /work/runtime_functions.sh build_ubuntu_gpu_cuda91_cudnn7   

gives the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 256, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.5/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.35/containers/bf3bde2e3acca965512b321d1fd0df150cbf904a7dc0dfa04c207758d980e56a/start

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "ci/build.py", line 582, in <module>
    sys.exit(main())
  File "ci/build.py", line 506, in main
    local_ccache_dir=args.ccache_dir, cleanup=cleanup, environment=environment)
  File "ci/build.py", line 307, in container_run
    environment=environment)
  File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 791, in run
    container.start()
  File "/usr/local/lib/python3.5/dist-packages/docker/models/containers.py", line 392, in start
    return self.client.api.start(self.id, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/docker/api/container.py", line 1091, in start
    self._raise_for_status(res)
  File "/usr/local/lib/python3.5/dist-packages/docker/api/client.py", line 258, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: 
container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.1 
--pid=21215 /var/lib/docker/overlay2/514b187cccda325ae75d5a526a1b060aa2d301708c6fc7c712529289ab2179ca/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown")

Running the test without the dependencies built:

ci/build.py --docker-registry mxnetci --nvidiadocker --platform ubuntu_nightly_gpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh nightly_tutorial_test_ubuntu_python2_gpu

gives an MXNet not found error.

@vishaalkapoor
Contributor

Hi @roywei,

Cool - do you have a log message that points to #14119?

If this is at the root-cause stage, or even at the stage of pending a fix for #14119, I think it would be best to comment out the test (see the sketch below), try to repro with Docker using the same hardware instance type as the test runner, fix, and then re-enable. The Berlin office hours are tomorrow morning if they're still on Tuesday mornings. It might be helpful to debug the Docker issues so that you can more easily repro.
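For reference, a minimal sketch of how the test from the traceback could be temporarily skipped, assuming the standard unittest skip mechanism (which nose honours) is acceptable here; the reason string is illustrative only:

# tests/tutorials/test_tutorials.py (sketch)
import unittest

@unittest.skip("Disabled pending #14026: tutorial hangs in the nightly Docker environment")
def test_gluon_end_to_end():
    assert _test_tutorial_nb('gluon/gluon_from_experiment_to_deployment')

Nose then reports the test as skipped rather than failed, so the nightly run stays green while the DataLoader issue is investigated.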

Vishaal

@roywei
Member

roywei commented Feb 11, 2019

Hi @vishaalkapoor, I have created another PR to disable it and marked the fix as WIP; I will try to reproduce and test the fix before merging.
Thanks

@vishaalkapoor
Contributor

great, thank you @roywei :)
