This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MXNet cu100 nightly release breaks Horovod integration tests #15578

Closed
apeforest opened this issue Jul 17, 2019 · 6 comments

Comments

@apeforest
Contributor

Description

For detailed reproducible steps and environment information, please refer to: horovod/horovod#1217

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Test

@ptrendx
Member

ptrendx commented Jul 17, 2019

This should be fixed by #15551

@yuxihu
Member

yuxihu commented Jul 18, 2019

Ran into another issue with the following setup:

mxnet-cu100==1.5.0b20190717

(mxnet_p36) ubuntu@ip-172-31-21-35:~$ pip install mxnet-cu100 --pre
Collecting mxnet-cu100
  Downloading https://files.pythonhosted.org/packages/b7/c4/1e0c4d21ed6dca0890a9db3e23f78c9d8df1651330547c7778aa1a974ea6/mxnet_cu100-1.5.0b20190717-py2.py3-none-manylinux1_x86_64.whl (540.1MB)
    100% |████████████████████████████████| 540.1MB 136kB/s
Requirement already satisfied: numpy<2.0.0,>1.16.0 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (1.17.0rc2)
Requirement already satisfied: requests<3,>=2.20.0 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (2.20.0)
Requirement already satisfied: graphviz<0.9.0,>=0.8.1 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from mxnet-cu100) (0.8.4)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (3.0.4)
Requirement already satisfied: idna<2.8,>=2.5 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (2.6)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (1.23)
Requirement already satisfied: certifi>=2017.4.17 in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from requests<3,>=2.20.0->mxnet-cu100) (2019.3.9)
Installing collected packages: mxnet-cu100
Successfully installed mxnet-cu100-1.5.0b20190717
You are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Horovod master branch

(mxnet_p36) ubuntu@ip-172-31-21-35:~$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITHOUT_PYTORCH=1 HOROVOD_WITH_MXNET=1 pip install --no-cache-dir ~/horovod
Processing ./horovod
Requirement already satisfied: cloudpickle in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (0.5.3)
Requirement already satisfied: psutil in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (5.4.5)
Requirement already satisfied: six in ./anaconda3/envs/mxnet_p36/lib/python3.6/site-packages (from horovod==0.16.4) (1.11.0)
Installing collected packages: horovod
  Found existing installation: horovod 0.16.4
    Uninstalling horovod-0.16.4:
      Successfully uninstalled horovod-0.16.4
  Running setup.py install for horovod ... done
Successfully installed horovod-0.16.4
You are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

The unit tests fail at collection time because the Horovod extension cannot resolve a symbol introduced in #15551:

(mxnet_p36) ubuntu@ip-172-31-21-35:~$ ~/anaconda3/envs/mxnet_p36/bin/mpirun -np 2 -H localhost:2 pytest horovod/test/test_mxnet.py
============================= test session starts ==============================
platform linux -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
============================= test session starts ==============================
platform linux -- Python 3.6.5, pytest-3.5.1, py-1.5.3, pluggy-0.6.0
rootdir: /home/ubuntu/horovod, inifile:
rootdir: /home/ubuntu/horovod, inifile:
plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
plugins: remotedata-0.2.1, openfiles-0.3.0, doctestplus-0.1.3, arraydiff-0.2
collected 0 items / 1 errors
collected 0 items / 1 errors

==================================== ERRORS ====================================
_____________________ ERROR collecting test/test_mxnet.py ______________________
horovod/test/test_mxnet.py:20: in <module>
    import horovod.mxnet as hvd
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py:25: in <module>
    from horovod.mxnet.mpi_ops import allgather
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py:29: in <module>
    _basics = _HorovodBasics(__file__, 'mpi_lib')
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/common/basics.py:27: in __init__
    self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)
anaconda3/envs/mxnet_p36/lib/python3.6/ctypes/__init__.py:348: in __init__
    self._handle = _dlopen(self._name, mode)
E   OSError: /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5mxnet7Context13CudaLibChecksEv
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
=========================== 1 error in 1.52 seconds ============================

==================================== ERRORS ====================================
_____________________ ERROR collecting test/test_mxnet.py ______________________
horovod/test/test_mxnet.py:20: in <module>
    import horovod.mxnet as hvd
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/__init__.py:25: in <module>
    from horovod.mxnet.mpi_ops import allgather
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_ops.py:29: in <module>
    _basics = _HorovodBasics(__file__, 'mpi_lib')
anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/common/basics.py:27: in __init__
    self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)
anaconda3/envs/mxnet_p36/lib/python3.6/ctypes/__init__.py:348: in __init__
    self._handle = _dlopen(self._name, mode)
E   OSError: /home/ubuntu/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN5mxnet7Context13CudaLibChecksEv
!!!!!!!!!!!!!!!!!!! Interrupted: 1 errors during collection !!!!!!!!!!!!!!!!!!!!
=========================== 1 error in 1.54 seconds ============================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[34828,1],0]
  Exit code:    2
--------------------------------------------------------------------------
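For readers unfamiliar with mangled names, the missing symbol in the `OSError` above can be decoded by hand. The sketch below is a minimal illustration, not a general demangler: the helper name `demangle_nested_name` is hypothetical, and it handles only plain nested names with an empty parameter list, which is enough for this one symbol. In practice `c++filt` from binutils does the same job properly.

```python
import re


def demangle_nested_name(symbol: str) -> str:
    """Decode a simple Itanium C++ ABI nested name: _ZN<len><id>...E<params>.

    Hypothetical helper for illustration only. It supports plain identifiers
    and a `v` (empty) parameter list, nothing else.
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name: " + symbol)
    pos = 3
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        length = int(digits)
        pos += len(digits)
        parts.append(symbol[pos:pos + length])
        pos += length
    # `E` closes the nested name; what follows encodes the parameter types.
    if pos >= len(symbol) or symbol[pos] != "E":
        raise ValueError("unsupported mangling: " + symbol)
    if symbol[pos + 1:] != "v":
        raise ValueError("only an empty parameter list is supported")
    return "::".join(parts) + "()"


missing = "_ZN5mxnet7Context13CudaLibChecksEv"
print(demangle_nested_name(missing))  # mxnet::Context::CudaLibChecks()
```

The decoded name, `mxnet::Context::CudaLibChecks()`, is the function that the Horovod `mpi_lib` extension references but the dynamic loader cannot find when the library is opened.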

@ptrendx
Member

ptrendx commented Jul 18, 2019

@DickJC123

@yuxihu
Member

yuxihu commented Jul 18, 2019

I am working on a PR to fix this.

@yuxihu
Member

yuxihu commented Jul 23, 2019

Fix has been merged in Horovod.

@yuxihu yuxihu closed this as completed Jul 23, 2019