Cmake with NCCL flag does not work. #17239

Closed
apeforest opened this issue Jan 7, 2020 · 10 comments · Fixed by #17297

Description

If I build MXNet with NCCL enabled using CMake, it fails with "Could not find NCCL libraries", even though my NCCL is installed at /usr/local/cuda/include.

Reproduce

cmake -GNinja -DUSE_CUDA=ON -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_BUILD_TYPE=Release -DUSE_CUDNN=ON -DUSE_NCCL=ON ..

CMake Warning at CMakeLists.txt:299 (message):
  Could not find NCCL libraries

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.3.1
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-4.4.0-1096-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-20-50
release      : 4.4.0-1096-aws
version      : #107-Ubuntu SMP Thu Oct 3 01:51:58 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0060 sec, LOAD: 0.5026 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0011 sec, LOAD: 0.5116 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1051 sec, LOAD: 0.3917 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0108 sec, LOAD: 0.2085 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.1761 sec, LOAD: 0.1178 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1306 sec, LOAD: 0.1471 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0123 sec, LOAD: 0.4014 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0120 sec, LOAD: 0.0739 sec.
apeforest added the Bug, Build, and CMake labels on Jan 7, 2020

leezu (Contributor) commented Jan 7, 2020

We may refactor https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindNCCL.cmake to improve autodetection. In the meantime, see the variables that file uses for searching; if you set one of them to your NCCL base directory, it should find NCCL successfully.
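
For example, something like this should work (assuming the module honors an NCCL_ROOT hint; the exact variable name may differ in the current FindNCCL.cmake):

cmake -GNinja -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_NCCL=ON -DNCCL_ROOT=/usr/local/cuda ..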

mjsML commented Jan 8, 2020

I experienced this too ... try using -DUSE_NCCL=1 -DUSE_NCCL_PATH=/usr/local/cuda/include (or, as @leezu suggested, your own NCCL path).

apeforest (Contributor, Author) commented Jan 12, 2020

@mjsML Thanks, using that flag worked for me. @guanxinq or @ChaiBapchya, are you interested in fixing FindNCCL.cmake as suggested? :)

ChaiBapchya (Contributor) commented Jan 13, 2020

I took a look at this auto-detection issue.

To solve this particular case, I added a check for the symlink (on UNIX): https://github.com/ChaiBapchya/incubator-mxnet/blob/nccl_autodetect/cmake/Modules/FindNCCL.cmake (a rough sketch of the idea is at the end of this comment).

If this is enough, I can submit a PR.

However, I'm not sure it is complete, because I also took a look at https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindCUDAToolkit.cmake.
It has a fairly elaborate way of finding the CUDA Toolkit:

  1. Language- or user-provided path
  2. If the CUDA root is not specified via a CMake variable or environment variable, check:
  • the symlink
  • the platform default location

Is this what's needed, @leezu @apeforest?
In that case it makes sense to factor this check out, since it will be used in two places (FindCUDAToolkit and FindNCCL).
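
Roughly, the symlink check amounts to something like this (a simplified sketch, not the exact code in my branch; the NCCL_INCLUDE_DIRS / NCCL_LIBRARIES names are only illustrative):

if(UNIX AND EXISTS "/usr/local/cuda")
  # Resolve the /usr/local/cuda symlink and use the real CUDA directory as an extra search hint.
  get_filename_component(_cuda_real "/usr/local/cuda" REALPATH)
  find_path(NCCL_INCLUDE_DIRS nccl.h
            HINTS "${_cuda_real}/include" "/usr/local/cuda/include")
  find_library(NCCL_LIBRARIES nccl
               HINTS "${_cuda_real}/lib64" "${_cuda_real}/lib")
endif()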

leezu (Contributor) commented Jan 13, 2020

@apeforest could you provide some background on whether NCCL is installed at /usr/local/cuda/include by default?

@ChaiBapchya your change seems to rely on CUDA_TOOLKIT_ROOT_DIR, but this variable is not among the variables exported by FindCUDAToolkit. In fact, you can see it's explicitly unset:

https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindCUDAToolkit.cmake#L708

Instead, let's use the result variables

https://github.com/apache/incubator-mxnet/blob/28e053edb4f2079743458bf087557bcac7e58c62/cmake/Modules/FindCUDAToolkit.cmake#L427-L464

Specifically, CUDAToolkit_INCLUDE_DIRS and CUDAToolkit_LIBRARY_DIR. Or would the NCCL library not be in CUDAToolkit_LIBRARY_DIR?

Besides using the CUDAToolkit variables as additional defaults for finding NCCL, the NCCL_ROOT variable needs to be honored as per https://cmake.org/cmake/help/latest/policy/CMP0074.html (which I think is currently done correctly).
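
Concretely, something along these lines (just a sketch; it assumes find_package(CUDAToolkit) has already run, and with CMP0074 set to NEW the NCCL_ROOT prefix is considered by these calls automatically):

# Use the CUDA Toolkit result variables as additional search hints for NCCL.
find_path(NCCL_INCLUDE_DIRS nccl.h
          HINTS ${CUDAToolkit_INCLUDE_DIRS})
find_library(NCCL_LIBRARIES nccl
             HINTS ${CUDAToolkit_LIBRARY_DIR})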

apeforest (Contributor, Author) commented:

On the DLAMI, NCCL is installed by default in the CUDA directory: /usr/local/cuda/include/nccl.h

However, if a user installed NCCL manually themselves with sudo apt install libnccl2 libnccl-dev, they can use sudo dpkg-query -L libnccl-dev to find where it is.
https://askubuntu.com/questions/1134732/where-is-nccl-h

I would suggest that @ChaiBapchya first search /usr/local/cuda/include/ and, if NCCL is not found there, try sudo dpkg-query -L libnccl-dev instead. Would that work?

apeforest (Contributor, Author) commented:

Thanks @ChaiBapchya for volunteering to work on this!

leezu (Contributor) commented Jan 13, 2020

If not found, try sudo dpkg-query -L libnccl-dev instead.

That would only work on Debian-based platforms, and only for one particular way of installing NCCL on those systems. I think it's safe to require users to set NCCL_ROOT if they manually installed NCCL to a different path.

To improve the user experience, we could fall back to building NCCL ourselves if NCCL is required but not found. PyTorch does that, for example.
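
Very roughly, such a fallback could look like the following (an untested sketch; it assumes NCCL's plain-make build and tracks master rather than a pinned release, both of which would need adjusting):

include(ExternalProject)
ExternalProject_Add(nccl_external
  GIT_REPOSITORY    https://github.com/NVIDIA/nccl.git
  GIT_TAG           master            # pin a concrete release tag in practice
  CONFIGURE_COMMAND ""                # NCCL has no configure step, just make
  BUILD_IN_SOURCE   1
  BUILD_COMMAND     make -j src.build CUDA_HOME=/usr/local/cuda
  INSTALL_COMMAND   ""                # point NCCL_INCLUDE_DIRS / NCCL_LIBRARIES at the build tree instead
)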

ChaiBapchya (Contributor) commented Jan 13, 2020

Yeah. I also looked at the FindNCCL CMake modules used in various other open-source frameworks:

  1. XGBoost - https://github.com/dmlc/xgboost/blob/master/cmake/modules/FindNccl.cmake
  2. Flashlight - https://github.com/facebookresearch/flashlight/blob/master/cmake/FindNCCL.cmake
  3. PyTorch - https://github.com/pytorch/pytorch/blob/master/cmake/Modules/FindNCCL.cmake
  4. Caffe - https://github.com/BVLC/caffe/blob/master/cmake/Modules/FindNCCL.cmake
  5. THUNDER - https://github.com/thuem/THUNDER/blob/master/cmake/FindNCCL.cmake

They take a similar approach: look for a default path, an environment variable (NCCL_ROOT), or /usr/local/cuda.

I agree with @leezu: I haven't seen "dpkg-query" or equivalent "find" commands used in CMake; those are more command-line searches. In CMake, find_path and find_library do a similar job.
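
For instance, the common shape across those modules is roughly this (simplified; the exact hint variables vary from project to project):

find_path(NCCL_INCLUDE_DIR nccl.h
          HINTS ${NCCL_ROOT} $ENV{NCCL_ROOT} /usr/local/cuda
          PATH_SUFFIXES include)
find_library(NCCL_LIBRARY nccl
             HINTS ${NCCL_ROOT} $ENV{NCCL_ROOT} /usr/local/cuda
             PATH_SUFFIXES lib lib64)
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(NCCL DEFAULT_MSG NCCL_INCLUDE_DIR NCCL_LIBRARY)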

Thanks @apeforest @leezu for chiming in!

leezu (Contributor) commented Jan 14, 2020

@ChaiBapchya BTW, unfortunately a lot of CMake usage out in the wild does not meet the modern CMake bar but is left over from the early days of CMake. While it doesn't cover all of MXNet's use cases, we can sometimes refer to https://cliutils.gitlab.io/modern-cmake/ for best practices.
