Cmake with NCCL flag does not work. #17239

Closed
apeforest opened this issue Jan 7, 2020 · 10 comments · Fixed by #17297

Description

If I build MXNet with NCCL enabled using CMake, it fails with "Could not find NCCL libraries", even though my NCCL is installed at /usr/local/cuda/include.

Reproduce

cmake -GNinja -DUSE_CUDA=ON -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_BUILD_TYPE=Release -DUSE_CUDNN=ON -DUSE_NCCL=ON ..

CMake Warning at CMakeLists.txt:299 (message):
  Could not find NCCL libraries

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 7.2.0
Build        : ('default', 'Jun 28 2018 17:14:51')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.3.1
Directory    : /home/ubuntu/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform     : Linux-4.4.0-1096-aws-x86_64-with-debian-stretch-sid
system       : Linux
node         : ip-172-31-20-50
release      : 4.4.0-1096-aws
version      : #107-Ubuntu SMP Thu Oct 3 01:51:58 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2699.984
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0060 sec, LOAD: 0.5026 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0011 sec, LOAD: 0.5116 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1051 sec, LOAD: 0.3917 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0108 sec, LOAD: 0.2085 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.1761 sec, LOAD: 0.1178 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.1306 sec, LOAD: 0.1471 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0123 sec, LOAD: 0.4014 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0120 sec, LOAD: 0.0739 sec.
apeforest added the Bug, Build, and CMake labels on Jan 7, 2020

leezu (Contributor) commented Jan 7, 2020

We may refactor https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindNCCL.cmake to improve autodetection. In the meantime, see the variables that file uses for searching; if you set one of them to your NCCL base directory, it should find NCCL successfully.
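
For example, something like this should work (assuming the module honors an NCCL_ROOT hint; the exact variable name may differ in the current FindNCCL.cmake):

cmake -GNinja -DUSE_CUDA=ON -DUSE_CUDNN=ON -DUSE_NCCL=ON -DNCCL_ROOT=/usr/local/cuda ..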

mjsML commented Jan 8, 2020

I experienced this too ... try using -DUSE_NCCL=1 -DUSE_NCCL_PATH=/usr/local/cuda/include (or, as @leezu suggested, your own NCCL path).

apeforest (Contributor, Author) commented Jan 12, 2020

@mjsML Thanks, using that flag worked for me. @guanxinq or @ChaiBapchya, are you interested in fixing FindNCCL.cmake as suggested? :)

ChaiBapchya (Contributor) commented Jan 13, 2020

I took a look at this auto-detection issue.

To solve this particular case, I added a check for the symlink (on UNIX): https://github.com/ChaiBapchya/incubator-mxnet/blob/nccl_autodetect/cmake/Modules/FindNCCL.cmake (a rough sketch of the idea is at the end of this comment).

If this is enough, I can submit a PR.

However, I'm not sure it is complete, because I also took a look at https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindCUDAToolkit.cmake.
It has a fairly elaborate way of finding the CUDA Toolkit:

  1. Language- or user-provided path
  2. If the CUDA root is not specified via a CMake variable or environment variable, check:
  • the symlink
  • the platform default location

Is this what's needed, @leezu @apeforest?
In that case it makes sense to factor this check out, since it will be used in two places (FindCUDAToolkit and FindNCCL).
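
Roughly, the symlink check amounts to something like this (a simplified sketch, not the exact code in my branch; the NCCL_INCLUDE_DIRS / NCCL_LIBRARIES names are only illustrative):

if(UNIX AND EXISTS "/usr/local/cuda")
  # Resolve the /usr/local/cuda symlink and use the real CUDA directory as an extra search hint.
  get_filename_component(_cuda_real "/usr/local/cuda" REALPATH)
  find_path(NCCL_INCLUDE_DIRS nccl.h
            HINTS "${_cuda_real}/include" "/usr/local/cuda/include")
  find_library(NCCL_LIBRARIES nccl
               HINTS "${_cuda_real}/lib64" "${_cuda_real}/lib")
endif()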

leezu (Contributor) commented Jan 13, 2020

@apeforest could you provide some background on whether NCCL is installed at /usr/local/cuda/include by default?

@ChaiBapchya your change seems to rely on CUDA_TOOLKIT_ROOT_DIR, but this variable is not among the variables exported by FindCUDAToolkit. In fact, you can see it's explicitly unset:

https://github.com/apache/incubator-mxnet/blob/master/cmake/Modules/FindCUDAToolkit.cmake#L708

Instead, let's use the result variables

https://github.com/apache/incubator-mxnet/blob/28e053edb4f2079743458bf087557bcac7e58c62/cmake/Modules/FindCUDAToolkit.cmake#L427-L464

Specifically, CUDAToolkit_INCLUDE_DIRS and CUDAToolkit_LIBRARY_DIR. Or would the NCCL library not be in CUDAToolkit_LIBRARY_DIR?

Besides using the CUDAToolkit variables as additional defaults for finding NCCL, the NCCL_ROOT variable needs to be honored as per https://cmake.org/cmake/help/latest/policy/CMP0074.html (which I think is currently done correctly).
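
Concretely, something along these lines (just a sketch; it assumes find_package(CUDAToolkit) has already run, and with CMP0074 set to NEW the NCCL_ROOT prefix is considered by these calls automatically):

# Use the CUDA Toolkit result variables as additional search hints for NCCL.
find_path(NCCL_INCLUDE_DIRS nccl.h
          HINTS ${CUDAToolkit_INCLUDE_DIRS})
find_library(NCCL_LIBRARIES nccl
             HINTS ${CUDAToolkit_LIBRARY_DIR})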

apeforest (Contributor, Author) commented:

On the DLAMI, NCCL is installed by default in the CUDA directory: /usr/local/cuda/include/nccl.h

However, if a user installed NCCL manually themselves with sudo apt install libnccl2 libnccl-dev, they can use sudo dpkg-query -L libnccl-dev to find where it is.
https://askubuntu.com/questions/1134732/where-is-nccl-h

I would suggest that @ChaiBapchya first search /usr/local/cuda/include/ and, if NCCL is not found there, try sudo dpkg-query -L libnccl-dev instead. Would that work?

apeforest (Contributor, Author) commented:

Thanks @ChaiBapchya for volunteering to work on this!

leezu (Contributor) commented Jan 13, 2020

If not found, try sudo dpkg-query -L libnccl-dev instead.

That would only work on Debian-based platforms, and only for one particular way of installing NCCL on those systems. I think it's safe to require users to set NCCL_ROOT if they manually installed NCCL to a different path.

To improve the user experience, we could fall back to building NCCL ourselves if NCCL is required but not found. PyTorch does that, for example.
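
Very roughly, such a fallback could look like the following (an untested sketch; it assumes NCCL's plain-make build and tracks master rather than a pinned release, both of which would need adjusting):

include(ExternalProject)
ExternalProject_Add(nccl_external
  GIT_REPOSITORY    https://github.com/NVIDIA/nccl.git
  GIT_TAG           master            # pin a concrete release tag in practice
  CONFIGURE_COMMAND ""                # NCCL has no configure step, just make
  BUILD_IN_SOURCE   1
  BUILD_COMMAND     make -j src.build CUDA_HOME=/usr/local/cuda
  INSTALL_COMMAND   ""                # point NCCL_INCLUDE_DIRS / NCCL_LIBRARIES at the build tree instead
)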

ChaiBapchya (Contributor) commented Jan 13, 2020

Yeah. I also looked at the FindNCCL CMake modules used in various other open-source frameworks:

  1. XGBoost - https://github.com/dmlc/xgboost/blob/master/cmake/modules/FindNccl.cmake
  2. Flashlight - https://github.com/facebookresearch/flashlight/blob/master/cmake/FindNCCL.cmake
  3. PyTorch - https://github.com/pytorch/pytorch/blob/master/cmake/Modules/FindNCCL.cmake
  4. Caffe - https://github.com/BVLC/caffe/blob/master/cmake/Modules/FindNCCL.cmake
  5. THUNDER - https://github.com/thuem/THUNDER/blob/master/cmake/FindNCCL.cmake

They take a similar approach: look for a default path, an environment variable (NCCL_ROOT), or /usr/local/cuda.

I agree with @leezu: I haven't seen "dpkg-query" or equivalent "find" commands used in CMake; those are more command-line searches. In CMake, find_path and find_library do a similar job.
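
For instance, the common shape across those modules is roughly this (simplified; the exact hint variables vary from project to project):

find_path(NCCL_INCLUDE_DIR nccl.h
          HINTS ${NCCL_ROOT} $ENV{NCCL_ROOT} /usr/local/cuda
          PATH_SUFFIXES include)
find_library(NCCL_LIBRARY nccl
             HINTS ${NCCL_ROOT} $ENV{NCCL_ROOT} /usr/local/cuda
             PATH_SUFFIXES lib lib64)
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(NCCL DEFAULT_MSG NCCL_INCLUDE_DIR NCCL_LIBRARY)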

Thanks @apeforest @leezu for chiming in!

leezu (Contributor) commented Jan 14, 2020

@ChaiBapchya BTW, unfortunately a lot of CMake usage out in the wild does not meet the modern CMake bar but is left over from the early days of CMake. While it doesn't cover all of MXNet's use cases, we can sometimes refer to https://cliutils.gitlab.io/modern-cmake/ for best practices.
