Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trying to use 2 GPUs results in neverending process that can't be killed without restart #100

Open
RuABraun opened this issue Jan 29, 2020 · 11 comments

Comments

@RuABraun
Copy link

RuABraun commented Jan 29, 2020

I have two 1080tis. When I try and use kmcuda with both by setting CUDA_VISIBLE_DEVICES to the GPUs I use for compute, (I've tried this out with device=0 and device=3), it gets stuck on
transposing the samples..., util shows 100%, but power usage isn't high at all. Data has shape (69673, 256).

I've waited a couple minutes, it takes half a minute usually on 1 gpu.

If I control-c/z it or kill the ID, the process becomes a zombie and I'm forced to restart to kill it (and use the gpus again).

@vmarkovtsev
Copy link
Collaborator

I wonder if this is like a bad magic shape. Can you please try sizes from 69600 till 69700?

Besides, can you please post the exact log at maximum verbosity level. Also, are you using Python, R or native interface? Linux, Windows, or MacOS? CUDA version? Did you compile yourself or used the wheel?

@RuABraun
Copy link
Author

RuABraun commented Jan 29, 2020

I'm using the python interface. Ubuntu 16.04, CUDA 10.2.

I compiled using instructions in the README. Then moved the .so to where I'm using it.

It fails for me with 60 000 as well.

Here's the output (still running):

seni@seni-MS-7A32:/work/fun/subword-repr$ CUDA_VISIBLE_DEVICES="0,1" py3 km.py data/repr_nums_small                                                                                                         
2020-01-29 20:24:33.629 | INFO     | __main__:kmeans:11 - Starting, data shape is (60000, 256).
arguments: 1 0x7ffd7a3368e4 0.001 0.10 0 60000 256 6000 3 0 0 2 0x7ff717f96010 0x7ff7179b9010 0x2e1bae0 0x7ffd7a33690c
reassignments threshold: 60
yinyang groups: 600
GPU #0 memory: used 318242816 bytes (2.7%), free 11403264000 bytes, total 11721506816 bytes
GPU #1 memory: used 318242816 bytes (2.7%), free 11401887744 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1875, 8), (8, 32)>>> 60000, 256, xyswap

Here's nvidia-smi

seni@seni-MS-7A32:/work/fun/subword-repr$ nvidia-smi 
Wed Jan 29 20:31:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     Off  | 00000000:08:00.0  On |                  N/A |
| 43%   46C    P0    N/A /  30W |    391MiB /  2001MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 31%   55C    P2    82W / 250W |    431MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:42:00.0  On |                  N/A |
| 21%   56C    P2    88W / 250W |    431MiB / 11177MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1752      G   /usr/lib/xorg/Xorg                           222MiB |
|    0      2545      G   compiz                                       148MiB |
|    0      3636      G   ...quest-channel-token=9743865575637794024    18MiB |
|    1    116273      C   python3.6                                    291MiB |
|    2    116273      C   python3.6                                    291MiB |
+-----------------------------------------------------------------------------+

Maybe it's because of Exclusive Process?

@RuABraun
Copy link
Author

With verbose=3

arguments: 1 0x7ffda37bcbf4 0.001 0.10 0 60000 256 6000 3 0 0 3 0x7fcd46527010 0x7fcd45f4a010 0x1f43050 0x7ffda37bcc1c
reassignments threshold: 60
yinyang groups: 600
[0] *dest: 0x7fcd18000000 - 0x7fcd1ba98000 (61440000)
[1] *dest: 0x7fccfc000000 - 0x7fccffa98000 (61440000)
[0] device_centroids: 0x7fcd07000000 - 0x7fcd075dc000 (6144000)
[1] device_centroids: 0x7fcd07600000 - 0x7fcd07bdc000 (6144000)
[0] device_assignments: 0x7fccffc00000 - 0x7fccffc3a980 (240000)
[1] device_assignments: 0x7fccffe00000 - 0x7fccffe3a980 (240000)
[0] device_assignments_prev: 0x7fccffc3aa00 - 0x7fccffc75380 (240000)
[1] device_assignments_prev: 0x7fccffe3aa00 - 0x7fccffe75380 (240000)
[0] device_ccounts: 0x7fccffc75400 - 0x7fccffc7b1c0 (24000)
[1] device_ccounts: 0x7fccffe75400 - 0x7fccffe7b1c0 (24000)
[0] device_assignments_yy: 0x7fccffc7b200 - 0x7fccffc80fc0 (24000)
[1] device_assignments_yy: 0x7fccffe7b200 - 0x7fccffe80fc0 (24000)
[0] device_bounds_yy: 0x7fccf6000000 - 0x7fccfa4c76c0 (72120000)
[1] device_bounds_yy: 0x7fccf0000000 - 0x7fccf44c76c0 (72120000)
[0] device_drifts_yy: 0x7fccf4600000 - 0x7fccf4be1dc0 (6168000)
[1] device_drifts_yy: 0x7fccf4c00000 - 0x7fccf51e1dc0 (6168000)
[0] device_passed_yy: 0x7fccffc81000 - 0x7fccffc9e4c0 (120000)
[1] device_passed_yy: 0x7fccffe81000 - 0x7fccffe9e4c0 (120000)
[0] device_centroids_yy: 0x7fccffc9e600 - 0x7fccffd34600 (614400)
[1] device_centroids_yy: 0x7fccffe9e600 - 0x7fccfff34600 (614400)
GPU #0 memory: used 318242816 bytes (2.7%), free 11403264000 bytes, total 11721506816 bytes
GPU #1 memory: used 318242816 bytes (2.7%), free 11401887744 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(1875, 8), (8, 32)>>> 60000, 256, xyswap

@vmarkovtsev
Copy link
Collaborator

vmarkovtsev commented Jan 29, 2020

100% GPU means that the code entered an infinite cycle... It is at https://github.com/src-d/kmcuda/blob/master/src/transpose.cu#L30

CUDA 10+ has not been tested yet, so this must be a code compatibility problem and not a "real" bug. Let's try three things:

  1. Remove volatile in all the *.cu files. Then recompile the .so. I observed some miscompilations on the newer CUDA versions and even reported them to NVIDIA, but never had time to convince them that there was a problem. The efficiency will likely degrade, but we are trying to make the code work at all.

  2. Is there anything relevant printed in dmesg?

  3. Run the code with volatile under cuda-memcheck. It will run 20x times slower but we should see the problems reported, if any.

@RuABraun
Copy link
Author

RuABraun commented Jan 29, 2020

Did 1. with sed -i 's/volatile//g' src/*cu, compiled and ran again, also hanging.

Sorry I'm not that experienced with this sort of thing what do you mean with "anything relevant printed in dmesg"? If I run dmesg I get lots of output, the only part that is probably related is

[ 1208.838449] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090000040 flags=0x0020]
[ 1208.838460] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ac139068 flags=0x0020]
[ 1208.838467] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900007c0 flags=0x0020]
[ 1208.838474] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090001040 flags=0x0020]
[ 1208.838481] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090000f40 flags=0x0020]
[ 1208.838488] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900016c0 flags=0x0020]
[ 1208.838495] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090002040 flags=0x0020]
[ 1208.838502] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090001e40 flags=0x0020]
[ 1208.838508] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000900025c0 flags=0x0020]
[ 1208.838515] nvidia 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x0000000090003040 flags=0x0020]
[ 1208.838522] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090002d40 flags=0x0020]
[ 1208.838528] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900034c0 flags=0x0020]
[ 1208.838535] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090004040 flags=0x0020]
[ 1208.838542] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090003c40 flags=0x0020]
[ 1208.838548] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900043c0 flags=0x0020]
[ 1208.838554] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090005040 flags=0x0020]
[ 1208.838561] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090004b40 flags=0x0020]
[ 1208.838567] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x00000000900052c0 flags=0x0020]
[ 1208.838574] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090006040 flags=0x0020]
[ 1208.838580] AMD-Vi: Event logged [IO_PAGE_FAULT device=09:00.0 domain=0x000f address=0x0000000090005a40 flags=0x0020]

Note this is while the program is hanging on the GPUs. Is that what I was supposed to do?

@RuABraun
Copy link
Author

And I don't actually know will it work to call the python program with cuda-memcheck (my worry is it only works when calling an actual binary file)?

@vmarkovtsev
Copy link
Collaborator

The dmesg log is actually very insightful. IO_PAGE_FAULT signals about nasty system bus errors, and I also see AMD-Vi. AMD-Vi is the virtualization kernel driver that is well known for causing problems with multi-GPU communication along with Intel VT-x.

  1. Let's become 100% that this is the cause. Please compile and run nccl-tests. all_reduce should either fail or hang.
  2. I hope you are not running in the cloud because I have bad news: this error is therefore hard to fix as you cannot boot a VM with disabled virtualization...
  3. Anyway, boot the kernel with amd_iommu=off or amd_iommu=soft (add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and run update-grub, then reboot).

@RuABraun
Copy link
Author

RuABraun commented Jan 30, 2020

Ran build/all_reduce_prof and it worked for me as well as all the others.

Should I still try step 3?

@vmarkovtsev
Copy link
Collaborator

OK, then let's try cuda-memcheck.

@RuABraun
Copy link
Author

Not really getting much output...

seni@seni-MS-7A32:/work/fun/subword-repr$ cuda-memcheck --leak-check full --print-level info py3 km.py data/repr_nums_small 
========= CUDA-MEMCHECK
2020-01-30 12:23:43.157 | INFO     | __main__:kmeans:11 - Starting, data shape is (67673, 256).
arguments: 1 0x7ffdac337424 0.001 0.05 0 67673 256 33836 3 0 0 3 0x7fc777262010 0x7fc775156010 0x25bfc40 0x7ffdac33744c
reassignments threshold: 67
yinyang groups: 1691
[0] *dest: 0x7fc728000000 - 0x7fc72c216400 (69297152)
[1] *dest: 0x7fc755200000 - 0x7fc759416400 (69297152)
[0] device_centroids: 0x7fc72c400000 - 0x7fc72e50b000 (34648064)
[1] device_centroids: 0x7fc71c000000 - 0x7fc71e10b000 (34648064)
[0] device_assignments: 0x7fc71e200000 - 0x7fc71e242164 (270692)
[1] device_assignments: 0x7fc71e400000 - 0x7fc71e442164 (270692)
[0] device_assignments_prev: 0x7fc71e242200 - 0x7fc71e284364 (270692)
[1] device_assignments_prev: 0x7fc71e442200 - 0x7fc71e484364 (270692)
[0] device_ccounts: 0x7fc71e284400 - 0x7fc71e2a54b0 (135344)
[1] device_ccounts: 0x7fc71e484400 - 0x7fc71e4a54b0 (135344)
[0] device_assignments_yy: 0x7fc71e2a5600 - 0x7fc71e2c66b0 (135344)
[1] device_assignments_yy: 0x7fc71e4a5600 - 0x7fc71e4c66b0 (135344)
[0] device_bounds_yy: 0x7fc70e000000 - 0x7fc71ba665b0 (229008816)
[1] device_bounds_yy: 0x7fc700000000 - 0x7fc70da665b0 (229008816)
[0] device_drifts_yy: 0x7fc6fc000000 - 0x7fc6fe12c0b0 (34783408)
[1] device_drifts_yy: 0x7fc6f8000000 - 0x7fc6fa12c0b0 (34783408)
[0] device_passed_yy: 0x7fc71e2c6800 - 0x7fc71e2e931c (142108)
[1] device_passed_yy: 0x7fc71e4c6800 - 0x7fc71e4e931c (142108)
[0] device_centroids_yy: 0x7fc6fa200000 - 0x7fc6fa3a6c00 (1731584)
[1] device_centroids_yy: 0x7fc6fa400000 - 0x7fc6fa5a6c00 (1731584)
GPU #0 memory: used 544735232 bytes (4.6%), free 11176771584 bytes, total 11721506816 bytes
GPU #1 memory: used 544735232 bytes (4.6%), free 11175395328 bytes, total 11720130560 bytes
GPU #0 has 49152 bytes of shared memory per block
GPU #1 has 49152 bytes of shared memory per block
transposing the samples...
transpose <<<(2115, 8), (8, 32)>>> 67673, 256, xyswap

@vmarkovtsev
Copy link
Collaborator

So there are no memory errors except that there are memory errors according to dmesg.

Then try disabling the iommu.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants