
How to tell if medaka consensus is using cpu or gpu #65

Closed
HenrivdGeest opened this issue Jul 11, 2019 · 5 comments

@HenrivdGeest

We have a CentOS machine with an RTX 2080 Ti (11 GB) installed, plus a quad-core Xeon with 16 GB RAM, running CUDA 10 with driver 418:
NVIDIA-SMI 418.56 Driver Version: 418.56 CUDA Version: 10.1
I installed medaka from source using git and the make install command, after changing the requirements to tensorflow-gpu.
I can now run medaka_consensus, but I am wondering whether it is really using the GPU, or still the CPU.
A few things make me doubt it:
I can monitor the load and clock speeds of the video card while it runs. When idle it reports this:

#Time        gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
#HH:MM:SS    Idx     W     C     C     %     %     %     %   MHz   MHz
 11:00:20      0    17    43     -     0     2     0     0   405   300
 11:00:21      0    17    43     -     0     2     0     0   405   300

nvidia-smi shows the memory usage and the programs using the card when idle:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   43C    P8    17W / 300W |     99MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1922      G   /usr/bin/X                                    39MiB |
|    0      2633      G   /usr/bin/gnome-shell                          58MiB |
+-----------------------------------------------------------------------------+

If I fire up the medaka consensus tool for the E. coli example, I see the following on stdout. (The consensus part takes a bit less than 2 minutes, which is much faster than the 49 minutes nanopolish takes, but also much slower than the 7 seconds reported in the benchmark info.)

(medaka) [geest@gt-mapper medaka_walkthrough]$ medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${CONSENSUS} -t ${NPROC}
Checking program versions
Program    Version    Required   Pass     
bgzip      1.9        1.9        True     
minimap2   2.11       2.11       True     
samtools   1.9        1.9        True     
tabix      1.9        1.9        True     
Warning: Output consensus already exists, may use old results.
Not aligning basecalls to draft, calls_to_draft.bam exists.
Running medaka consensus
[11:08:19 - Predict] Processing region(s): utg000001c:0-4702069
[11:08:19 - Predict] Setting tensorflow threads to 4.
[11:08:19 - Predict] Processing 5 long region(s) with batching.
[11:08:19 - ModelLoad] Building model (steps, features, classes): (10000, 10, 5)
[11:08:19 - ModelLoad] With cudnn: False
....
[11:08:45 - PWorker] 18.4% Done (0.9/4.7 Mbases) in 24.6s
....
[11:10:02 - PWorker] 100.0% Done (4.7/4.7 Mbases) in 101.8s
....
[11:10:04 - Stitch] Processing utg000001c.

During the run I see that the tool is using some memory (about 150 MiB) on the GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P2    52W / 300W |    264MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1922      G   /usr/bin/X                                    39MiB |
|    0      2189      C   .../geest/bin/medaka-0.8.0/venv/bin/python   155MiB |
|    0      2633      G   /usr/bin/gnome-shell                          57MiB |
+-----------------------------------------------------------------------------+

Also, the load monitor shows a short elevation of clock speed and usage during the run, but only for about 15 seconds before the card appears idle again:

#Time        gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
#HH:MM:SS    Idx     W     C     C     %     %     %     %   MHz   MHz
 11:08:19      0    15    43     -     0     2     0     0   405   300
 11:08:20      0    52    44     -     1     1     0     0  6800  1350
 11:08:21      0    52    44     -     0     0     0     0  6800  1350
 11:08:34      0    53    45     -     0     0     0     0  6800  1350
 11:08:35      0    53    45     -     0     0     0     0  6800  1350
....
 11:08:36      0    34    44     -     0     0     0     0   405   420
 11:08:37      0    17    44     -     0     0     0     0   405   315
 11:08:38      0    16    44     -     0     2     0     0   405   315
 11:08:39      0    16    44     -     0     2     0     0   405   300

So it seems that something is using the GPU, but barely.
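To put a number on "barely", the `sm` (%) column of `nvidia-smi dmon` output like the tables above can be averaged with a small helper. This is a hypothetical, throwaway script (not part of medaka); the column order is assumed to match the dmon header shown above:

```python
def mean_sm_utilisation(dmon_text):
    """Average the `sm` (%) column of `nvidia-smi dmon` output.

    Assumed column order (as in the dmon header above):
    time, gpu, pwr, gtemp, mtemp, sm, mem, enc, dec, mclk, pclk
    """
    values = []
    for line in dmon_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the two '#'-prefixed header lines
        fields = line.split()
        values.append(float(fields[5]))  # sm is the 6th column
    return sum(values) / len(values) if values else 0.0
```

Feeding it the monitoring output above gives an average sm utilisation well under 1%, which matches the impression that the GPU is nearly idle.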
If I force the batch size to something very big (-b 1000), I can make it crash to capture the error message:

[11:14:47 - Sampler] Took 1.56s to make features.
[11:14:47 - Sampler] Pileup for utg000001c:3999000.0-4702068.0 is of width 1485904
Traceback (most recent call last):
....
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10000,1000,128] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu

I read in other reports that the error says "by allocator gpu", but my error says "cpu" (so I guess that I do not have enough system memory).
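For what it's worth, the device the allocator ran out of memory on can be read straight from the error text. A tiny illustrative parser (not part of medaka or TensorFlow), assuming the `/device:XXX:n` form seen in the traceback above:

```python
import re

def oom_device(message):
    """Extract the device (e.g. 'CPU:0' or 'GPU:0') from a TensorFlow
    ResourceExhaustedError message, or None if no device is mentioned."""
    match = re.search(r"/device:([A-Z]+:\d+)", message)
    return match.group(1) if match else None
```

Applied to the traceback above it returns `CPU:0`, i.e. the OOM happened in host memory, not on the card.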

The Python package listing shows both tensorflow and tensorflow-gpu:

ls -l  ~/bin/medaka-0.8.0/venv/lib/python3.6/site-packages/
....
tensorboard
tensorboard-1.14.0.dist-info
tensorflow
tensorflow_estimator
tensorflow_estimator-1.14.0.dist-info
....

Any ideas on this? Is it using the GPU, or did I miss something?
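One direct way to check is to ask TensorFlow itself which devices it can see; in TF 1.x, `tensorflow.python.client.device_lib.list_local_devices()` returns descriptors for all visible devices. A minimal sketch of a check over the resulting device names (the helper itself is hypothetical, and kept TF-free so it stands alone):

```python
def has_gpu(device_names):
    """Return True if any visible TensorFlow device is a GPU.

    `device_names` would come from something like (TF 1.x API, assumed):
        from tensorflow.python.client import device_lib
        names = [d.name for d in device_lib.list_local_devices()]
    A CPU-only TensorFlow typically reports only '/device:CPU:0'.
    """
    return any("GPU" in name for name in device_names)
```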

@cjw85
Member

cjw85 commented Jul 11, 2019

I believe you are correct that the GPU is not being used. A tell-tale sign is this line in the log:

[11:08:19 - ModelLoad] With cudnn: False

indicating that TensorFlow has not created a GPU-optimised model. Do you have cuDNN installed on your machine as well as CUDA 10? It is usually acquired as a separate download.
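As a quick programmatic check, that log line can be scanned for with a throwaway helper (not part of medaka; it assumes the `With cudnn:` log format shown above):

```python
def used_cudnn(log_text):
    """Scan a medaka log for the 'With cudnn:' line.

    Returns True if a cuDNN-optimised model was built, False if not,
    and None if the line is absent from the log.
    """
    for line in log_text.splitlines():
        if "With cudnn:" in line:
            # Split on the last ':' so timestamps earlier in the line are ignored.
            return line.rsplit(":", 1)[1].strip() == "True"
    return None
```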

That being said, I don't think having both tensorflow and tensorflow-gpu installed in your environment is correct. Try uninstalling both and then installing just tensorflow-gpu, something like:

pip uninstall tensorflow-gpu tensorflow
pip install tensorflow-gpu==1.14.0

Or just run make clean install after editing requirements.txt to specify the GPU version.

So a complete example:

git clone git@github.com:nanoporetech/medaka.git
cd medaka
sed -i "s/tensorflow/tensorflow-gpu/" requirements.txt
make install
. venv/bin/activate
cd ..
wget https://s3-eu-west-1.amazonaws.com/ont-research/medaka_walkthrough_no_reads.tar.gz
tar -xzf medaka_walkthrough_no_reads.tar.gz
cd data
medaka_consensus -d draft_assm.fa -i basecalls.fa -t 8 -b 100

Note the change in batch size here. The default used to be appropriate for a GPU with 11 GB (I'm using a 1080Ti), but it looks like something has changed recently that makes the default too big. The relevant part of stdout for the above:

Running medaka consensus
[10:55:18 - Predict] Processing region(s): utg000001c:0-4703280
[10:55:18 - Predict] Setting tensorflow threads to 8.
[10:55:18 - Predict] Processing 5 long region(s) with batching.
[10:55:18 - ModelLoad] Building model (steps, features, classes): (10000, 10, 5)
[10:55:18 - ModelLoad] With cudnn: True
[10:55:18 - ModelLoad] Loading weights from /media/scratch/cwright/medaka_gh65/medaka/venv/lib/python3.5/site-packages/medaka-0.8.0-py3.5-linux-x86_64.egg/medaka/data/r941_min_high_model.hdf5
[10:55:19 - PWorker] Running inference for 4.7M draft bases.
[10:55:19 - Sampler] Initializing sampler for consensus of region utg000001c:0-1000000.
[10:55:20 - Feature] Processed utg000001c:0.0-999999.1 (median depth 76.0)
[10:55:20 - Sampler] Took 1.34s to make features.
[10:55:20 - Sampler] Pileup for utg000001c:0.0-999999.1 is of width 2073349
[10:55:20 - Sampler] Initializing sampler for consensus of region utg000001c:999000-2000000.
[10:55:22 - Feature] Processed utg000001c:999000.0-1999999.0 (median depth 85.0)
[10:55:22 - Sampler] Took 1.62s to make features.
[10:55:22 - Sampler] Pileup for utg000001c:999000.0-1999999.0 is of width 2190799
[10:55:22 - Sampler] Initializing sampler for consensus of region utg000001c:1999000-3000000.
[10:55:23 - Feature] Processed utg000001c:1999000.0-2999999.1 (median depth 89.0)
[10:55:23 - Sampler] Took 1.65s to make features.
[10:55:23 - Sampler] Pileup for utg000001c:1999000.0-2999999.1 is of width 2203455
[10:55:23 - Sampler] Initializing sampler for consensus of region utg000001c:2999000-4000000.
[10:55:25 - Feature] Processed utg000001c:2999000.0-3999999.1 (median depth 89.0)
[10:55:25 - Sampler] Took 1.63s to make features.
[10:55:25 - Sampler] Pileup for utg000001c:2999000.0-3999999.1 is of width 2208176
[10:55:25 - Sampler] Initializing sampler for consensus of region utg000001c:3999000-4703280.
[10:55:26 - Feature] Processed utg000001c:3999000.0-4703279.0 (median depth 80.0)
[10:55:26 - Sampler] Took 1.18s to make features.
[10:55:26 - Sampler] Pileup for utg000001c:3999000.0-4703279.0 is of width 1488424
[10:55:29 - PWorker] 97.3% Done (4.6/4.7 Mbases) in 10.5s
[10:55:30 - PWorker] All done, 0 remainder regions.
[10:55:30 - Predict] Finished processing all regions.
Running medaka stitch
[10:55:32 - DataIndex] Loaded sample-index from 1/1 (100.00%) of feature files.
[10:55:32 - Stitch] Processing utg000001c.
Polished assembly written to medaka/consensus.fasta, have a nice day.

During which gpustat shows:

(screenshot of gpustat output)

@HenrivdGeest
Author

Thanks. I figured that CUDA 10.1 is not compatible with TensorFlow 1.14.0. I tried installing the tensorflow 2.0 package, but that does not work at all. I will try to get a Docker image for this.

@cjw85
Member

cjw85 commented Jul 19, 2019

Hi @HenrivdGeest,

Have you managed to get something working? My machine has NVIDIA driver 418.67, CUDA 10.1 and cuDNN 7.4.2. I don't think there is a problem with CUDA 10.1 and tensorflow 1.14.0 per se; I think the issue is more likely the version of cuDNN you have installed.

The PyPI tensorflow 1.12.0 package complains when the correct cuDNN library cannot be found, which made things fairly obvious. We tested the behaviour of medaka with the tf 1.14 package in the absence of cuDNN 7.4: it does not complain, and carries on as in your original post.
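As an aside, the installed cuDNN version can be read from the version macros in cudnn.h (commonly found under /usr/local/cuda/include/). A small illustrative parser, assuming the standard CUDNN_MAJOR/CUDNN_MINOR/CUDNN_PATCHLEVEL defines used by cuDNN headers:

```python
import re

def cudnn_version(header_text):
    """Extract (major, minor, patch) from cudnn.h's version defines.

    Returns a tuple such as (7, 4, 2); missing macros come back as None.
    """
    version = {}
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+%s\s+(\d+)" % macro, header_text)
        if match:
            version[macro] = int(match.group(1))
    return (version.get("CUDNN_MAJOR"),
            version.get("CUDNN_MINOR"),
            version.get("CUDNN_PATCHLEVEL"))
```

Usage would be, e.g., `cudnn_version(open("/usr/local/cuda/include/cudnn.h").read())`, to confirm a 7.4.x install.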

Various cuDNN versions can be downloaded from: https://developer.nvidia.com/rdp/cudnn-archive

@cjw85
Member

cjw85 commented Aug 9, 2019

@HenrivdGeest,

We have found a workaround for using medaka with a 2080 GPU which is working for @devindrown. Can you try this if you are still having problems?

@cjw85
Member

cjw85 commented Sep 11, 2019

The latest release, v0.9.0, has additional logging concerning GPU use and tips for RTX users.

@cjw85 cjw85 closed this as completed Sep 11, 2019