
When I downgraded jax, the code can run on GPU, but "Not Enough GPU memory"? #228

Closed
Tiramisu023 opened this issue Apr 15, 2024 · 1 comment

Tiramisu023 commented Apr 15, 2024

What is your installation issue?

Hello, I ran into the "Not Enough GPU memory" error after solving the problem of jax not recognizing the GPU device.

Here is the sequence of errors.

I installed LocalColabFold using "install_colabbatch_linux.sh". When I ran "colabfold_batch", the error "no GPU detected, will be using CPU" occurred. I then checked whether jax could recognize the GPU device (refer to #209).

$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
# CUDA backend failed to initialize: Unable to load cuDNN. Is it installed? (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
# cpu

Then I checked the jax and jaxlib version,

COLABFOLDDIR="/public1/users/liyulong/software/localcolabfold"
"$COLABFOLDDIR/colabfold-conda/bin/pip" list | grep "jax"
# jax                          0.4.23
# jaxlib                       0.4.23+cuda11.cudnn86
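The `+cuda11.cudnn86` suffix is jaxlib's local version tag, recording which CUDA and cuDNN the wheel was built against; a mismatch between that tag and the libraries actually available on the machine is one common cause of the "Unable to load cuDNN" failure above. A small sketch for pulling those numbers out of the tag (the helper name is mine):

```python
import re

def parse_jaxlib_build(version: str):
    """Split a jaxlib version like '0.4.23+cuda11.cudnn86' into its parts.

    Returns (base_version, cuda_major, cudnn_version); the CUDA/cuDNN
    fields are None for a CPU-only wheel such as '0.4.23'.
    """
    base, _, local = version.partition("+")
    m = re.fullmatch(r"cuda(\d+)\.cudnn(\d+)", local) if local else None
    if m:
        return base, int(m.group(1)), int(m.group(2))
    return base, None, None

print(parse_jaxlib_build("0.4.23+cuda11.cudnn86"))  # ('0.4.23', 11, 86)
print(parse_jaxlib_build("0.4.7"))                  # ('0.4.7', None, None)
```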

I downgraded to "jax==0.4.7, jaxlib==0.4.7+cuda11.cudnn86" (refer to #209). After that, jax recognized the GPU device.

$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
gpu

Then colabfold_batch failed with "No module named 'jax.extend'" (refer to #224). I reinstalled "dm-haiku==0.0.10", and colabfold_batch could then run on the GPU device. However, I hit a new error: "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.".
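The 'jax.extend' error appears because newer dm-haiku releases import `jax.extend`, which older jax versions such as 0.4.7 do not provide, so dm-haiku has to be pinned to 0.0.10. A hedged way to probe for a submodule before launching a long job (`module_available` is my name; stdlib modules stand in for jax in the demo lines):

```python
import importlib.util

def module_available(name: str) -> bool:
    """True if `name` (e.g. "jax.extend") can be found without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # The parent package itself is not installed.
        return False

# With jax 0.4.7 installed, module_available("jax.extend") would be False,
# which is why dm-haiku needed the 0.0.10 pin.
print(module_available("email.mime"))     # existing stdlib submodule -> True
print(module_available("email.no_such"))  # missing submodule -> False
```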

I have two RTX 2080 Ti GPUs (11 GB each).

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:17:00.0 Off |                  N/A |
| 38%   41C    P0              52W / 250W |      0MiB / 11264MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:25:00.0 Off |                  N/A |
| 25%   30C    P0              21W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

My FASTA file contains 450 amino acids. Is this problem caused by insufficient video memory? It seems that even 40 GB of video memory can hit this problem (refer to #90).

In addition, since I had added CUDA 12.1 to my $PATH, I also tried modifying "install_colabbatch_linux.sh" as suggested by A-Talavera (refer to #210).

I changed
"$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda11_pip]==0.4.23"
to
"$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda12_pip]==0.4.23"

jaxlib-0.4.23+cuda12.cudnn89 is then installed by default. I then downgraded to "jaxlib-0.4.7+cuda12.cudnn88" following the same process as above. colabfold_batch runs on the GPU, but it still reports "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details". And in #224, you said "jax-0.4.23+cuda11.cudnn86" was also fine for CUDA 12.1.

Computational environment

  • OS:
$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
  • CUDA version (output of /usr/local/cuda/bin/nvcc --version):
$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

Since LocalColabFold requires CUDA 11.8 or later, I added CUDA 12.1 to my $PATH.

$ which nvcc
/usr/local/cuda-12.1/bin/nvcc

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Could it be that the program calls CUDA 11.3 (/usr/local/cuda/bin/nvcc) by default, instead of the CUDA 12.1 in my $PATH?
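How the lookup behaves can be checked without real CUDA installs: the shell resolves a bare `nvcc` by walking $PATH left to right, while an absolute path like /usr/local/cuda/bin/nvcc bypasses $PATH entirely. A small demo with dummy stub scripts standing in for the two CUDA installs (the temporary directory layout is mine):

```shell
# PATH is searched left to right: the first directory containing an
# `nvcc` wins. Dummy stubs stand in for real CUDA installs here.
workdir=$(mktemp -d)
mkdir -p "$workdir/cuda-11.3/bin" "$workdir/cuda-12.1/bin"
printf '#!/bin/sh\necho release 11.3\n' > "$workdir/cuda-11.3/bin/nvcc"
printf '#!/bin/sh\necho release 12.1\n' > "$workdir/cuda-12.1/bin/nvcc"
chmod +x "$workdir"/cuda-*/bin/nvcc

# Prepending cuda-12.1 shadows cuda-11.3, just as prepending
# /usr/local/cuda-12.1/bin shadows /usr/local/cuda/bin -- but an
# absolute path such as /usr/local/cuda/bin/nvcc ignores PATH entirely.
( export PATH="$workdir/cuda-12.1/bin:$workdir/cuda-11.3/bin:$PATH"; nvcc )  # prints: release 12.1
rm -rf "$workdir"
```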

Looking forward to your reply. Thank you.

Yulong Li

Tiramisu023 (Author) commented

I finally solved this "Not Enough GPU memory" error according to the solution in #224:

pip install --upgrade nvidia-cudnn-cu11==8.5.0.96

This issue can be closed. Sorry to take up your time.
