
When I downgraded jax, the code can run on GPU, but "Not Enough GPU memory"? #228

Closed
Tiramisu023 opened this issue Apr 15, 2024 · 1 comment

Tiramisu023 commented Apr 15, 2024

What is your installation issue?

Hello, I ran into the "Not Enough GPU memory" error after solving the problem of jax not recognizing the GPU device.

Here is the sequence of errors.

I installed LocalColabFold using "install_colabbatch_linux.sh". When I ran "colabfold_batch", the error "no GPU detected, will be using CPU" occurred. I then checked whether jax could recognize the GPU device (refer to #209).

$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
# CUDA backend failed to initialize: Unable to load cuDNN. Is it installed? (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
# cpu

Then I checked the jax and jaxlib version,

COLABFOLDDIR="/public1/users/liyulong/software/localcolabfold"
"$COLABFOLDDIR/colabfold-conda/bin/pip" list | grep "jax"
# jax                          0.4.23
# jaxlib                       0.4.23+cuda11.cudnn86
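The `+cuda11.cudnn86` suffix is jaxlib's local version tag, recording which CUDA and cuDNN the wheel was built against; a mismatch between that tag and the libraries actually available on the machine is one common cause of the "Unable to load cuDNN" failure above. A small sketch for pulling those numbers out of the tag (the helper name is mine):

```python
import re

def parse_jaxlib_build(version: str):
    """Split a jaxlib version like '0.4.23+cuda11.cudnn86' into its parts.

    Returns (base_version, cuda_major, cudnn_version); the CUDA/cuDNN
    fields are None for a CPU-only wheel such as '0.4.23'.
    """
    base, _, local = version.partition("+")
    m = re.fullmatch(r"cuda(\d+)\.cudnn(\d+)", local) if local else None
    if m:
        return base, int(m.group(1)), int(m.group(2))
    return base, None, None

print(parse_jaxlib_build("0.4.23+cuda11.cudnn86"))  # ('0.4.23', 11, 86)
print(parse_jaxlib_build("0.4.7"))                  # ('0.4.7', None, None)
```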

I downgraded to "jax==0.4.7, jaxlib==0.4.7+cuda11.cudnn86" (refer to #209). After that, jax recognized the GPU device.

$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
gpu

Then colabfold_batch failed with "No module named 'jax.extend'" (refer to #224). I reinstalled "dm-haiku==0.0.10", and colabfold_batch could then run on the GPU device. However, I hit a new error: "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.".
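The 'jax.extend' error appears because newer dm-haiku releases import `jax.extend`, which older jax versions such as 0.4.7 do not provide, so dm-haiku has to be pinned to 0.0.10. A hedged way to probe for a submodule before launching a long job (`module_available` is my name; stdlib modules stand in for jax in the demo lines):

```python
import importlib.util

def module_available(name: str) -> bool:
    """True if `name` (e.g. "jax.extend") can be found without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # The parent package itself is not installed.
        return False

# With jax 0.4.7 installed, module_available("jax.extend") would be False,
# which is why dm-haiku needed the 0.0.10 pin.
print(module_available("email.mime"))     # existing stdlib submodule -> True
print(module_available("email.no_such"))  # missing submodule -> False
```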

I have two RTX 2080 Ti GPUs (11 GB each).

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:17:00.0 Off |                  N/A |
| 38%   41C    P0              52W / 250W |      0MiB / 11264MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:25:00.0 Off |                  N/A |
| 25%   30C    P0              21W / 250W |      0MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

My FASTA file contains 450 amino acids. Is this problem caused by insufficient video memory? It seems that even 40 GB of video memory can hit this problem (refer to #90).

In addition, since I had added CUDA 12.1 to my $PATH, I also tried modifying "install_colabbatch_linux.sh" as suggested by A-Talavera (refer to #210).

I changed
"$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda11_pip]==0.4.23"
to
"$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda12_pip]==0.4.23"

jaxlib-0.4.23+cuda12.cudnn89 is then installed by default. I then downgraded to "jaxlib-0.4.7+cuda12.cudnn88" following the same process as above. colabfold_batch runs on the GPU, but it still reports "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details". And in #224, you said "jax-0.4.23+cuda11.cudnn86" was also fine for CUDA 12.1.

Computational environment

  • OS:
$ cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)
  • CUDA version (output of /usr/local/cuda/bin/nvcc --version):
$ /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Mar_21_19:15:46_PDT_2021
Cuda compilation tools, release 11.3, V11.3.58
Build cuda_11.3.r11.3/compiler.29745058_0

Since LocalColabFold requires CUDA 11.8 or later, I added CUDA 12.1 to my $PATH.

$ which nvcc
/usr/local/cuda-12.1/bin/nvcc

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Could it be that the program calls CUDA 11.3 (/usr/local/cuda/bin/nvcc) by default, instead of the CUDA 12.1 in my $PATH?
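How the lookup behaves can be checked without real CUDA installs: the shell resolves a bare `nvcc` by walking $PATH left to right, while an absolute path like /usr/local/cuda/bin/nvcc bypasses $PATH entirely. A small demo with dummy stub scripts standing in for the two CUDA installs (the temporary directory layout is mine):

```shell
# PATH is searched left to right: the first directory containing an
# `nvcc` wins. Dummy stubs stand in for real CUDA installs here.
workdir=$(mktemp -d)
mkdir -p "$workdir/cuda-11.3/bin" "$workdir/cuda-12.1/bin"
printf '#!/bin/sh\necho release 11.3\n' > "$workdir/cuda-11.3/bin/nvcc"
printf '#!/bin/sh\necho release 12.1\n' > "$workdir/cuda-12.1/bin/nvcc"
chmod +x "$workdir"/cuda-*/bin/nvcc

# Prepending cuda-12.1 shadows cuda-11.3, just as prepending
# /usr/local/cuda-12.1/bin shadows /usr/local/cuda/bin -- but an
# absolute path such as /usr/local/cuda/bin/nvcc ignores PATH entirely.
( export PATH="$workdir/cuda-12.1/bin:$workdir/cuda-11.3/bin:$PATH"; nvcc )  # prints: release 12.1
rm -rf "$workdir"
```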

Looking forward to your reply. Thank you.

Yulong Li

Tiramisu023 (Author) commented

I finally solved this "Not Enough GPU memory" error according to the solution in #224:

pip install --upgrade nvidia-cudnn-cu11==8.5.0.96

This issue can be closed. Sorry to take up your time.
