jax and jaxlib versions #20

Closed
huhlim opened this issue Jul 17, 2021 · 6 comments

Comments

@huhlim

huhlim commented Jul 17, 2021

TL;DR: is it okay to use jaxlib 0.1.68+cuda110 instead of 0.1.69+cuda110?

I have been trying to write a script and build a conda environment that does not use Docker. When I used the same jax and jaxlib versions defined in docker/Dockerfile, I ran into issues at inference time: the scripts worked fine for model_{1,3,4} but raised CUDA_ERROR_ILLEGAL_ADDRESS errors for model_{2,5}. I have no idea why this happened.
So I tested many variants of the environment and found that jax==0.2.17 (probably the same version as the original) and jaxlib==0.1.68+cuda110 (the version installed by pip3 install jax[cuda110] -f https://storage.googleapis.com/jax-releases/jax_releases.html) run smoothly without Docker in my custom conda environment.

@tfgg
Collaborator

tfgg commented Jul 19, 2021

Hi, we require jaxlib version 0.1.69 to be able to use CUDA unified memory for running long sequences. If you don't need this you can probably run with 0.1.68, but that might be related to the illegal address error you're seeing. How long was the sequence you were trying to run?

Some of the other open issues about CUDA versions might also be of help.
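As a quick sanity check, it can also help to confirm which jax/jaxlib versions and which devices your environment actually picks up; a minimal snippet using plain JAX calls (nothing AlphaFold-specific):

```python
import jax
import jaxlib

# Report the versions actually installed in the environment,
# plus the accelerators JAX can see.
print("jax:", jax.__version__)
print("jaxlib:", jaxlib.__version__)
print("devices:", jax.devices())
```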

@huhlim
Author

huhlim commented Jul 19, 2021

I was benchmarking with the CASP14 targets. T1026 (172 residues) raised the issue.
I realized that some of the targets still hit CUDA_ERROR_ILLEGAL_ADDRESS, even though I used jax==0.2.17 and jaxlib==0.1.68+cuda110. Those targets ran fine on CPUs.

My system information:

  • NVIDIA driver: 450.36.06
  • CUDA version: 11.0
  • jax: 0.2.17
  • jaxlib: 0.1.68+cuda110
  • tensorflow: 2.5.0
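For the CPU comparison mentioned above, here is a minimal sketch of forcing JAX onto the CPU; this relies on the standard JAX_PLATFORM_NAME environment variable and is not AlphaFold-specific:

```python
import os

# Must be set before jax is imported, otherwise the GPU backend may already be initialized.
os.environ["JAX_PLATFORM_NAME"] = "cpu"

import jax
print(jax.devices())  # should now list only CPU devices
```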

@tfgg
Collaborator

tfgg commented Jul 19, 2021

That's a very small protein, so I'm surprised it's an issue. What GPU are you using? Is it possible to try using the Dockerfile?

You could try disabling unified memory by commenting out these two lines in your script, if you have them:
https://github.com/deepmind/alphafold/blob/main/docker/run_docker.py#L171-L172
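For reference, those two lines presumably set the unified-memory environment variables; a minimal sketch of the equivalent outside Docker (variable names assumed from the run_docker.py link above):

```python
import os

# Enable CUDA unified memory so XLA can oversubscribe GPU memory for long sequences;
# these must be set before JAX initializes its GPU backend.
os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"

# To disable unified memory, leave both variables unset
# (i.e. comment out the two assignments above).
```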

@huhlim
Author

huhlim commented Jul 19, 2021

I tested with Quadro RTX 6000 and RTX 2080Ti.
I have tested the following configurations:
(1) jaxlib==0.1.68+cuda110, jax==0.2.17, cudatoolkit=11.0.3 for my custom non-Docker version
(2) jaxlib==0.1.69+cuda110, jax==0.2.17, cudatoolkit=11.0.3 for my custom non-Docker version
(3) (1) or (2) + commenting out the two lines for the unified memory
(4) the same as (2), but with a docker container (the original one)

There was no issue with (4)... so there may be some difference between my non-Docker setup and the original Docker one (I thought I had built my custom non-Docker environment with exactly the same library versions). I will try it again.

@chrisroat

@huhlim Did you solve your CUDA_ERROR_ILLEGAL_ADDRESS problems? I just ran ~100 proteins from an internal sample, and this cropped up for me in some cases. As I investigate, it would be helpful if you could follow up here with anything you learned and/or how you resolved your problem. (I am using Docker on an A100.)

@huhlim
Author

huhlim commented Aug 4, 2021

@chrisroat I could not fully resolve the issue. When I turned off the jax.jit compilation of the models (in the initialization of the RunModel class in alphafold/model/model.py), it reduced the chance of the error but did not eliminate it. I have not had the issue with my Docker setup, so I guess my problem is related to our cluster setup... Unfortunately, I gave up on tackling the issue.
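For context, a minimal, self-contained sketch of the kind of jit toggle I mean; the forward function below is only a placeholder, not the actual AlphaFold network from alphafold/model/model.py:

```python
import jax
import jax.numpy as jnp

# Placeholder forward pass standing in for the real model's apply function.
def forward_fn(x):
  return jnp.tanh(x).sum()

use_jit = False  # False = skip XLA compilation, as in the experiment described above
apply_fn = jax.jit(forward_fn) if use_jit else forward_fn

print(apply_fn(jnp.ones((4,))))
```

(jax.disable_jit() used as a context manager is another way to run everything un-jitted without editing the model code.)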
