
AF2 non-docker in a cluster environment #339

Closed
JuergenUniVie opened this issue Jan 17, 2022 · 6 comments
Labels
duplicate (This issue or pull request already exists), setup (Issue setting up AlphaFold usage), question (Further information is requested)

Comments

@JuergenUniVie

Hello,

is there a way to run AlphaFold in a cluster environment with a job scheduling system (Slurm/OpenPBS)? I have several nodes with powerful GPUs available and would like to use them as well.

best wishes,
Juergen

@Augustin-Zidek added the setup, question, and duplicate labels on Jan 19, 2022
@DelilahYM

We are running AlphaFold on our cluster:
1st, set up AlphaFold on your cluster, including downloading all the databases.
2nd, allocate resources with a GPU.
3rd, run it (with the proper options, of course).
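A minimal sketch of steps 2 and 3 on a Slurm cluster; the partition name, resource sizes, paths, and dates below are placeholder assumptions to adapt to your own site, and the flags follow the kalininalab run_alphafold.sh usage:

```shell
# Step 2: allocate a GPU node interactively (partition/resources are examples).
salloc --partition=gpu --gres=gpu:1 --cpus-per-task=8 --mem=64G --time=24:00:00

# Step 3: on the allocated node, run AlphaFold with the proper options,
# e.g. via the kalininalab wrapper script:
#   -d databases dir, -o output dir, -f query fasta, -t max template date
bash run_alphafold.sh -d /path/to/alphafold_databases -o /path/to/output \
  -f /path/to/query.fasta -t 2022-01-01
```

(On OpenPBS the allocation step would instead use qsub -I with equivalent resource requests.)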

@JuergenUniVie
Author

Dear DelilahYM,

1st, all the databases are installed and AF2 runs without Docker.
I use the alphafold script from https://github.com/kalininalab/alphafold_non_docker
Would you please help me and describe how to get it working with Slurm or OpenPBS?
Did you write a script for this, or do you start it via run_alphafold.sh?
What did you have to adapt?

many thanks and best wishes

@DelilahYM

Funny thing: I am actually working on my installation at the moment (the non-docker version, of course) and just tested it working.
We use Slurm.
Install it as instructed at https://github.com/kalininalab/alphafold_non_docker
Create your own conda env with the required packages, and make sure to check your CUDA version etc. so that the specific versions of the packages (jax, jaxlib) are supported.
I downloaded the databases with https://github.com/kalininalab/alphafold_non_docker/blob/main/download_db.sh (do make sure you have enough storage). It takes a long time to download all the databases, so I suggest you write a Slurm script and submit a job to run the download script.
Once you have everything downloaded, one thing I noticed is that alphafold/alphafold/common/stereo_chemical_props.txt is missing (it was supposed to be downloaded during the Docker build, if Docker is used). A previous version of AlphaFold had it in the Git repo, but somehow the new version doesn't.
Once you have that, you can test with your own data. I used https://github.com/kalininalab/alphafold_non_docker/blob/main/run_alphafold.sh
To run this script you need to be in the alphafold folder that you pulled earlier in the setup, so make sure you cd into it in your Slurm script if your submission script lives somewhere else.
Each cluster is a little different, so the Slurm script will look different.
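Putting the steps above into a job script, a submission script could look roughly like the sketch below. The conda env name (af2), all paths, and the resource numbers are placeholder assumptions; the flags follow the kalininalab run_alphafold.sh usage:

```shell
#!/bin/bash
#SBATCH --job-name=alphafold
#SBATCH --partition=gpu        # your cluster's GPU partition
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00

# Activate the conda env that holds jax/jaxlib and the other dependencies.
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate af2

# run_alphafold.sh must be invoked from inside the alphafold checkout,
# so cd there even if this submission script lives elsewhere.
cd /path/to/alphafold

# -d databases dir, -o output dir, -f query fasta, -t max template date
bash run_alphafold.sh \
  -d /path/to/alphafold_databases \
  -o /path/to/output \
  -f /path/to/query.fasta \
  -t 2022-01-01
```

Adapt the partition, resources, and paths to your own cluster; as noted, each site looks a little different.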

@Augustin-Zidek
Collaborator

We currently provide support only for running via Docker.

You can find more information about running under Singularity here: #10 and #24.

@yijietseng

yijietseng commented Oct 6, 2022

Hello,

We are trying to set up non-docker AF2 on our cluster using the scripts from https://github.com/kalininalab/alphafold_non_docker, but while testing we got the following error. We just want to see if any of you have suggestions on how to fix this problem.

I1006 07:12:34.027368 140369674590016 templates.py:857] Using precomputed obsolete pdbs ./AF2_DB/pdb_mmcif/obsolete.dat.
I1006 07:12:34.936928 140369674590016 tpu_client.py:54] Starting the local TPU driver.
I1006 07:12:34.949247 140369674590016 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I1006 07:12:35.068033 140369674590016 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I1006 07:12:41.648868 140369674590016 run_alphafold.py:376] Have 5 models: ['model_1_pred_0', 'model_2_pred_0', 'model_3_pred_0', 'model_4_pred_0', 'model_5_pred_0']
I1006 07:12:41.649096 140369674590016 run_alphafold.py:393] Using random seed 320158810403615912 for the data pipeline
I1006 07:12:41.649371 140369674590016 run_alphafold.py:161] Predicting 1TEL_WT3
I1006 07:12:41.685194 140369674590016 jackhmmer.py:133] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/jackhmmer -o /dev/null -A /tmp/tmpyculok8_/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./input/1TEL_WT3.fasta ./AF2_DB/uniref90/uniref90.fasta"
I1006 07:12:41.761307 140369674590016 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I1006 07:18:21.009910 140369674590016 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 339.248 seconds
I1006 07:18:21.044890 140369674590016 jackhmmer.py:133] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/jackhmmer -o /dev/null -A /tmp/tmpnwqqqyqw/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 ./input/1TEL_WT3.fasta ./AF2_DB/mgnify/mgy_clusters_2018_12.fa"
I1006 07:18:21.138897 140369674590016 utils.py:36] Started Jackhmmer (mgy_clusters_2018_12.fa) query
I1006 07:24:25.972769 140369674590016 utils.py:40] Finished Jackhmmer (mgy_clusters_2018_12.fa) query in 364.833 seconds
I1006 07:24:26.036239 140369674590016 hhsearch.py:85] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/hhsearch -i /tmp/tmpj_d_icnl/query.a3m -o /tmp/tmpj_d_icnl/output.hhr -maxseq 1000000 -d ./AF2_DB/pdb70/pdb70"
I1006 07:24:26.117509 140369674590016 utils.py:36] Started HHsearch query
I1006 07:28:41.636944 140369674590016 utils.py:40] Finished HHsearch query in 255.519 seconds
I1006 07:28:41.699008 140369674590016 hhblits.py:128] Launching subprocess "/home/tseng3/miniconda3/envs/af2/bin/hhblits -i ./input/1TEL_WT3.fasta -cpu 4 -oa3m /tmp/tmpgzc0tjn9/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d ./AF2_DB/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d ./AF2_DB/uniclust30/uniclust30_2018_08/uniclust30_2018_08"
I1006 07:28:41.779823 140369674590016 utils.py:36] Started HHblits query
I1006 10:16:02.015240 140369674590016 utils.py:40] Finished HHblits query in 10040.235 seconds
I1006 10:16:02.124383 140369674590016 templates.py:878] Searching for template for: MGSSHHHHHHSIALPAHLRLQPIYWSRDDVAQWLKWAENEFSLSPIDSNTFEMNGKALLLLTKEDFRYRSPHSGDELYELLQHILGGGGG
I1006 10:16:02.544437 140369674590016 templates.py:267] Found an exact template match 2qar_B.
I1006 10:16:02.785476 140369674590016 templates.py:267] Found an exact template match 1sv0_B.
I1006 10:16:02.953306 140369674590016 templates.py:267] Found an exact template match 1sv4_B.
I1006 10:16:04.026005 140369674590016 templates.py:267] Found an exact template match 1sxd_A.
I1006 10:16:05.506565 140369674590016 templates.py:267] Found an exact template match 1x66_A.
I1006 10:16:07.745375 140369674590016 templates.py:267] Found an exact template match 2jv3_A.
I1006 10:16:10.091728 140369674590016 templates.py:267] Found an exact template match 2dkx_A.
I1006 10:16:11.120151 140369674590016 templates.py:267] Found an exact template match 1sxe_A.
I1006 10:16:11.294945 140369674590016 templates.py:267] Found an exact template match 1ji7_B.
I1006 10:16:11.606558 140369674590016 templates.py:267] Found an exact template match 4mhv_B.
I1006 10:16:11.765944 140369674590016 templates.py:267] Found an exact template match 2qb1_B.
I1006 10:16:12.136505 140369674590016 templates.py:267] Found an exact template match 2qb0_D.
I1006 10:16:12.348974 140369674590016 templates.py:267] Found an exact template match 5l0p_A.
I1006 10:16:14.605201 140369674590016 templates.py:267] Found an exact template match 2ytu_A.
I1006 10:16:14.614635 140369674590016 templates.py:267] Found an exact template match 5l0p_A.
I1006 10:16:15.052233 140369674590016 templates.py:267] Found an exact template match 1lky_C.
I1006 10:16:15.056710 140369674590016 templates.py:267] Found an exact template match 1sv0_C.
I1006 10:16:16.604707 140369674590016 templates.py:267] Found an exact template match 2e8p_A.
I1006 10:16:16.611046 140369674590016 templates.py:267] Found an exact template match 5l0p_A.
I1006 10:16:17.920321 140369674590016 templates.py:267] Found an exact template match 1wwu_A.
I1006 10:16:17.986045 140369674590016 pipeline.py:234] Uniref90 MSA size: 2031 sequences.
I1006 10:16:17.986166 140369674590016 pipeline.py:235] BFD MSA size: 1064 sequences.
I1006 10:16:17.986241 140369674590016 pipeline.py:236] MGnify MSA size: 24 sequences.
I1006 10:16:17.986308 140369674590016 pipeline.py:237] Final (deduplicated) MSA size: 2573 sequences.
I1006 10:16:17.986528 140369674590016 pipeline.py:239] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I1006 10:16:20.240771 140369674590016 run_alphafold.py:190] Running model model_1_pred_0 on 1TEL_WT3
2022-10-06 10:16:23.663265: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/10.1/lib64:/apps/cuda/10.1/nvvm/lib64:/apps/cuda/10.1/jre/lib
2022-10-06 10:16:23.687921: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
I1006 10:16:24.459771 140369674590016 model.py:165] Running predict with shape(feat) = {'aatype': (4, 90), 'residue_index': (4, 90), 'seq_length': (4,), 'template_aatype': (4, 4, 90), 'template_all_atom_masks': (4, 4, 90, 37), 'template_all_atom_positions': (4, 4, 90, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 90), 'msa_mask': (4, 508, 90), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 90, 3), 'template_pseudo_beta_mask': (4, 4, 90), 'atom14_atom_exists': (4, 90, 14), 'residx_atom14_to_atom37': (4, 90, 14), 'residx_atom37_to_atom14': (4, 90, 37), 'atom37_atom_exists': (4, 90, 37), 'extra_msa': (4, 5120, 90), 'extra_msa_mask': (4, 5120, 90), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 90), 'true_msa': (4, 508, 90), 'extra_has_deletion': (4, 5120, 90), 'extra_deletion_value': (4, 5120, 90), 'msa_feat': (4, 508, 90, 49), 'target_feat': (4, 90, 22)}
2022-10-06 10:17:03.914926: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /apps/cuda/10.1/lib64:/apps/cuda/10.1/nvvm/lib64:/apps/cuda/10.1/jre/lib
Traceback (most recent call last):
File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 422, in <module>
app.run(main)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 398, in main
predict_structure(
File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 198, in predict_structure
prediction_result = model_runner.predict(processed_feature_dict,
File "/nobackup/scratch/usr/tseng3/af2/alphafold-2.2.0/alphafold/model/model.py", line 167, in predict
result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/api.py", line 424, in cache_miss
out_flat = xla.xla_call(
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 1560, in bind
return call_bind(self, fun, *args, **params)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 1551, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 1563, in process
return trace.process_call(self, fun, tracers, params)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/core.py", line 606, in process_call
return primitive.impl(f, *tracers, **params)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 592, in _xla_call_impl
compiled_fun = _xla_callable(fun, device, backend, name, donated_invars,
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/linear_util.py", line 262, in memoized_fun
ans = call(fun, *args)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 723, in _xla_callable
out_nodes = jaxpr_subcomp(
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule
new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 1040, in f
outs = jaxpr_subcomp(c, jaxpr, backend, axis_env, _xla_consts(c, consts),
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 462, in jaxpr_subcomp
ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule
new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/interpreters/xla.py", line 453, in jaxpr_subcomp
ans = rule(c, *in_nodes, **eqn.params)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jax/_src/lax/linalg.py", line 500, in _eigh_cpu_gpu_translation_rule
v, w, info = syevd_impl(c, operand, lower=lower)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jaxlib/cusolver.py", line 281, in syevd
lwork, opaque = cusolver_kernels.build_syevj_descriptor(
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: cuSolver internal error

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 422, in <module>
app.run(main)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 398, in main
predict_structure(
File "/home/tseng3/compute/af2/alphafold-2.2.0/run_alphafold.py", line 198, in predict_structure
prediction_result = model_runner.predict(processed_feature_dict,
File "/nobackup/scratch/usr/tseng3/af2/alphafold-2.2.0/alphafold/model/model.py", line 167, in predict
result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "/home/tseng3/miniconda3/envs/af2/lib/python3.8/site-packages/jaxlib/cusolver.py", line 281, in syevd
lwork, opaque = cusolver_kernels.build_syevj_descriptor(
RuntimeError: cuSolver internal error
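For context, the warnings in the log above show jaxlib failing to dlopen libcusolver.so.11 (a CUDA 11 library) while LD_LIBRARY_PATH points at a CUDA 10.1 install, so one plausible first check is whether a CUDA 11 toolkit is visible to the job. A sketch assuming an environment-modules cluster; the module name is hypothetical and site-specific:

```shell
module avail cuda                  # list the CUDA toolkits installed on the cluster
module load cuda/11.1              # hypothetical CUDA 11 module matching the jaxlib build
echo "$LD_LIBRARY_PATH"            # should now include a CUDA 11 lib64 directory
# Confirm the library jaxlib is looking for actually exists there:
find "${CUDA_HOME:-/usr/local/cuda}/lib64" -name 'libcusolver.so.11*'
```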

@avapirev

avapirev commented Feb 9, 2023

Considering that shared HPC clusters prefer Singularity/Apptainer and other rootless container managers (in order to avoid Docker's root requirement), I do not see why there is no Singularity/Apptainer support. The requirements of AlphaFold are well beyond the compute capabilities of even small lab clusters, not to mention personal computers, which are where users might actually have root permissions.
