This project aims at understanding and quantifying the relationship betweeen pLDDTs, MSAs and embedding-based disorder predictions.
Therefore we developed a Python library and accompanying scripts to calculate Neff (number of effective sequences) and gap count scores.
The commands described in the following subchapters are meant to be executed in a shell in the project root directory.
The project uses Python for scripts. The root directory contains a environment.yml
file that can be used to create an
appropriate environment.
It is recommended to use conda, you can create a new requirement with all dependencies installed like the following:
conda create --file=environment.yml
conda activate env
Note that our project requires Python 3.10.
Install the af22c
package in development mode with the following command in the project root directory:
pip install -e .
Our code can be used either as a Python library or as standalone programs. In the following, we present common usecases for both methods.
You can use our library to calculate Neff or gapcount scores from an MSA in the following way. Specify a torch device in the functions to use GPU acceleration.
>>> from af22c import neff, gapcount
>>> msa = ["SEQWENCE","--QWENCE","--QWEN--","--QEEN--"]; print("\n".join(msa))
SEQWENCE
--QWENCE
--QWEN--
--QEEN--
>>> # calculate Neff and gapcount scores for the MSA
>>> neff(msa, device="cuda")
tensor([1., 1., 3., 3., 3., 3., 2., 2.], device='cuda:0')
>>> gapcount(msa, device="cuda")
tensor([3, 3, 0, 0, 0, 0, 2, 2], device='cuda:0')
In our case, we have MSAs for the entire human proteome. They are inside a .tar file. With our library, you can extract .a3m MSA files from a .tar file and use it as an input for score calculation. In the following, we show how you can calculate an protein-specific average Neff score for APP. Note that a ProteomeTar
can be indexed using the /
operator; like in the example below, where the .tar file containing all human proteins is indexed with Q92624, corresponding to APP.
>>> from af22c import neff
>>> from af22c.utils import ProteomeTar
>>> human = ProteomeTar("data/UP000005640_9606.tar")
>>> neff(human / "Q92624", device="cuda").mean()
tensor(209.4100, device='cuda:0')
Hint: To get output while running, you can specify verbose=True
. For debugging CUDA out-of-memory issues when running neff
, you can lower the batch_size
(default: 2**12
) parameter used for pairwise sequence identity calculation.
TODO: say that we also have ref impl available
We also provide scripts to extract Neff scores on the command line.
./scripts/neff_gpu.py ./data/A0A0A0MRZ7.a3m /tmp/neff.json
cat /tmp/neff.json # only truncated output shown:
[1564.3165283203125, 1660.9517822265625, 1906.4017333984375, 1983.4613037109375, 2283.94921875, 2345.59912109375, 2614.16015625, 2852.3935546875, 2907.48046875, 3112.87890625, 3278.29541015625, 3376.743896484375, 3506.353759765625, ...]
TODO: say that we also have support for .tar files containing entire proteomes.
Our project deals with the comparison between Neff, SETH predictions, and pLDDT scores. Therefore, we need utilities to obtain and handle these pLDDT scores. We include our utility programs in this repository. In the following, we will show how one can gather pLDDT scores from AlphaFold 2's database.
For using the download script, the wget
utility must be installed on the system.
The application relies on data from the AlphaFold 2 database ([1] and [2])
and precomputed disorder predictions from SETH [3]. The download_human.sh
script downloads both datasets into
the ./data
subdirectory. (If not yet present, the directory is created for you.)
./af22c/download_human.sh
Before downloading, the script checks whether the data has changed on the server. To force a download, delete the
corresponding file in the ./data
subdirectory.
The pLDDT scores are stored alongside the 3D residue position inside the PDB
files predicted by AlphaFold 2. The extract_plddts.py
utility extracts and stores them in a ..._plddts.json
file
inside the ./data
subdirectory.
./scripts/extract_plddts.py
Note, that the program extracts the .tar
file downloaded from the AlphaFold 2 database inside a temporary directory.
Since every .pdb.gz
file from the archive is copied, this temporarily requires as much addition disk memory as the
initial download. This happens to ease the use of multiple processes using a Python ProcessPoolExecutor
.
If proteins have more than 2700 residues, they are predicted in fragments by AF2.
This also means, the pLDDTs are stored in multiple files, one for each fragment. The filter_protfrags.py
utility
filters the pLDDTs to only contain non-fragmented proteins and stores the filtered pLDDTs in a ..._plddts_fltrd.json
file inside the ./data
subdirectory.
./scripts/filter_protfrags.py
The environment.yml
file can be created using this command:
conda env export | head -n -1 > environment.yml
The call to head
removes the platform-specific prefix
line generated by conda's export command.
[1] Jumper, J., Evans, R., Pritzel, A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
[2] Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, Augustin Žídek, Tim Green, Kathryn Tunyasuvunakool, Stig Petersen, John Jumper, Ellen Clancy, Richard Green, Ankur Vora, Mira Lutfi, Michael Figurnov, Andrew Cowie, Nicole Hobbs, Pushmeet Kohli, Gerard Kleywegt, Ewan Birney, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D439–D444, https://doi.org/10.1093/nar/gkab1061
[3] Ilzhoefer, D., Heinzinger, M. & Rost, B. 2022. SETH predicts nuances of residue disorder from protein embeddings, https://doi.org/10.1101/2022.06.23.497276 (Preprint)