ProtLGN is pre-trained on wild-type proteins for amino acid (AA) type denoising tasks with equivariant graph neural networks to derive the joint distribution of the recovered AA types (red).
For a protein to be mutated, the predicted probabilities indicate the fitness scores of the associated mutations (blue).
With additional mutation evaluations from wet biochemical assessments, the pre-trained model can be updated to better fit the specific protein and protein functionality (green).
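The zero-shot scoring idea described above can be sketched in a few lines. This is only an illustration and not ProtLGN's actual implementation: it assumes the model outputs one probability row per residue over a fixed 20-letter alphabet, that a mutation is written like `A2G` (wild type, 1-based position, mutant), that the score is the log-odds between the mutant and wild-type probabilities, and that multi-site mutants are joined with `:` and scored additively. The names `AMINO_ACIDS`, `score_mutation`, and `score_mutant` are hypothetical.

```python
import math

# Assumed column order of the predicted probability matrix (hypothetical).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def score_mutation(probs, sequence, mutation):
    """Log-odds score of a single substitution such as 'A2G'.

    probs[i][j] is the predicted probability of AMINO_ACIDS[j] at residue i.
    """
    wild_type, position, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
    if sequence[position - 1] != wild_type:
        raise ValueError(f"{mutation}: wild type does not match the sequence")
    row = probs[position - 1]
    p_mutant = row[AMINO_ACIDS.index(mutant)]
    p_wild_type = row[AMINO_ACIDS.index(wild_type)]
    return math.log(p_mutant) - math.log(p_wild_type)


def score_mutant(probs, sequence, mutant):
    """Sum single-site scores for a multi-site mutant such as 'M1A:A2G'."""
    return sum(score_mutation(probs, sequence, m) for m in mutant.split(":"))


if __name__ == "__main__":
    sequence = "MA"
    uniform = [1.0 / 20] * 20
    # Toy probabilities for position 2: A = 0.2, G = 0.4, rest shared equally.
    row2 = [0.4 / 18] * 20
    row2[AMINO_ACIDS.index("A")] = 0.2
    row2[AMINO_ACIDS.index("G")] = 0.4
    print(score_mutation([uniform, row2], sequence, "A2G"))
```

A positive score means the model considers the mutant residue more likely than the wild-type residue at that position; a negative score means the opposite.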
- [2024.06.06] We recently developed two more advanced protein engineering tools named ProtSSN and ProSST for zero-shot prediction. We recommend you try the new models!
Please follow these simple example steps to get started! 😊
Please make sure you have installed Anaconda3 or Miniconda3.
Environment.
conda env create -f environment.yaml
conda activate protlgn
pip install torch_scatter torch_sparse torch_cluster -f https://data.pyg.org/whl/torch-2.3.0+cu121.html
We use the CATH 4.2 dataset, which you can download from https://www.cathdb.info/.
mkdir -p data/cath_k10/raw
cd data/cath_k10/raw
wget https://huggingface.co/datasets/tyang816/cath/resolve/main/dompdb.tar
# or wget https://lianglab.sjtu.edu.cn/files/ProtSSN-2024/dompdb.tar
tar -xvf dompdb.tar
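A download can silently save an HTML page instead of the archive, for example when a URL points to a web view rather than the raw file, and `tar` then fails with an unhelpful error. The sketch below is an optional sanity check using only the Python standard library; the helper name `looks_like_tar` is hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def looks_like_tar(path):
    """Return True only if `path` exists and is a readable tar archive."""
    path = Path(path)
    return path.is_file() and tarfile.is_tarfile(str(path))


if __name__ == "__main__":
    archive = Path("data/cath_k10/raw/dompdb.tar")
    if looks_like_tar(archive):
        print(f"{archive} looks like a valid tar archive")
    else:
        print(f"{archive} is missing or is not a tar archive; re-download it")
```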
To build the CATH dataset, see script/build_cath_dataset.sh.
To pre-train the model, see run_pretrain.sh.
You can use your own checkpoint for zero-shot inference.
Data map:
|—— eval_dataset
|——|—— DATASET
|——|——|—— Protein1
|——|——|——|—— Protein1.tsv (DMS file)
|——|——|——|—— Protein1.pdb (pdb file)
|——|——|——|—— Protein1.fasta (sequence)
|——|——|—— Protein2
|——|——|——|...
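Before running prediction, it can help to confirm that a dataset folder follows the data map above. The following is a minimal sketch, assuming each protein folder must contain a `.tsv`, `.pdb`, and `.fasta` file named after the folder; the function name `find_incomplete_proteins` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File types expected in every protein folder, per the data map.
REQUIRED_SUFFIXES = (".tsv", ".pdb", ".fasta")


def find_incomplete_proteins(dataset_dir):
    """Map each protein folder name to the list of files it is missing."""
    incomplete = {}
    for protein_dir in sorted(Path(dataset_dir).iterdir()):
        if not protein_dir.is_dir():
            continue  # ignore stray files next to the protein folders
        name = protein_dir.name
        absent = [
            name + suffix
            for suffix in REQUIRED_SUFFIXES
            if not (protein_dir / (name + suffix)).is_file()
        ]
        if absent:
            incomplete[name] = absent
    return incomplete


if __name__ == "__main__":
    dataset = Path("data/example")
    if dataset.is_dir():
        for protein, files in find_incomplete_proteins(dataset).items():
            print(f"{protein}: missing {', '.join(files)}")
    else:
        print(f"{dataset} does not exist")
```

An empty result means every protein folder has all three files.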
To build the mutant dataset, see script/build_mutant_dataset.sh.
To run prediction, see script/mutant_predict.sh, for example:
CUDA_VISIBLE_DEVICES=0 python mutant_predict.py \
--checkpoint ckpt/ProtLGN.pt \
--c_alpha_max_neighbors 10 \
--gnn egnn \
--use_sasa \
--layer_num 6 \
--gnn_config src/Egnnconfig/egnn_mutant.yaml \
--mutant_dataset data/example
Please cite our paper:
@article{zhou2024protlgn,
title={Protein engineering with lightweight graph denoising neural networks},
author={Zhou, Bingxin and Zheng, Lirong and Wu, Banghao and Tan, Yang and Lv, Outongyi and Yi, Kai and Fan, Guisheng and Hong, Liang},
journal={Journal of Chemical Information and Modeling},
volume={64},
number={9},
pages={3650--3661},
year={2024},
publisher={ACS Publications}
}
@article{tan2023protssn,
title={Semantical and Topological Protein Encoding Toward Enhanced Bioactivity and Thermostability},
author={Tan, Yang and Zhou, Bingxin and Zheng, Lirong and Fan, Guisheng and Hong, Liang},
journal={bioRxiv},
pages={2023--12},
year={2023},
publisher={Cold Spring Harbor Laboratory}
}
Distributed under the MIT License. See LICENSE.txt for more information.