A Unified Framework for Protein Sequence Design from Any Protein Text Description
ProtDAT is implemented in Python 3 (>=3.9). We recommend installing the dependencies in a virtual environment. First, create the environment:
conda create -n ProtDAT python=3.9
Activate it before installing anything, so that the packages are installed into this environment:
conda activate ProtDAT
Install PyTorch by selecting the version that matches your system at https://pytorch.org/.
Then install the remaining requirements:
pip install -r requirements.txt
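Once the environment is set up, a quick check can confirm that the interpreter meets the stated Python requirement and that PyTorch is importable. The following is a minimal sketch, not part of the ProtDAT repository; the function name `check_environment` is hypothetical, and it only checks for `torch`, not for the full contents of requirements.txt.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 9), required=("torch",)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # ProtDAT states a Python >= 3.9 requirement.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    # find_spec returns None when a top-level package cannot be located.
    for name in required:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment looks OK" if not issues else "\n".join(issues))
```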
Before using ProtDAT, there are several steps:
- Download the ESM1b and PubMedBERT models and place them in the esm1b and pubmedbert subfolders within the model directory.
- Download the ProtDAT model weight file state_dict.pth
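The directory layout described in the steps above can be verified before running generation. This is a minimal sketch that assumes the model directory sits in the current working directory; the helper name `missing_model_dirs` is hypothetical, and it does not check for state_dict.pth because its expected location is not stated here.

```python
from pathlib import Path


def missing_model_dirs(model_root: str = "model") -> list[str]:
    """Return the expected subfolders that are absent under model_root."""
    # Subfolder names taken from the setup steps: model/esm1b and model/pubmedbert.
    expected = ("esm1b", "pubmedbert")
    root = Path(model_root)
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing subfolders under model/:", ", ".join(missing))
    else:
        print("Both model subfolders are present.")
```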
To generate protein sequences one at a time or in batches, refer to gen_single_seq.py and gen_batch_seqs.py, respectively.
Example protein sequences and text descriptions are in the data directory. For example:
Description: FUNCTION: Component of the acetyl coenzyme A carboxylase complex. SUBCELLULAR LOCATION: Cytoplasm. SIMILARITY: Belongs to the AccA family.
Sequence: MAVSDRKLQLLDFEKPLAELEDRIEQIRSLSEQNGVDVTDQIAQLEGRAEQLRQEIFSSLTPMQELQLARHPRRPSTLDYIHAISDEWMELHGDRRGYDDPAIVGGVGRIGGQPVLMLGHQKGRDTKDNVARNFGMPFPSGYRKAMRL...
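Descriptions like the one above are built from labeled fields (FUNCTION, SUBCELLULAR LOCATION, SIMILARITY). When preparing or inspecting descriptions, it can help to split them into those fields. The sketch below assumes every field follows the `LABEL: text` pattern seen in this example, with labels written in upper case; the function name `split_description` is hypothetical and is not part of the ProtDAT code.

```python
import re

# An upper-case label (letters and spaces, at least two characters) followed by a colon.
FIELD_PATTERN = re.compile(r"([A-Z][A-Z ]+):\s*")


def split_description(description: str) -> dict[str, str]:
    """Split a 'LABEL: text LABEL: text' description into a {label: text} dict."""
    # re.split with one capture group yields [prefix, label1, text1, label2, text2, ...].
    parts = FIELD_PATTERN.split(description.strip())
    fields = {}
    for label, text in zip(parts[1::2], parts[2::2]):
        fields[label.strip()] = text.strip()
    return fields


if __name__ == "__main__":
    example = (
        "FUNCTION: Component of the acetyl coenzyme A carboxylase complex. "
        "SUBCELLULAR LOCATION: Cytoplasm. "
        "SIMILARITY: Belongs to the AccA family."
    )
    for label, text in split_description(example).items():
        print(f"{label} -> {text}")
```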
In the generation code, the seq argument shown below determines whether generation is guided by text alone or by a combination of text and a sequence fragment.
seq=None, # Only protein descriptions guide the generation process
seq=tokenized_seqs['input_ids'][...,:1].to(device), # Both sequence fragments and descriptions guide the generation process
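To make the two options concrete: the slice `[..., :1]` keeps only the first token position of every tokenized sequence in the batch, while `None` supplies no sequence at all. The sketch below mimics that choice with plain Python lists standing in for tensors, so it runs without PyTorch. The helper name `sequence_prompt`, the `prefix_len` parameter, and the token IDs are illustrative assumptions, not part of the ProtDAT scripts.

```python
from typing import Optional


def sequence_prompt(input_ids: list[list[int]], prefix_len: int = 0) -> Optional[list[list[int]]]:
    """Return None for text-only guidance, else the first prefix_len token IDs of each row."""
    if prefix_len <= 0:
        # Text-only: no sequence fragment is passed to the generator.
        return None
    # List equivalent of the tensor slice input_ids[..., :prefix_len].
    return [row[:prefix_len] for row in input_ids]


if __name__ == "__main__":
    batch = [[0, 20, 5, 7], [0, 11, 9, 4]]  # made-up token IDs
    print(sequence_prompt(batch))            # text-only guidance
    print(sequence_prompt(batch, 1))         # first token of each sequence
```

A longer fragment can be kept by increasing `prefix_len`, which corresponds to widening the slice in the original call.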
You can build a custom protein text-sequence dataset with a specific pattern and train a model on it using the architecture in Decoder.py.
If you find ProtDAT useful, please cite the relevant paper:
@article{guo2024protdat,
title={ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description},
author={Guo, Xiao-Yu and Li, Yi-Fan and Liu, Yuan and Pan, Xiaoyong and Shen, Hong-Bin},
journal={arXiv preprint arXiv:2412.04069},
year={2024}
}
- The ProtDAT source code is licensed under the MIT license.
- The ESM1b model can be found at ESM1b, which is under the MIT license.
- The PubMedBERT model can be found at PubMedBERT, which is under the Apache License 2.0.
The ProtDAT parameters are made available under a Creative Commons Attribution 4.0 International License.