A Unified Framework for Protein Sequence Design from Any Protein Text Description

GXY0116/ProtDAT

ProtDAT

A Unified Framework for Protein Sequence Design from Any Protein Text Description

Preparation

ProtDAT requires Python 3 (>=3.9). We recommend installing the dependencies in a virtual environment. Create the environment and activate it:

conda create -n ProtDAT python=3.9
conda activate ProtDAT

Install PyTorch by selecting the build that matches your system at https://pytorch.org/, then install the remaining requirements:

pip install -r requirements.txt

Download models and files

Before using ProtDAT, complete the following steps:

  1. Download the ESM1b and PubMedBERT models and place them in the esm1b and pubmedbert subfolders of the model directory.
  2. Download the ProtDAT model weight file state_dict.pth.
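
A quick check of the layout described above can catch a misplaced download before generation starts. The following is a minimal sketch and not part of the repository: the helper name check_protdat_files is hypothetical, and the default location of state_dict.pth (the repository root) is an assumption, since the steps above do not say where the file should go.

```python
from pathlib import Path


def check_protdat_files(root=".", weights="state_dict.pth"):
    """Return a list of expected paths that are missing under `root`.

    Hypothetical helper: `weights` is an assumed location for the
    ProtDAT weight file, relative to `root`.
    """
    root = Path(root)
    expected = [
        root / "model" / "esm1b",       # ESM1b files
        root / "model" / "pubmedbert",  # PubMedBERT files
        root / weights,                 # ProtDAT weights (assumed location)
    ]
    missing = []
    for path in expected:
        # Model folders must be non-empty directories; the weights must be a file.
        if path.suffix == ".pth":
            ok = path.is_file()
        else:
            ok = path.is_dir() and any(path.iterdir())
        if not ok:
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    for path in check_protdat_files():
        print(f"missing: {path}")
```

Run from the repository root, the script prints nothing when all three items are in place. Pass a different weights path if you keep state_dict.pth elsewhere.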

Usage

Generate protein sequences with protein descriptions (and protein sequence fragments)

To generate protein sequences one at a time or in batches, refer to gen_single_seq.py and gen_batch_seqs.py, respectively.

Example protein sequences and text descriptions are provided in the data directory. For example:

Description: FUNCTION: Component of the acetyl coenzyme A carboxylase complex. SUBCELLULAR LOCATION: Cytoplasm. SIMILARITY: Belongs to the AccA family.
Sequence: MAVSDRKLQLLDFEKPLAELEDRIEQIRSLSEQNGVDVTDQIAQLEGRAEQLRQEIFSSLTPMQELQLARHPRRPSTLDYIHAISDEWMELHGDRRGYDDPAIVGGVGRIGGQPVLMLGHQKGRDTKDNVARNFGMPFPSGYRKAMRL...
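
The example description is made of labelled fields (FUNCTION, SUBCELLULAR LOCATION, SIMILARITY). When preparing or inspecting descriptions, it can help to split them into those fields. The sketch below is illustrative only and is not taken from the repository: split_description is a hypothetical helper, and it assumes every label is written in upper case and followed by a colon, as in the example above.

```python
import re

# Field labels as they appear in the example description: upper-case words
# (possibly several) followed by a colon, e.g. "SUBCELLULAR LOCATION:".
FIELD_LABEL = re.compile(r"([A-Z][A-Z ]*[A-Z]):\s*")


def split_description(description):
    """Split a description string into a {label: text} dictionary."""
    parts = FIELD_LABEL.split(description.strip())
    # With one capture group, re.split yields
    # [prefix, label1, text1, label2, text2, ...].
    labels, texts = parts[1::2], parts[2::2]
    return {label: text.strip() for label, text in zip(labels, texts)}


example = (
    "FUNCTION: Component of the acetyl coenzyme A carboxylase complex. "
    "SUBCELLULAR LOCATION: Cytoplasm. "
    "SIMILARITY: Belongs to the AccA family."
)

for label, text in split_description(example).items():
    print(f"{label} -> {text}")
```

The pattern is deliberately simple: an upper-case abbreviation followed by a colon inside the free text would be mistaken for a label, so descriptions in other styles may need a stricter rule.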

In the generation code, the seq argument determines whether generation is guided by text alone or by a combination of text and a sequence fragment:

seq=None,                                           # Only protein descriptions guide the generation process
seq=tokenized_seqs['input_ids'][...,:1].to(device), # Both sequence fragments and descriptions guide the generation process
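
In the second option, the slice [..., :1] keeps only the first token of each tokenized sequence. The sketch below illustrates that choice with plain Python lists so it runs without PyTorch. It is an assumption-laden illustration, not the repository's code: build_seq_prompt and prefix_len are hypothetical names, the token ids are made up, and whether longer fragments behave as expected should be checked against gen_single_seq.py and gen_batch_seqs.py.

```python
def build_seq_prompt(input_ids, prefix_len=None):
    """Return the value to pass as `seq`: None, or the first tokens of each row.

    `input_ids` is a batch of token-id lists; `prefix_len` is a hypothetical
    parameter naming how many leading tokens to keep.
    """
    if prefix_len is None:
        return None  # text-only guidance
    if prefix_len < 1:
        raise ValueError("prefix_len must be at least 1")
    # Same effect as input_ids[..., :prefix_len] on a 2-D tensor.
    return [row[:prefix_len] for row in input_ids]


batch = [[0, 20, 5, 7, 2], [0, 11, 9, 4, 2]]  # made-up token ids
print(build_seq_prompt(batch))                # text only
print(build_seq_prompt(batch, prefix_len=1))  # first token of each sequence
print(build_seq_prompt(batch, prefix_len=3))  # a longer fragment
```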

Train a new model based on ProtDAT

You can build a custom protein text-sequence dataset with a specific pattern and train a model on it using the architecture in Decoder.py.

Citations

If you find ProtDAT useful, cite the relevant paper:

@article{guo2024protdat,
  title={ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description},
  author={Guo, Xiao-Yu and Li, Yi-Fan and Liu, Yuan and Pan, Xiaoyong and Shen, Hong-Bin},
  journal={arXiv preprint arXiv:2412.04069},
  year={2024}
}

License

Code License

  1. The ProtDAT source code is licensed under the MIT license.
  2. The ESM1b model is licensed under the MIT license.
  3. The PubMedBERT model is licensed under the Apache License 2.0.

Model Parameters License

The ProtDAT parameters are made available under a Creative Commons Attribution 4.0 International License.
