ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning

This repository is the official implementation of ProTrek: Navigating the Protein Universe through Tri-Modal Contrastive Learning.

ProTrek4Search (old): https://huggingface.co/spaces/westlake-repl/Demo_ProTrek_650M_UniRef50

ProTrek4Search (new): http://search-protrek.com/ (Update: the OMG database has been added)

ColabProTrek: https://colab.research.google.com/drive/1On2xQU0d7351bIBgZpz2T0VUp2gZium0?usp=sharing

ColabProTrek has joined OPMC. If you find it useful for your research, please also consider citing the original OPMC paper.

If you have any questions about the paper or the code, feel free to raise an issue!


Overview

ProTrek is a tri-modal protein language model that jointly models protein sequence, structure and function (SSF). It employs contrastive learning with three core alignment strategies: (1) using structure as the supervision signal for AA sequences and vice versa, (2) mutual supervision between sequences and functions, and (3) mutual supervision between structures and functions. This tri-modal alignment training enables ProTrek to tightly associate SSF by bringing genuine sample pairs (sequence-structure, sequence-function, and structure-function) closer together while pushing negative samples farther apart in the latent space.

ProTrek achieves over 30x and 60x improvements in sequence-to-function and function-to-sequence retrieval, respectively, runs 100x faster than Foldseek and MMseqs2 in protein-protein search, and outperforms ESM-2 in 9 of 11 downstream prediction tasks.
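
To make the alignment objective above concrete, below is a minimal sketch of one of the three pairwise contrastive losses, written in the standard symmetric InfoNCE form commonly used for two-modality alignment. This illustrates the general technique rather than ProTrek's actual training code; the function and variable names are ours.

import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(emb_a, emb_b, temperature=0.07):
    # emb_a, emb_b: [batch, dim] L2-normalized embeddings of genuine pairs
    # (e.g. sequence/structure, sequence/function, or structure/function)
    logits = emb_a @ emb_b.T / temperature                    # [batch, batch] similarity matrix
    targets = torch.arange(len(emb_a), device=emb_a.device)   # diagonal entries are the true pairs
    # Symmetric cross-entropy: pull matched pairs together in the latent
    # space while pushing mismatched pairs apart, in both directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2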

Environment installation

Create a virtual environment

conda create -n protrek python=3.10 --yes
conda activate protrek

Clone the repo and install packages

bash environment.sh  

Download model weights

ProTrek provides pre-trained models in two sizes (35M and 650M), as shown below. For each pre-trained model, please download all files and put them in the weights directory, e.g. weights/ProTrek_35M_UniRef50/....

| Name | Size (protein sequence encoder) | Size (protein structure encoder) | Size (text encoder) | Dataset |
|------|---------------------------------|----------------------------------|---------------------|---------|
| ProTrek_35M_UniRef50 | 35M parameters | 35M parameters | 130M parameters | Swiss-Prot + UniRef50 |
| ProTrek_650M_UniRef50 | 650M parameters | 150M parameters | 130M parameters | Swiss-Prot + UniRef50 |
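
For reference, the expected layout of the 650M weights directory, inferred from the paths used in the example code later in this README (the Hugging Face repository contains the authoritative file list):

weights/ProTrek_650M_UniRef50/
├── ProTrek_650M_UniRef50.pt
├── esm2_t33_650M_UR50D/
├── BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext/
└── foldseek_t30_150M/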

Below is an example of downloading the pre-trained model weights.

huggingface-cli download westlake-repl/ProTrek_650M_UniRef50 \
                         --repo-type model \
                         --local-dir weights/ProTrek_650M_UniRef50

Note: if you cannot access the Hugging Face website, you can connect to the mirror site by running "export HF_ENDPOINT=https://hf-mirror.com".
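
If you prefer downloading from Python, the same files can be fetched with the huggingface_hub package. This is a standard alternative we suggest here, assuming huggingface_hub is installed; the mirror endpoint can be set through the same HF_ENDPOINT environment variable.

from huggingface_hub import snapshot_download

# Download every file of the model repo into the local weights directory
snapshot_download(
    repo_id="westlake-repl/ProTrek_650M_UniRef50",
    repo_type="model",
    local_dir="weights/ProTrek_650M_UniRef50",
)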

Download Foldseek binary file

To run the examples correctly and deploy your demo locally, please first download the Foldseek binary file from here and place it in the bin folder. Then add execute permission to the binary file.

chmod +x bin/foldseek
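
As a quick sanity check (our suggestion, not part of the original instructions), you can confirm that the binary runs on your machine; it should print Foldseek's usage text.

bin/foldseek --help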

Obtain embeddings and calculate similarity scores

Below is an example of how to obtain embeddings and calculate similarity scores using the pre-trained ProTrek model.

import torch

from model.ProTrek.protrek_trimodal_model import ProTrekTrimodalModel
from utils.foldseek_util import get_struc_seq

# Load model
config = {
    "protein_config": "weights/ProTrek_650M_UniRef50/esm2_t33_650M_UR50D",
    "text_config": "weights/ProTrek_650M_UniRef50/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    "structure_config": "weights/ProTrek_650M_UniRef50/foldseek_t30_150M",
    "load_protein_pretrained": False,
    "load_text_pretrained": False,
    "from_checkpoint": "weights/ProTrek_650M_UniRef50/ProTrek_650M_UniRef50.pt"
}

device = "cuda"
model = ProTrekTrimodalModel(**config).eval().to(device)

# Load protein and text
pdb_path = "example/8ac8.cif"
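# get_struc_seq runs Foldseek on the structure file and, for each requested
# chain, returns the amino-acid sequence and the Foldseek structural sequence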
seqs = get_struc_seq("bin/foldseek", pdb_path, ["A"])["A"]
aa_seq = seqs[0]
foldseek_seq = seqs[1].lower()
text = "Replication initiator in the monomeric form, and autogenous repressor in the dimeric form."

with torch.no_grad():
    # Obtain protein sequence embedding
    seq_embedding = model.get_protein_repr([aa_seq])
    print("Protein sequence embedding shape:", seq_embedding.shape)
    
    # Obtain protein structure embedding
    struc_embedding = model.get_structure_repr([foldseek_seq])
    print("Protein structure embedding shape:", struc_embedding.shape)
    
    # Obtain text embedding
    text_embedding = model.get_text_repr([text])
    print("Text embedding shape:", text_embedding.shape)
    
    # Calculate similarity score between protein sequence and structure
    seq_struc_score = seq_embedding @ struc_embedding.T / model.temperature
    print("Similarity score between protein sequence and structure:", seq_struc_score.item())

    # Calculate similarity score between protein sequence and text
    seq_text_score = seq_embedding @ text_embedding.T / model.temperature
    print("Similarity score between protein sequence and text:", seq_text_score.item())
    
    # Calculate similarity score between protein structure and text
    struc_text_score = struc_embedding @ text_embedding.T / model.temperature
    print("Similarity score between protein structure and text:", struc_text_score.item())
   

"""
Protein sequence embedding shape: torch.Size([1, 1024])
Protein structure embedding shape: torch.Size([1, 1024])
Text embedding shape: torch.Size([1, 1024])
Similarity score between protein sequence and structure: 28.506675720214844
Similarity score between protein sequence and text: 17.842409133911133
Similarity score between protein structure and text: 11.866174697875977
"""

Deploy your demo locally

We provide an online demo for ProTrek. For users who want to deploy the demo locally, please follow the steps below.

Step 1: Download the Foldseek binary file

Please follow the instructions in the Download Foldseek binary file section.

Step 2: Download the pre-trained model weights

Currently we support the deployment of ProTrek_650M_UniRef50. Please download all files and put them in the weights directory, e.g. weights/ProTrek_650M_UniRef50/.... The example code is in the Download model weights section.

Step 3: Download pre-computed faiss index

We provide pre-computed protein and text embeddings generated with ProTrek_650M_UniRef50, along with a pre-built faiss index for fast similarity search. Please download the pre-computed faiss index from here and put it in the weights/faiss_index directory, e.g. weights/faiss_index/faiss_index_ProTrek_650M_UniRef50/.... Below is an example of downloading the pre-computed faiss index.

huggingface-cli download westlake-repl/faiss_index_ProTrek_650M_UniRef50 \
                         --repo-type dataset \
                         --local-dir weights/faiss_index/faiss_index_ProTrek_650M_UniRef50
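
If you want programmatic access to the index outside the demo, it can be queried directly with the faiss library. A minimal sketch, assuming the model from the earlier example is loaded; the index file name below is a placeholder, so check the downloaded directory for the actual layout.

import faiss
import numpy as np
import torch

# Load a pre-computed index ("protein.index" is a hypothetical file name)
index = faiss.read_index(
    "weights/faiss_index/faiss_index_ProTrek_650M_UniRef50/protein.index"
)

# Embed a free-text function query with ProTrek's text encoder
with torch.no_grad():
    query = model.get_text_repr(["ATP-dependent DNA helicase activity"])
query = query.cpu().numpy().astype(np.float32)

# Retrieve the 10 nearest protein embeddings
scores, ids = index.search(query, 10)
print(ids[0])  # positions of the top hits in the indexed database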

Step 4: Run the demo

After all data and files are prepared, you can run the demo by executing the following command.

python demo/run.py