Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction
http://kcdukkalab.org/LMOGlcNAcSite/
Suresh Pokharel1, Pawel Pratyush1, Hamid D. Ismail1, Junfeng Ma2, Dukka B KC1*
1Department of Computer Science, Michigan Technological University, Houghton, MI, USA
2Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Georgetown University, Washington, DC 20057, USA
* Corresponding Author: [email protected]
If Git is installed on your system, clone the repository by running the following command in your terminal:
git clone [email protected]:KCLabMTU/LM-OGlcNAc-Site.git
If you do not have Git or prefer to download directly, download the repository from GitHub as a zip file.
Python version: 3.10.0
To install the required libraries, run the following command:
pip install -r requirements.txt
Required libraries and versions:
ankh==1.10.0
Bio==1.7.0
biopython==1.83
datasets==2.19.0
fair_esm==2.0.0
keras==2.8.0
numpy==1.26.4
pandas==2.2.2
protobuf==3.20.*
scikit_learn==1.4.2
scipy==1.13.0
tensorflow==2.8.0
torch==2.3.0
tqdm==4.66.2
transformers==4.40.1
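Because the versions above are pinned, it can help to confirm that the active environment matches them before running the predictor. The following is a minimal sketch using only the Python standard library; the `PINNED` dictionary and the `check_pins` helper are hypothetical names introduced for illustration and are not part of the repository. Only a few of the pins are copied here; extend the dictionary as needed.

```python
from importlib.metadata import PackageNotFoundError, version

# A few pins copied from the list above (hypothetical subset; extend as needed).
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "tensorflow": "2.8.0",
    "torch": "2.3.0",
    "transformers": "4.40.1",
}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for each mismatch.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.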
To predict O-GlcNAc sites in your own sequences, follow these steps:
- Copy the sequences you want to predict to input/sequence.fasta
- Run python predict.py
- Find the results inside the output folder, in a CSV file named results.csv
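Before running the predictor, it may be worth checking that input/sequence.fasta parses cleanly. The sketch below is a standard-library FASTA reader; `parse_fasta` and `count_st` are hypothetical helpers, not functions from this repository. O-GlcNAc is attached to serine (S) and threonine (T) residues, so the sketch also counts them per sequence; whether predict.py scores every such position is an assumption.

```python
from pathlib import Path


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def count_st(sequence):
    """Count serine and threonine residues in a sequence."""
    return sum(1 for residue in sequence if residue in "ST")


if __name__ == "__main__":
    fasta_path = Path("input/sequence.fasta")
    if fasta_path.exists():
        for header, sequence in parse_fasta(fasta_path.read_text()):
            print(f"{header}: length {len(sequence)}, S/T residues {count_st(sequence)}")
```

A record with an empty sequence or a file that raises `ValueError` here is worth fixing before calling predict.py.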
Use the following command to specify the input and output files:
python predict.py --input [input_path] --output [output_path]
or, in short-form notation,
python predict.py -i [input_path] -o [output_path]
Replace:
[input_path]
with the path of the input file you want to run the model on (must be a .fasta file)
[output_path]
with the path of the output file where the results should be written (must be a .csv file)
Example:
python predict.py -i input.fasta -o output.csv
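When the predictor is called from another script, the same command can be assembled programmatically. This is a minimal sketch that uses only the `-i` and `-o` flags and the .fasta/.csv extension requirements described above; `build_command` is a hypothetical helper, and the extension check is done here in the wrapper, not by predict.py itself.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_path, output_path, script="predict.py"):
    """Return the argument list for predict.py after checking file extensions."""
    if Path(input_path).suffix != ".fasta":
        raise ValueError(f"input must be a .fasta file: {input_path}")
    if Path(output_path).suffix != ".csv":
        raise ValueError(f"output must be a .csv file: {output_path}")
    # sys.executable reuses the interpreter running this script,
    # so the same environment and installed libraries are used.
    return [sys.executable, script, "-i", str(input_path), "-o", str(output_path)]


if __name__ == "__main__":
    # Run only when both the script and the example input are present.
    if Path("predict.py").exists() and Path("input.fasta").exists():
        subprocess.run(build_command("input.fasta", "output.csv"), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.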
Note:
- You can always use the '-h' or '--help' flag to get detailed information about the available command-line arguments.
- You may also use the web server at http://kcdukkalab.org/LMOGlcNAcSite/
Pokharel, S.; Pratyush, P.; Ismail, H.D.; Ma, J.; KC, D.B. Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction. Int. J. Mol. Sci. 2023, 24, 16000. https://doi.org/10.3390/ijms242116000
Paper Link: https://www.mdpi.com/1422-0067/24/21/16000
Please send an email to [email protected] (CC: [email protected], [email protected]) with any queries or discussion.