PEDL

PEDL is a method for predicting protein-protein associations from text. The paper describing it will be presented at ISMB 2020.

Requirements

python >= 3.6
pip install -r requirements.txt
pytorch >= 1.3.1 (has to be installed manually, due to different CUDA versions)
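The version requirements above can be checked before running anything expensive. The following is a minimal sketch, not part of the repository: the helper names (parse_version, meets_minimum, check_environment) are hypothetical, and the only facts taken from this README are the minimum versions (Python 3.6, PyTorch 1.3.1). Versions are compared as integer tuples so that, for example, 1.10.0 is correctly treated as newer than 1.3.1.

```python
import re
import sys


def parse_version(text):
    """Turn '1.3.1' or '1.4.0+cu92' into a tuple of ints such as (1, 3, 1)."""
    # Drop local build tags such as '+cu92' before splitting on dots.
    core = text.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python >= 3.6 is required")
    try:
        import torch  # installed manually, see the note on CUDA versions
    except ImportError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch.__version__, "1.3.1"):
            problems.append("PyTorch >= 1.3.1 is required, found " + torch.__version__)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints one line per unmet requirement and nothing if the environment looks suitable.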

Generate data

We use two types of data sets: data generated from the BioNLP-ST event extraction data sets, and the distantly supervised PID data set.

Generate BioNLP

./conversion/make_bionlp_data.sh generates the BioNLP data sets for both PEDL and comb-dist.

All experiments in the paper have been performed with the masked version of the data, e.g. distant_supervision/data/BioNLP-ST_2011/train_masked.json.

Generate PID

Generating the PID data is a bit more involved:

First, we have to download the raw PubMed Central texts: python download_pmc.py. CAUTION: This produces over 200 GB of files and spawns multiple processes.
Then, we have to download the PubTator Central file (ftp://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/bioconcepts2pubtatorcentral.offset.gz) and place it into the root directory. This file consumes another 80 GB when decompressed.
Generate the raw PID data: ./conversion/generate_raw_pid.sh
Generate the final PID data: ./conversion_make_pid.sh
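Because the steps above write several hundred gigabytes to disk, it may be worth checking free space first. The sketch below is an illustration and not part of the repository: the function names are hypothetical, and the 280 GB figure is only the sum of the two sizes quoted above (over 200 GB for the PubMed Central texts plus 80 GB for the decompressed PubTator Central file), so treat it as a lower bound.

```python
import shutil

GB = 10 ** 9  # decimal gigabytes

# Lower bound taken from the sizes quoted in this README:
# > 200 GB of PMC texts + 80 GB of decompressed PubTator Central data.
REQUIRED_GB = 200 + 80


def free_gb(path="."):
    """Return the free space in GB on the file system that contains path."""
    return shutil.disk_usage(path).free / GB


def has_room_for(required_gb, path="."):
    """Return True if at least required_gb of free space is available."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    if has_room_for(REQUIRED_GB):
        print("Enough free space for the PID data generation.")
    else:
        print("Only %.0f GB free, at least %d GB needed." % (free_gb(), REQUIRED_GB))
```

Run it from the repository root, or pass the directory that will hold the downloads as path, since the files may end up on a different file system.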

Training PEDL

Before training, SciBERT has to be downloaded and placed in a directory of your choice (called $bert_dir from now on).

The vocabulary of SciBERT has to be adapted to include the entity markers and protein masks: cp distant_supervision/vocab.txt $bert_dir
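The cp command above overwrites the vocabulary shipped with SciBERT. As a hedged alternative, the following sketch performs the same copy but keeps a backup of the original file first. The function name install_vocab and the backup name vocab.txt.orig are hypothetical and not part of the repository; the sketch also assumes that the SciBERT directory stores its vocabulary as vocab.txt, which is what the cp command implies.

```python
import shutil
from pathlib import Path


def install_vocab(source_vocab, bert_dir):
    """Copy the adapted vocabulary into bert_dir, backing up the original.

    Equivalent to `cp source_vocab bert_dir`, except that an existing
    vocab.txt is first saved as vocab.txt.orig (only once, so repeated
    runs never overwrite the original SciBERT vocabulary backup).
    """
    source_vocab = Path(source_vocab)
    target = Path(bert_dir) / "vocab.txt"
    if target.exists():
        backup = target.with_name("vocab.txt.orig")
        if not backup.exists():
            shutil.copy2(str(target), str(backup))
    shutil.copy2(str(source_vocab), str(target))
    return target
```

A call such as install_vocab("distant_supervision/vocab.txt", bert_dir), with bert_dir set to your SciBERT directory, then mirrors the command given above.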

PEDL can be trained with python -m distant_supervision.train_pedl (see train_pedl.sh for suitable arguments).

If you just want to reproduce the experiments from the paper, this can be achieved with ./train_pedl.sh.

Pretrained model

As an alternative to training your own model, you can use this version of PEDL that was trained on PID and used for the experiments in the paper.

Predicting with PEDL

The trained PEDL model can be used to predict protein-protein associations (PPAs) for a new data set. See predict_pedl.sh for details.

Disclaimer

Note that this is highly experimental research code that is not suitable for production use. We do not provide a warranty of any kind. Use at your own risk.
