
Proformer: Predicting Gene Expression Using Millions of Random Promoter Sequences


DREAM Challenge 2022

Predicting gene expression using millions of random promoter sequences

Wuming Gong, Byeong-Chan Kim, Juhyun Lee, Il-Youp Kwak (team Unlock_DNA)

Abstract

The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines expression levels. In this DREAM challenge, we developed an end-to-end Transformer encoder architecture (Proformer) to predict expression values from DNA sequences. Proformer uses a Macaron-like Transformer encoder architecture, where two half-step feed-forward (FFN) layers are placed at the beginning and the end of each encoder block, and a separable 1D convolution layer is inserted after the first FFN layer and in front of the multi-head attention layer. Sliding k-mers from the one-hot encoded sequences are mapped onto a continuous embedding and combined with a learnt positional embedding and a strand embedding (forward strand vs. reverse complement strand) to form the sequence input. Proformer uses multiple expression heads, each predicting an expression value, and takes the mean of the predictions across all heads as the final predicted expression value. We empirically found that this design performed significantly better than conventional designs such as using a global pooling layer as the output layer for the regression task. We believe that Proformer provides a novel method for learning and characterizing how cis-regulatory sequences determine expression values. Proformer (team Unlock_DNA) ranked 3rd in the final standings of the DREAM challenge. The Proformer manuscript can be found here.
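The block structure described above maps naturally onto code. Below is a minimal sketch of a single Macaron-style encoder block in TensorFlow/Keras, following the order of operations from the abstract (half-step FFN → separable 1D convolution → multi-head attention → half-step FFN). The class name, the pre-norm residual layout, and all hyperparameter values are illustrative assumptions, not the repository's exact implementation:

```python
import tensorflow as tf

class MacaronEncoderBlock(tf.keras.layers.Layer):
    """One Macaron-style encoder block (illustrative sketch):
    half-step FFN -> separable 1D conv -> multi-head attention -> half-step FFN."""

    def __init__(self, d_model=256, num_heads=8, d_ff=1024, kernel_size=7, rate=0.1):
        super().__init__()
        make_ffn = lambda: tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.ffn1, self.ffn2 = make_ffn(), make_ffn()
        # Separable 1D convolution between the first FFN and the attention layer.
        self.conv = tf.keras.layers.SeparableConv1D(d_model, kernel_size, padding="same")
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.norms = [tf.keras.layers.LayerNormalization() for _ in range(4)]
        self.drop = tf.keras.layers.Dropout(rate)

    def call(self, x, training=False):
        # The 0.5 scaling on both FFN residuals is what makes them "half-step"
        # in the Macaron design.
        x = x + 0.5 * self.drop(self.ffn1(self.norms[0](x)), training=training)
        x = x + self.drop(self.conv(self.norms[1](x)), training=training)
        h = self.norms[2](x)
        x = x + self.drop(self.attn(h, h), training=training)
        x = x + 0.5 * self.drop(self.ffn2(self.norms[3](x)), training=training)
        return x
```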

Talk in RSG/DREAM 2022


The presentation can be found here in PDF or in PowerPoint format.

Proformer model

A hybrid Macaron transformer model predicts expression values from promoter sequences.

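To make the input and output design concrete, here is a hedged skeleton of how the pieces could be wired together: a width-k convolution serving as the sliding k-mer embedding, learnt positional and strand embeddings added to it, and multiple expression heads averaged into the final prediction. All names and sizes, and the assumption that each head reads from a distinct sequence position, are illustrative; consult the notebooks for the actual implementation:

```python
import tensorflow as tf

def build_proformer_skeleton(seq_len=110, k=5, d_model=256, num_expr_heads=8):
    """Skeleton wiring of the inputs and the multi-head expression output;
    the Macaron encoder stack itself is elided. All sizes, the head count,
    and the per-position head layout are illustrative assumptions."""
    onehot = tf.keras.Input((seq_len, 4), name="sequence")              # one-hot DNA
    strand = tf.keras.Input((seq_len,), dtype=tf.int32, name="strand")  # 0 fwd / 1 revcomp

    # Sliding k-mer embedding: a width-k convolution over the one-hot input
    # maps every k-mer window onto a continuous d_model-dimensional vector.
    x = tf.keras.layers.Conv1D(d_model, k, padding="same")(onehot)
    # Learnt positional and strand embeddings are added to the k-mer embedding.
    x = x + tf.keras.layers.Embedding(seq_len, d_model)(tf.range(seq_len))
    x = x + tf.keras.layers.Embedding(2, d_model)(strand)

    # ... a stack of Macaron encoder blocks (see the sketch above) goes here ...

    # Multiple expression heads: each head regresses one value and the final
    # prediction is their mean, rather than a single global pooling layer.
    preds = [tf.keras.layers.Dense(1)(x[:, i, :]) for i in range(num_expr_heads)]
    y = tf.keras.layers.Average()(preds)
    return tf.keras.Model([onehot, strand], y)
```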

  • Model checkpoint

https://s3.msi.umn.edu/gongx030/projects/dream_PGE/notebooks_msi/m20220727e/tf_ckpts.tar

  • Notebook for converting training sequences to a tf.data object

https://github.com/gongx030/dream_PGE/blob/main/prepare_tfdatasets.ipynb

  • Notebook for model training and prediction

https://github.com/gongx030/dream_PGE/blob/main/mode_training.ipynb

  • The Conda environment file

https://github.com/gongx030/dream_PGE/blob/main/tf26_py37_a100.yml

  • The JSON file for prediction

https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.json

  • The TSV file for prediction

https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv

  • Final report

https://github.com/gongx030/dream_PGE/blob/main/report.pdf

Guide to training the model

  1. Set up the hardware and the Conda environment according to the yml file.
  2. Run the notebook prepare_tfdatasets.ipynb to generate a tf.data file for all training data (a sketch of this kind of preprocessing follows this list). The resulting tf.data file can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/training_data/pct_ds=1/.
  3. Run the notebook mode_training.ipynb to train the model on the training data and make predictions on the testing data. The model was originally trained on a machine with 4 A100 GPUs and CUDA 11.7.
  4. The checkpoints can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/tf_ckpts.
  5. The final output file can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv.
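For orientation, the following is a hypothetical sketch of the kind of preprocessing performed in step 2: one-hot encoding each promoter sequence and serializing (sequence, expression) pairs as a tf.data dataset. The two-column input layout and all function names are assumptions; the actual logic lives in prepare_tfdatasets.ipynb:

```python
import tensorflow as tf

BASES = tf.constant(list("ACGT"))

def one_hot_encode(seq):
    # seq: scalar string tensor, e.g. b"ACGTACGT..."
    chars = tf.strings.bytes_split(seq)                       # (L,) byte strings
    hits = tf.cast(chars[:, None] == BASES[None, :], tf.int32)
    # NOTE: in this simplification, non-ACGT characters fall back to index 0.
    idx = tf.argmax(hits, axis=1)
    return tf.one_hot(idx, depth=4)                           # (L, 4) float32

def make_dataset(tsv_path, out_dir):
    # Assumes two tab-separated columns: promoter sequence, measured expression.
    ds = tf.data.experimental.CsvDataset(
        tsv_path, record_defaults=[tf.string, tf.float32], field_delim="\t")
    ds = ds.map(lambda seq, expr: (one_hot_encode(seq), expr),
                num_parallel_calls=tf.data.AUTOTUNE)
    tf.data.experimental.save(ds, out_dir)
```

At training time, the saved dataset can be reloaded with tf.data.experimental.load and batched before being fed to the model.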
