
Proformer: Predicting Gene Expression Using Millions of Random Promoter Sequences


DREAM Challenge 2022

Predicting gene expression using millions of random promoter sequences

Wuming Gong, Byeong-Chan Kim, Juhyun Lee, Il-Youp Kwak (team Unlock_DNA)

Abstract

The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines expression levels. In this DREAM challenge, we developed an end-to-end Transformer encoder architecture (Proformer) to predict expression values from DNA sequences. Proformer uses a Macaron-like Transformer encoder architecture, where two half-step feed-forward (FFN) layers are placed at the beginning and the end of each encoder block, and a separable 1D convolution layer is inserted after the first FFN layer and in front of the multi-head attention layer. Sliding k-mers from the one-hot encoded sequences are mapped onto a continuous embedding and combined with a learnt positional embedding and a strand embedding (forward strand vs. reverse complement strand) to form the sequence input. Proformer uses multiple expression heads, each predicting an expression value, and takes the mean of the predictions across all heads as the final predicted expression value. We empirically found that this design performed significantly better than conventional designs such as using a global pooling layer as the output layer for the regression task. We believe that Proformer provides a novel method for learning and characterizing how cis-regulatory sequences determine expression values. Proformer (team Unlock_DNA) ranked 3rd in the final standings of the DREAM challenge. The Proformer manuscript can be found here.
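The block structure described above maps naturally onto code. Below is a minimal sketch of a single Macaron-style encoder block in TensorFlow/Keras, following the order of operations from the abstract (half-step FFN → separable 1D convolution → multi-head attention → half-step FFN). The class name, the pre-norm residual layout, and all hyperparameter values are illustrative assumptions, not the repository's exact implementation:

```python
import tensorflow as tf

class MacaronEncoderBlock(tf.keras.layers.Layer):
    """One Macaron-style encoder block (illustrative sketch):
    half-step FFN -> separable 1D conv -> multi-head attention -> half-step FFN."""

    def __init__(self, d_model=256, num_heads=8, d_ff=1024, kernel_size=7, rate=0.1):
        super().__init__()
        make_ffn = lambda: tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.ffn1, self.ffn2 = make_ffn(), make_ffn()
        # Separable 1D convolution between the first FFN and the attention layer.
        self.conv = tf.keras.layers.SeparableConv1D(d_model, kernel_size, padding="same")
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.norms = [tf.keras.layers.LayerNormalization() for _ in range(4)]
        self.drop = tf.keras.layers.Dropout(rate)

    def call(self, x, training=False):
        # The 0.5 scaling on both FFN residuals is what makes them "half-step"
        # in the Macaron design.
        x = x + 0.5 * self.drop(self.ffn1(self.norms[0](x)), training=training)
        x = x + self.drop(self.conv(self.norms[1](x)), training=training)
        h = self.norms[2](x)
        x = x + self.drop(self.attn(h, h), training=training)
        x = x + 0.5 * self.drop(self.ffn2(self.norms[3](x)), training=training)
        return x
```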

Talk in RSG/DREAM 2022


The presentation can be found here in PDF or in PowerPoint format.

Proformer model

A hybrid Macaron transformer model predicts expression values from promoter sequences.

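To make the input and output design concrete, here is a hedged skeleton of how the pieces could be wired together: a width-k convolution serving as the sliding k-mer embedding, learnt positional and strand embeddings added to it, and multiple expression heads averaged into the final prediction. All names and sizes, and the assumption that each head reads from a distinct sequence position, are illustrative; consult the notebooks for the actual implementation:

```python
import tensorflow as tf

def build_proformer_skeleton(seq_len=110, k=5, d_model=256, num_expr_heads=8):
    """Skeleton wiring of the inputs and the multi-head expression output;
    the Macaron encoder stack itself is elided. All sizes, the head count,
    and the per-position head layout are illustrative assumptions."""
    onehot = tf.keras.Input((seq_len, 4), name="sequence")              # one-hot DNA
    strand = tf.keras.Input((seq_len,), dtype=tf.int32, name="strand")  # 0 fwd / 1 revcomp

    # Sliding k-mer embedding: a width-k convolution over the one-hot input
    # maps every k-mer window onto a continuous d_model-dimensional vector.
    x = tf.keras.layers.Conv1D(d_model, k, padding="same")(onehot)
    # Learnt positional and strand embeddings are added to the k-mer embedding.
    x = x + tf.keras.layers.Embedding(seq_len, d_model)(tf.range(seq_len))
    x = x + tf.keras.layers.Embedding(2, d_model)(strand)

    # ... a stack of Macaron encoder blocks (see the sketch above) goes here ...

    # Multiple expression heads: each head regresses one value and the final
    # prediction is their mean, rather than a single global pooling layer.
    preds = [tf.keras.layers.Dense(1)(x[:, i, :]) for i in range(num_expr_heads)]
    y = tf.keras.layers.Average()(preds)
    return tf.keras.Model([onehot, strand], y)
```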

  • Model checkpoint

https://s3.msi.umn.edu/gongx030/projects/dream_PGE/notebooks_msi/m20220727e/tf_ckpts.tar

  • Notebook for converting training sequences to a tf.data object

https://github.com/gongx030/dream_PGE/blob/main/prepare_tfdatasets.ipynb

  • Notebook for model training and prediction

https://github.com/gongx030/dream_PGE/blob/main/mode_training.ipynb

  • The Conda environment file

https://github.com/gongx030/dream_PGE/blob/main/tf26_py37_a100.yml

  • The JSON file for prediction

https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.json

  • The TSV file for prediction

https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv

  • Final report

https://github.com/gongx030/dream_PGE/blob/main/report.pdf

Guide to training the model

  1. Set up the hardware and the Conda environment according to the yml file.
  2. Run the notebook prepare_tfdatasets.ipynb to generate a tf.data file for all training data (a sketch of this kind of preprocessing follows this list). The resulting tf.data file can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/training_data/pct_ds=1/.
  3. Run the notebook mode_training.ipynb to train the model on the training data and make predictions on the testing data. The model was originally trained on a machine with 4 A100 GPUs and CUDA 11.7.
  4. The checkpoints can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/tf_ckpts.
  5. The final output file can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv.
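For orientation, the following is a hypothetical sketch of the kind of preprocessing performed in step 2: one-hot encoding each promoter sequence and serializing (sequence, expression) pairs as a tf.data dataset. The two-column input layout and all function names are assumptions; the actual logic lives in prepare_tfdatasets.ipynb:

```python
import tensorflow as tf

BASES = tf.constant(list("ACGT"))

def one_hot_encode(seq):
    # seq: scalar string tensor, e.g. b"ACGTACGT..."
    chars = tf.strings.bytes_split(seq)                       # (L,) byte strings
    hits = tf.cast(chars[:, None] == BASES[None, :], tf.int32)
    # NOTE: in this simplification, non-ACGT characters fall back to index 0.
    idx = tf.argmax(hits, axis=1)
    return tf.one_hot(idx, depth=4)                           # (L, 4) float32

def make_dataset(tsv_path, out_dir):
    # Assumes two tab-separated columns: promoter sequence, measured expression.
    ds = tf.data.experimental.CsvDataset(
        tsv_path, record_defaults=[tf.string, tf.float32], field_delim="\t")
    ds = ds.map(lambda seq, expr: (one_hot_encode(seq), expr),
                num_parallel_calls=tf.data.AUTOTUNE)
    tf.data.experimental.save(ds, out_dir)
```

At training time, the saved dataset can be reloaded with tf.data.experimental.load and batched before being fed to the model.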
