Multilayer modelling of the human transcriptome and biological mechanisms of complex diseases and traits
Tiago Azevedo, Giovanna Maria Dimitri, Pietro Lió, Eric R. Gamazon
This repository contains all the code necessary to run and further extend the experiments presented in the following paper accepted at npj Systems Biology and Applications: https://doi.org/10.1038/s41540-021-00186-6.
Here, we performed a comprehensive intra-tissue and inter-tissue multilayer network analysis of the human transcriptome. We generated an atlas of communities in gene co-expression networks in 49 tissues (GTEx v8), evaluated their tissue specificity, and investigated their methodological implications. UMAP embeddings of gene expression from the communities (representing nearly 18% of all genes) robustly identified biologically-meaningful clusters. Notably, new gene expression data can be embedded into our algorithmically derived models to accelerate discoveries in high-dimensional molecular datasets and downstream diagnostic or prognostic applications. We demonstrate the generalisability of our approach through systematic testing in external genomic and transcriptomic datasets. Methodologically, prioritisation of the communities in a transcriptome-wide association study of the biomarker C-reactive protein (CRP) in 361,194 individuals in the UK Biobank identified genetically-determined expression changes associated with CRP and led to considerably improved performance. Furthermore, a deep learning framework applied to the communities in nearly 11,000 tumors profiled by The Cancer Genome Atlas across 33 different cancer types learned biologically-meaningful latent spaces, representing metastasis and stemness. Our study provides a rich genomic resource to catalyse research into inter-tissue regulatory mechanisms, and their downstream consequences on human disease.
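The claim above that new gene expression data can be embedded into a previously fitted model follows the standard fit/transform pattern. The paper uses UMAP (`umap-learn` is pinned in `python_environment.yml`, and `umap.UMAP` exposes a scikit-learn-style API); the sketch below uses scikit-learn's PCA as a stand-in so it runs without extra dependencies. All variable names and shapes are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Illustrative stand-ins: a reference expression matrix (samples x genes)
# and a batch of new samples to embed into the same fitted model.
X_ref = rng.normal(size=(200, 50))
X_new = rng.normal(size=(10, 50))

model = PCA(n_components=2).fit(X_ref)  # fit on the reference data only
emb_new = model.transform(X_new)        # embed new samples without refitting

print(emb_new.shape)  # (10, 2)
```

With `umap-learn` installed, `PCA(n_components=2)` can be swapped for `umap.UMAP(n_components=2)` and the fit/transform calls stay the same.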
This repository contains all the scripts which were used in the paper. The number in each script's name (in the root of this repository) corresponds to the order in which they are run in the paper.
Each folder used in this repository is explained as follows:
- `meta_data`: Complementary files to GTEx necessary to run some experiments, such as phenotype information, gene-name conversion tables, and Reactome pathway identification.
- `outputs`: Outputs of some of the numbered scripts. Some contain important information used in the paper, for example the files `output_02_01.txt`, `output_04_02.txt`, and `output_06_02.txt`.
- `results`: Some result files, like the communities identified by the Louvain algorithm.
- `svm_results`: Files with the metrics resulting from the SVM predictions.
- `track_hub`: Files in the format required for Track Hub.
The repository also contains some Jupyter notebooks, which we hope will help researchers use our results in their own experiments and improve the reproducibility of this paper:
- `09_community_info.ipynb`: Instructions on how to inspect the characterisation of each community, including generation of LaTeX code.
- `10_reactomes_per_tissue.ipynb`: Instructions on how to check which Reactome pathways were able to predict each tissue.
- `11_multiplex_enrichment.ipynb`: Instructions on how to check the group of genes identified in each multiplex network.
- `12_tcga.ipynb`: The code used in the paper to analyse the TCGA dataset within the paper's GTEx pipeline, together with a targeted R script (`12_01_correct_confounds_tcga.R`) used to correct the data.
- `13_plots_for_paper.ipynb`: The code used to generate the plots in the paper.
- `14_track_hub.ipynb`: Code and explanations of how we generated the files needed for Track Hub.
These scripts were tested on Ubuntu 16.04 Linux, with environments created using Anaconda. We include a working dependency file, `python_environment.yml`, describing the exact dependencies used to run the Python scripts. To install all of them automatically with Anaconda, run the following commands in the terminal to create and activate the environment:
$ conda env create --force --file python_environment.yml
$ conda activate gtex-env
To summarise the `python_environment.yml` file, the main dependencies needed to run these scripts are:
- gseapy 0.9.16
- jupyterlab 1.1.4
- matplotlib 3.1.0
- networkx 2.4
- numpy 1.17.3
- pandas 1.0.1
- python 3.7.5
- scikit-learn 0.21.3
- statsmodels 0.10.2
- umap-learn 0.4.2
- bctpy 0.5.0
We used R to run sva, an unsupervised confound-correction package briefly described in the paper. For convenience, we also include the Anaconda specification of the R dependencies with which these scripts were run. As with Python, they can be installed using the following commands:
$ conda env create --force --file r_environment.yml
$ conda activate r_env
After the R environment is created and activated, install the sva package as described in its original repository:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("sva")
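For intuition only: sva estimates surrogate variables for hidden confounds and removes their effect from the expression matrix. The sketch below shows just the removal step (residualising expression on known confounds via least squares) in Python with made-up data; it is not the sva algorithm itself, which also has to estimate the surrogate variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, n_confounds = 60, 100, 3

# Made-up data: known confounds, plus expression contaminated by them.
confounds = rng.normal(size=(n_samples, n_confounds))
expr = (rng.normal(size=(n_samples, n_genes))
        + confounds @ rng.normal(size=(n_confounds, n_genes)))

# Regress every gene on the confounds (with intercept) and keep residuals.
X = np.hstack([np.ones((n_samples, 1)), confounds])
beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
corrected = expr - X @ beta

# OLS residuals are (numerically) orthogonal to the confounds.
print(np.abs(confounds.T @ corrected).max() < 1e-8)  # True
```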
We kept the Python and R scripts in separate environments to avoid dependency conflicts between the two languages.
For details on the data used, which cannot be publicly shared in this repository, please see the paper.
To view and run our Jupyter notebooks, start JupyterLab in the root of this repository:
$ jupyter lab --port=8895
This command will start a local server on port 8895 and print a link that can be opened in a browser.
The Python scripts in this repository are numbered, indicating the order in which they should be executed; each script also contains a short description at the top of the file explaining what it does. To run the Python scripts from `01` to `04_02`, as well as `06_02`, run the following command:
$ python -u PYTHON_FILE | tee outputs/output_file.txt
The previous command runs `PYTHON_FILE` and logs its output in `outputs/output_file.txt`. All the other Python scripts expect one or two flags to be passed. Information about each flag can be found in the `parser.add_argument` calls in each file, which include a short description of what the flag means. For example, Python script `05_01` expects the flag `--tissue_num`, which therefore needs to be passed when executing the script:
$ python -u 05_01_svms_communities.py --tissue_num NUM | tee outputs/output_05_01_NUM.txt
where `NUM` corresponds to the value for that flag.
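The flag-handling pattern described above can be sketched with a minimal, hypothetical parser. Only `--tissue_num` is taken from the text; the real scripts define their own arguments and help strings.

```python
import argparse

# Minimal stand-in for the argument parsing used by the numbered scripts.
parser = argparse.ArgumentParser(description="Example: per-tissue SVM script.")
parser.add_argument("--tissue_num", type=int, required=True,
                    help="Index of the tissue to process")

# Simulate `python 05_01_svms_communities.py --tissue_num 3`.
args = parser.parse_args(["--tissue_num", "3"])
print(args.tissue_num)  # 3
```

Running a script without a required flag makes `argparse` exit with a usage message listing the flags it expects, which is another quick way to discover them.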
The following command is an example of how to run an R script:
$ Rscript --no-save --no-restore --verbose 02_01_correct_confounds.R > outputs/output_02_01.txt 2>&1
The previous command runs the `02_01_correct_confounds.R` script and logs its output in `outputs/output_02_01.txt`.
The scripts for the multilayer modeling approach to TWAS/PrediXcan (CRP in UKB) and Variational Autoencoder model (TCGA) are in this external repository.
To cite our work, we provide the following BibTeX:
@article{Azevedo2021,
doi = {10.1038/s41540-021-00186-6},
url = {https://doi.org/10.1038/s41540-021-00186-6},
year = {2021},
month = may,
publisher = {Springer Science and Business Media {LLC}},
volume = {7},
number = {1},
pages={1--13},
author = {Tiago Azevedo and Giovanna Maria Dimitri and Pietro Li{\'{o}} and Eric R. Gamazon},
title = {Multilayer modelling of the human transcriptome and biological mechanisms of complex diseases and traits},
journal = {npj Systems Biology and Applications}
}
If you run into any problem or have a question, please open an issue.